datasette / datasette-embeddings

Store and query embedding vectors in Datasette tables
Apache License 2.0

Support local models for generating embeddings #15

Open psychemedia opened 5 days ago

psychemedia commented 5 days ago

The datasette-embeddings plugin currently requires hosted OpenAI models and an OpenAI API key to generate embeddings:

    async def calculate_embedding(cls, api_key, text, model):
        # Add dimensions for models called things that end in -xxx digits
        body = {
            "input": text,
            "model": model,
        }
        last_bit = model.split("-")[-1]
        if last_bit.isdigit():
            body["model"] = "-".join(model.split("-")[:-1])
            body["dimensions"] = int(last_bit)

        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.openai.com/v1/embeddings",
                headers={
                    "Content-Type": "application/json",
                    "Authorization": f"Bearer {api_key}",
                },
                json=body,
            )
            response.raise_for_status()
            embedding = response.json()["data"][0]["embedding"]
            return embedding
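
To make the model-name handling in that snippet easier to follow, here is a minimal standalone sketch of just the suffix parsing. The helper name `split_model_and_dimensions` is hypothetical and not part of the plugin, and the check for an empty base name is an added guard that the quoted code does not have.

```python
from typing import Optional, Tuple


def split_model_and_dimensions(model: str) -> Tuple[str, Optional[int]]:
    """Split a trailing all-digit segment off a model name (hypothetical helper).

    Mirrors the quoted logic: if the last "-"-separated segment is all digits,
    it is treated as the requested number of dimensions.
    """
    base, _, last_bit = model.rpartition("-")
    if base and last_bit.isdigit():
        return base, int(last_bit)
    # No numeric suffix: send the model name unchanged, with no "dimensions" key
    return model, None


print(split_model_and_dimensions("text-embedding-3-small-512"))
print(split_model_and_dimensions("text-embedding-3-small"))
```

Any trailing all-digit segment is treated as a dimension count, so with the logic as quoted a name such as `text-embedding-ada-002` would also be split, into `text-embedding-ada` and `2`.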

It would be useful to allow the user to specify a local model without the need for an API key.

This could be done minimally, or by building on the llm package, which supports generating embeddings from local models via the llm-sentence-transformers plugin.
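
As a rough sketch of the llm-based route, something like the following might work. It assumes `llm` and `llm-sentence-transformers` are installed and that the model ID `sentence-transformers/all-MiniLM-L6-v2` is available. The function names and the `embed` parameter are hypothetical and not part of datasette-embeddings.

```python
import asyncio
from typing import Callable, List

# Assumed model ID from the llm-sentence-transformers plugin
DEFAULT_LOCAL_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def embed_with_llm(text: str, model_id: str = DEFAULT_LOCAL_MODEL) -> List[float]:
    """Blocking call that embeds text with a locally installed llm embedding model."""
    import llm  # imported lazily so this module still loads when llm is not installed

    return list(llm.get_embedding_model(model_id).embed(text))


async def calculate_embedding_local(
    text: str,
    model_id: str = DEFAULT_LOCAL_MODEL,
    embed: Callable[[str, str], List[float]] = embed_with_llm,
) -> List[float]:
    """Async wrapper (hypothetical): no API key, no HTTP request.

    The local model runs in a worker thread so that it does not block
    the event loop while the embedding is computed.
    """
    return await asyncio.to_thread(embed, text, model_id)
```

Passing the embedding function in as a parameter keeps the async wrapper independent of any particular backend, so a different local model library could be swapped in without touching the calling code.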

For datasette-lite, it would also be useful to use browser machinery to generate the embeddings with a WASM-packaged model. The anywidget framework provides a way of wrapping JS/WASM packages so that they can be called from Python code running in UIs in VS Code and browser-based environments (Jupyter, marimo). That might be a sensible way of integrating WASM-powered function calls into the datasette-lite environment.

psychemedia commented 4 days ago

Ah, comprehensive docs for working with the llm embeddings tools from Python are here: https://llm.datasette.io/en/stable/embeddings/python-api.html