dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs

LanceDB destination with Ollama embedding #1973

Open zilto opened 3 weeks ago

zilto commented 3 weeks ago

Documentation description

User contacted me with the following question:

How can I swap OpenAI embeddings for an Ollama model (nomic-embed-text)? While Ollama isn't mentioned as supported in the docs, the last bullet point suggests it could work.

As far as I know, the LanceDB destination leverages the LanceDB embedding registry and should therefore support Ollama. The docs could state explicitly whether Ollama is supported (or not) and show how to set it up. In particular, what is the config key for setting the Ollama server URL?
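For reference, LanceDB's registry appears to expose an Ollama function that takes both a model name and a host, which is what makes this seem feasible. A minimal sketch using LanceDB directly, not dlt (assumes a reachable Ollama server with the model pulled):

from lancedb.embeddings import get_registry

# Ollama via LanceDB's own embedding registry, independent of dlt.
embedder = get_registry().get("ollama").create(
    name="nomic-embed-text",        # model served by Ollama
    host="http://localhost:11434",  # Ollama server URL (the default)
)
print(len(embedder.compute_query_embeddings("hello world")[0]))  # 768 for nomic-embed-text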

Are you a dlt user?

Yes, I'm already a dlt user.

Analect commented 3 weeks ago

@Pipboyguy, just adding some colour here pertaining to this Slack thread. I had been in touch with @zilto, since this originally came up as some questions around his blog post.

I wanted to swap out his OpenAI usage for a self-hosted Ollama embedding model and have the embeddings calculated locally.

I tried with these settings:

import os

# configure the LanceDB destination to use Ollama for embeddings
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER"] = "ollama"
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL"] = "nomic-embed-text"

This errored with:

PipelineStepFailed: Pipeline execution failed at stage sync with exception:

<class 'dlt.common.configuration.exceptions.ConfigValueCannotBeCoercedException'>
Configured value for field embedding_model_provider cannot be coerced into type typing.Literal['gemini-text', 'bedrock-text', 'cohere', 'gte-text', 'imagebind', 'instructor', 'open-clip', 'openai', 'sentence-transformers', 'huggingface', 'colbert']

Is that somehow due to "ollama" missing from this list: https://github.com/dlt-hub/dlt/blob/devel/dlt/destinations/impl/lancedb/configuration.py#L50-L62?
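If I understand correctly, the value is validated against that typing.Literal, so anything outside the list fails coercion before the pipeline runs. A stripped-down stand-in to illustrate (not dlt's actual code):

from typing import Literal, get_args

# Abbreviated stand-in for the provider Literal in lancedb/configuration.py;
# a value not in the Literal cannot be coerced, hence the exception above.
EmbeddingModelProvider = Literal["openai", "cohere", "sentence-transformers"]

print("ollama" in get_args(EmbeddingModelProvider))  # False on the released version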

Also, since I'm running Ollama at 192.168.192.3:11434, it would be great to be able to pass an Ollama host rather than always defaulting to localhost:11434.

Perhaps that could be handled with something like os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER_HOST"] = "<some-url>:11434"
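FWIW, LanceDB's own Ollama embedding function already seems to accept a host argument, so presumably the destination would just need to forward such a config value. A purely illustrative sketch of the idea (function and parameter names here are made up, not dlt internals):

from typing import Any, Dict, Optional
from lancedb.embeddings import get_registry

def make_embedder(provider: str, model: str, provider_host: Optional[str] = None):
    # Hypothetical helper: forward an optional host to providers that accept
    # one (LanceDB's ollama function takes `host`).
    kwargs: Dict[str, Any] = {"name": model}
    if provider_host:
        kwargs["host"] = provider_host  # e.g. "http://192.168.192.3:11434"
    return get_registry().get(provider).create(**kwargs)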

akelad commented 2 weeks ago

@Pipboyguy I saw you were having some more discussions on Slack about this - did you already start working on it? We had assigned Rahul to look into it, but if you're already on it I'll unassign him.

Pipboyguy commented 2 weeks ago

Hi @akelad . I haven't started on this. Rahul can have a go!

Pipboyguy commented 3 days ago

@zilto @Analect Turns out the ollama provider has been added on the devel branch; I tested it and it seems to be working just fine. Please try again with:

pip install "dlt[lancedb] @ git+https://github.com/dlt-hub/dlt.git@devel"

Once this PR is merged, you should also be able to specify your Ollama host with:

os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER_HOST"] = "http://192.168.192.3:11434"

Don't forget to include the protocol (http://).
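For reference, a minimal end-to-end sketch once you're on the devel build (the resource and host here are illustrative; lancedb_adapter and its embed argument are from the dlt LanceDB docs):

import os

import dlt
from dlt.destinations.adapters import lancedb_adapter

# Point the LanceDB destination at a self-hosted Ollama server (illustrative host).
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER"] = "ollama"
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL"] = "nomic-embed-text"
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER_HOST"] = "http://192.168.192.3:11434"

@dlt.resource
def documents():
    yield {"id": 1, "content": "hello from dlt"}

pipeline = dlt.pipeline(pipeline_name="docs_to_lancedb", destination="lancedb")
# embed="content" asks the destination to compute embeddings for that column
info = pipeline.run(lancedb_adapter(documents, embed="content"))
print(info)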