dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

LanceDB destination: can't query generated tables #1765

Closed zilto closed 3 weeks ago

zilto commented 1 month ago

dlt version

0.5.3

Describe the problem

Problem

Vector search on a LanceDB table generated by dlt fails when the querying process has only imported lancedb. Here's a simple query that triggers it:

import os
import lancedb

# URI of the LanceDB database that dlt loaded into
dlt_lancedb_uri = os.environ["DESTINATION__LANCEDB__CREDENTIALS__URI"]
lancedb_con = lancedb.connect(dlt_lancedb_uri)
lancedb_table = lancedb_con.open_table("my_table")
# A text query has to be embedded with the table's embedding function
query = lancedb_table.search("My very important question")
results = query.to_list()

Expected behavior

I expect downstream users to import lancedb only and not worry about how the data was ingested. To support that, dlt should adopt a different strategy for embedding function registration. Some options:

  1. Document that dlt.destinations.impl.lancedb.models must be imported before querying data from lancedb.
  2. Have dlt/destinations/impl/lancedb/__init__.py import dlt/destinations/impl/lancedb/models.py. Importing dlt.destinations.lancedb should then be sufficient (I believe?).
  3. Avoid needing a custom PatchedOpenAIEmbeddings and rely on natively supported lancedb functions.
  4. Collaborate with lancedb to modify the stored pyarrow.Schema's metadata to include the required module imports (i.e., the dlt submodule). This would add the embedding function to the LanceDB registry at deserialization, before the function is retrieved from the registry.

This error was painful to debug because nothing points to dlt as the source. Renaming the registered function to openai_dlt_patch would be of great help.

Steps to reproduce

  1. Configure the lancedb destination (credentials, embedding function, etc.)
  2. Use the lancedb destination with the lancedb_adapter to ingest data
  3. In a separate process (script, notebook, REPL, etc.), import lancedb only and access a generated table that has an embed column.
  4. Query the table using lancedb's .search() (vector search)
  5. The query fails with an error saying that "openai_patched" is not in the registry

Fix

  1. In the querying process, import dlt.destinations.impl.lancedb.models
  2. Retry step 4; the query should now work
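For code that may run in environments without dlt installed, the workaround could be wrapped in a small helper. This is only a sketch: `try_register_dlt_embeddings` is a hypothetical name, and it assumes (as described in this issue) that importing the models module is what registers "openai_patched".

```python
import importlib

# Module path taken from this issue; importing it is assumed to register
# the "openai_patched" embedding function as a side effect.
DLT_LANCEDB_MODELS = "dlt.destinations.impl.lancedb.models"


def try_register_dlt_embeddings(module_name: str = DLT_LANCEDB_MODELS) -> bool:
    """Import `module_name` for its registration side effect.

    Returns True if the import succeeded, and False if the module (or one of
    its dependencies) is not installed.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True
```

Calling `try_register_dlt_embeddings()` before opening the table and running `.search()` would apply the fix when dlt is available, and let the caller decide what to do when it returns False.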

Checking the LanceDB embedding function registry manually after that import shows that the "openai_patched" function defined by dlt has been registered.

import lancedb.embeddings.registry as embedding_registry_module

registry = embedding_registry_module.get_registry()
registry._functions  # dictionary of {func_name: func} of type Dict[str, Callable]
registry.get("openai_patched")

dlt code: https://github.com/dlt-hub/dlt/blob/devel/dlt/destinations/impl/lancedb/models.py
lancedb code: https://github.com/lancedb/lancedb/blob/main/python/python/lancedb/embeddings/registry.py

Operating system

Linux

Runtime environment

Local

Python version

3.11

dlt data source

Not relevant

dlt destination

No response

Other deployment details

dlt destination is lancedb

Additional information

No response