chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Querying large number of entries returns #2762

Open MatthewWiens101 opened 3 weeks ago

MatthewWiens101 commented 3 weeks ago

What happened?

Possible duplicate of #1861

When running the following query against a Chroma collection (where query_embeddings is a list of about 20 embeddings, each of length 1024, and n_results is about 8000):

results = collection.query(
    query_embeddings=query_embeddings,
    n_results=n_results,
    include=["metadatas", "distances"],
    where={
        "$and": [
            {
                "timestamp": {
                    "$gte": start_timestamp
                }
            },
            {
                "timestamp": {
                    "$lt": end_timestamp
                }
            },
        ]
    },
)

I get sqlite3.OperationalError: too many SQL variables (screenshot: ChromaDB_error1).

The error goes away if the number of embeddings in the list is reduced to 1. A workaround would be to issue a separate query for each embedding in the list, but this is extremely slow.
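For reference, the per-embedding workaround could look roughly like the sketch below. The helper name query_in_batches and its batch_size parameter are hypothetical and not part of Chroma; the only Chroma call used is collection.query with the same arguments as in the report. With batch_size=1 it issues one query per embedding, which is the slow path described above.

```python
from typing import Any, Dict, List, Optional, Sequence


def query_in_batches(
    collection: Any,
    query_embeddings: Sequence[Sequence[float]],
    n_results: int,
    where: Optional[Dict[str, Any]] = None,
    batch_size: int = 1,  # hypothetical knob; 1 means one query per embedding
) -> Dict[str, List[Any]]:
    """Query a few embeddings at a time and merge the per-embedding results."""
    merged: Dict[str, List[Any]] = {"ids": [], "distances": [], "metadatas": []}
    for start in range(0, len(query_embeddings), batch_size):
        batch = list(query_embeddings[start:start + batch_size])
        result = collection.query(
            query_embeddings=batch,
            n_results=n_results,
            include=["metadatas", "distances"],
            where=where,
        )
        # Each key holds one inner list per query embedding, so extending
        # keeps the results in the same order as the input embeddings.
        for key in merged:
            merged[key].extend(result[key])
    return merged
```

Whether a batch_size larger than 1 stays under the SQLite limit depends on n_results and on the SQLite build, so that value would have to be found by experiment.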

The error occurs in segment.py and is the result of sending this lookup to the database as a single, unchunked query. I tested the same scenario after changing the implementation in segment.py to use chunking, and everything runs fine and is quite fast. I will create a PR with this fix.
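To illustrate the idea (this is not the actual patch), here is a minimal sketch of a chunked lookup using only Python's standard sqlite3 module. It assumes the failing statement binds one variable per ID, in which case 20 queries × 8000 results could need up to 160,000 variables, well above SQLite's default limit (999 before SQLite 3.32.0, 32766 from 3.32.0 onward). The table name records, the helper fetch_by_ids, and the CHUNK_SIZE value are all made up for the example.

```python
import sqlite3
from typing import Iterator, List, Sequence, Tuple

# Assumed batch size, chosen to stay below SQLite's older default limit
# of 999 bound variables per statement.
CHUNK_SIZE = 900


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_by_ids(
    conn: sqlite3.Connection,
    ids: Sequence[str],
    chunk_size: int = CHUNK_SIZE,
) -> List[Tuple[str, str]]:
    """Run one IN (...) query per chunk of IDs and concatenate the rows."""
    rows: List[Tuple[str, str]] = []
    for batch in chunked(ids, chunk_size):
        placeholders = ",".join("?" * len(batch))
        cursor = conn.execute(
            f"SELECT id, payload FROM records WHERE id IN ({placeholders})",
            list(batch),
        )
        rows.extend(cursor.fetchall())
    return rows


# Demo on a hypothetical in-memory table sized like the reported query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, payload TEXT)")
all_ids = [f"id-{i}" for i in range(20 * 8000)]
conn.executemany("INSERT INTO records VALUES (?, ?)", ((i, "x") for i in all_ids))
rows = fetch_by_ids(conn, all_ids)
```

Each statement binds at most CHUNK_SIZE variables, so the number of IDs requested no longer affects whether the query is accepted.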

Versions

Chroma v0.5.5, Python 3.9.2, Debian 11

Relevant log output

No response