chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

Get a list of unique metadata fields #2357

Closed blacksmithop closed 2 weeks ago

blacksmithop commented 2 weeks ago

Let's say I have a metadata field named "Field". I want to fetch one document for each distinct value of Field: if its values were "A", "B", and "C", I would get 3 documents.

Effectively, I want to get a list of unique values for a metadata field without manually iterating over the records. Related SO Thread

import chromadb

client = chromadb.PersistentClient(path=CHROMA_DIR)

collection = client.get_collection("project_store_all")
unique_keys = collection.get(where_document={"$distinct" : "title"}, where={"title": "state_of_the_union"}) #not working as expected
print(unique_keys)

I tried the query above, but it seems I cannot pass a $distinct operator to where_document.

tazarov commented 2 weeks ago

@blacksmithop, Chroma does not (yet) have aggregation functions like those in relational or NoSQL databases. Therefore, you will have to iterate over the collection to build the set of unique values.

Here's some sample code to do that:

import chromadb

client = chromadb.PersistentClient(path="uniq_metadata")

col = client.get_or_create_collection("metadata")

col.upsert(
    ids=["0", "1", "2", "4"],
    documents=["doc 1", "doc 2", "doc 3", "doc without metadata"],
    metadatas=[{"name": "name1", "value": "value1"}, {"name": "name2", "value": "value2"},
               {"name": "name3", "value": "value3"}, None],
)

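# Fetch only the metadata of matching records, then collect the distinct "name" values.
# The `m and` guard skips any record that comes back with no metadata at all.
results = col.get(where={"name": {"$ne": ""}}, include=["metadatas"])
unique_values = {m["name"] for m in results["metadatas"] if m and "name" in m}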
print(unique_values)

You may have to adjust the '{"name":{"$ne":""}}' expression to fit your needs (the one above assumes you are only interested in records whose name field is not an empty string).
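To get from the unique values back to one document per value, as the original question asks, the same iteration can keep the first record seen for each value. This is only a sketch: first_document_per_value is a hypothetical helper, not part of Chroma, and it assumes the collection is small enough for a single get() call. Which record comes "first" depends on the order get() happens to return.

```python
from typing import Any, Dict


def first_document_per_value(collection: Any, field: str) -> Dict[Any, dict]:
    """Map each distinct value of `field` to the first record that carries it.

    Hypothetical helper: `collection` is expected to behave like a Chroma
    collection, i.e. get(include=[...]) returns parallel "ids", "metadatas"
    and "documents" lists.
    """
    result = collection.get(include=["metadatas", "documents"])
    picked: Dict[Any, dict] = {}
    rows = zip(result["ids"], result["metadatas"], result["documents"])
    for doc_id, metadata, document in rows:
        # Skip records without metadata or without the field of interest.
        if not metadata or field not in metadata:
            continue
        # setdefault keeps the first record seen for each value.
        picked.setdefault(metadata[field], {"id": doc_id, "document": document})
    return picked


# Usage with the sample collection above (assumes chromadb is installed):
#   one_per_name = first_document_per_value(col, "name")
```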

Additionally, if your collection is large, you may want to paginate the results using the limit and offset arguments of get().
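A rough sketch of that pagination, for illustration only: unique_metadata_values and its page_size default are hypothetical names, not part of Chroma. It assumes get() accepts where, limit, offset and include, and returns an empty "ids" list once the offset runs past the end of the collection.

```python
from typing import Any, Optional, Set


def unique_metadata_values(
    collection: Any,
    field: str,
    page_size: int = 1000,
    where: Optional[dict] = None,
) -> Set[Any]:
    """Collect the distinct values of one metadata field, one page at a time."""
    unique: Set[Any] = set()
    offset = 0
    while True:
        page = collection.get(
            where=where, limit=page_size, offset=offset, include=["metadatas"]
        )
        ids = page["ids"]
        if not ids:
            # An empty page means every record has been visited.
            break
        for metadata in page["metadatas"] or []:
            # Records without metadata, or without the field, are skipped.
            if metadata and field in metadata:
                unique.add(metadata[field])
        offset += len(ids)
    return unique


# Usage with the sample collection above (assumes chromadb is installed):
#   print(unique_metadata_values(col, "name", page_size=2))
```

Only one page of metadata is held in memory at a time, so the set of unique values is the only thing that grows with the collection.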

blacksmithop commented 2 weeks ago

I see, thank you!