chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

Adding lists to the metadata #227

Open Everminds opened 1 year ago

Everminds commented 1 year ago

Hi, we need to save lists in the metadata. For example, we are saving a Slack message and want the metadata to hold all the users mentioned in the message. We also want search to be able to filter on this field by checking whether some value is in the list (e.g. find all Slack messages in which a specific user was mentioned). It would be great to have support for this. Thanks!

jeffchuber commented 1 year ago

@Everminds hello! Is this information that you store outside of chroma as well? If so, I have another idea for a solution here.

mangate commented 1 year ago

We can save it outside though it would be less convenient

Everminds commented 1 year ago

@jeffchuber any updates on this one?

8rV1n commented 1 year ago

I would vote for this. Direct list support would be very useful: we wouldn't need a third tool to retrieve all the vectors and compare them again.

It would help in scenarios where a doc describes one thing that comes in different versions, models, etc.

jeffchuber commented 1 year ago

@8rV1n you want to be able to pass an allowlist of ids to query, right?

that is underway :) https://github.com/chroma-core/chroma/pull/384

8rV1n commented 1 year ago

> @8rV1n you want to be able to pass an allowlist of ids to query, right?
>
> that is underway :) #384

Thanks @jeffchuber, I guess not just IDs; widening it to metadata would be great!

To clarify it:

I understand this would mean a lot of effort, but see below for how it helps:

An example scenario: say I have a web page that updates rapidly, e.g. weekly. The IDs could be randomly generated UUIDs, but each record has a label giving the week number. With list support, we could narrow the range with a filter such as weeks 20-50.

Similarly, replace the "web page" with "products" on an online shopping site, where we normally filter on many options such as price, category, shipping preference, seller, etc. We want to retrieve similar results by product detail (content) and also filter on the fields we are familiar with, to make the search more efficient.

jeffchuber commented 1 year ago

@8rV1n chroma has this :) though we currently do a bad job communicating it

https://github.com/chroma-core/chroma/blob/a5637002e4599e8b9e78db8e7be0cdb380942673/chromadb/test/test_api.py#L1050

look inside that test folder and you will see examples of all of these. The where filter in get will work with query as well
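
As a rough illustration of the where filter mentioned above, the "weeks 20-50" scenario can be expressed with scalar metadata and range operators. This is a minimal sketch, assuming each record carries an integer "week" metadata field; the helper name `week_range_filter` and the "week" key are made up for this example.

```python
from typing import Any, Dict


def week_range_filter(start: int, end: int, key: str = "week") -> Dict[str, Any]:
    """Build a where filter matching records whose `key` lies in [start, end].

    `key` is a hypothetical integer metadata field chosen for this example.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    return {"$and": [{key: {"$gte": start}}, {key: {"$lte": end}}]}


where = week_range_filter(20, 50)
print(where)

# Hypothetical usage, assuming `collection` is an existing Chroma collection:
# collection.get(where=where)
# collection.query(query_texts=["pricing page"], n_results=5, where=where)
```

This only covers scalar fields; it does not address the original request of storing a list as a metadata value.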

8rV1n commented 1 year ago

> @8rV1n chroma has this :) though we currently do a bad job communicating it
>
> https://github.com/chroma-core/chroma/blob/a5637002e4599e8b9e78db8e7be0cdb380942673/chromadb/test/test_api.py#L1050
>
> look inside that test folder and you will see examples of all of these. The where filter in get will work with query as well

Thanks @jeffchuber!

Any idea for using metadata like this? (adding, and querying)

collection.add(
    documents=["Alice meets rabbits...", "doc2", "doc3", ...],
    metadatas=[{"charactor_roles": ['Alice', 'rabbits']}, {"charactor_roles": ['Steve Jobs', 'Tim Cook']}, {"charactor_roles": []}, ...],
    ids=["id1", "id2", "id3", ...]
)

It seems I can do this for the metadata when creating the collection:

client.create_collection(
    "my_collection", 
    metadata={"foo": ["bar", "bar2"]}
)

pbarker commented 1 year ago

+1, this would be incredibly useful for not needing a secondary datastore just to be able to attach lists to documents

Russell-Pollari commented 1 year ago

Happy to take a stab at this.

If I'm understanding correctly, this would mean adding List as an allowed value in Metadata

-Metadata = Mapping[str, Union[str, int, float]]
+Metadata = Mapping[str, Union[str, int, float, List[Union[str, int, float]]]]

So that lists can be added as a value in metadata:

collection.add(ids=['test'], documents=['test'], metadatas=[{ 'list': [1, 2, 3] }])

The biggest source of uphill work, I think, would be adding support for Lists to the Where filter operators

EDIT: Should we re-use existing operators and make them work for lists? e.g.

collection.get(where={ "list": {  "$eq": 2 } })

or create new operators for lists? e.g.

collection.get(where={"list": { "$contains": 2 } })

jeffchuber commented 1 year ago

@Russell-Pollari yes that is correct!

Where operator support is definitely the biggest lift here.

I think $in and $notin (or the better named version of those) is probably the minimal case...

Russell-Pollari commented 1 year ago

@jeffchuber

IMO $in and $nin imply that I should supply an array to filter against. They would be useful operators for all types.

I think it would be better UX to have $eq and $ne also work with lists (effectively as $contains or $notContains when appropriate)
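
To make the proposed semantics concrete, here is a minimal pure-Python sketch of `$eq` and `$ne` behaving as "contains" and "does not contain" when the stored value is a list. The function names `matches_eq` and `matches_ne` are hypothetical and are not part of Chroma; this only illustrates the behaviour being suggested.

```python
from typing import List, Union

Scalar = Union[str, int, float]
Stored = Union[Scalar, List[Scalar]]


def matches_eq(stored: Stored, value: Scalar) -> bool:
    """$eq: plain equality for scalars, membership test for lists."""
    if isinstance(stored, list):
        return value in stored
    return stored == value


def matches_ne(stored: Stored, value: Scalar) -> bool:
    """$ne: the negation of $eq, so a list matches only if it lacks the value."""
    return not matches_eq(stored, value)


print(matches_eq([1, 2, 3], 2))       # list field: membership
print(matches_eq("Alice", "Alice"))   # scalar field: equality
print(matches_ne(["a", "b"], "c"))    # list field without the value
```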

But I'm definitely pattern matching to MongoDB's query operators here. This is how they do it:

I managed to get a working prototype for filtering arrays with $eq for duckdb:

            # Shortcut for $eq
            if type(value) == str:
                result.append(
                    f""" (
                        json_extract_string(metadata, '$.{key}') = '{value}'
                        OR
                        json_contains(json_extract(metadata, '$.{key}'), '\"{value}\"')
                    )
                    """
                )
            if type(value) == int:
                result.append(
                    f""" (
                        CASE WHEN json_type(json_extract(metadata, '$.{key}')) = 'ARRAY'
                        THEN
                        list_has(CAST(json_extract(metadata, '$.{key}') AS INT[]), {value})
                        ELSE
                        CAST(json_extract(metadata, '$.{key}') AS INT) = {value}
                        END
                    )
                    """
                )
            if type(value) == float:
                result.append(
                    f""" (
                        CASE WHEN json_type(json_extract(metadata, '$.{key}')) = 'ARRAY'
                        THEN
                        list_has(CAST(json_extract(metadata, '$.{key}') AS DOUBLE[]), {value})
                        ELSE
                        CAST(json_extract(metadata, '$.{key}') AS DOUBLE) = {value}
                        END
                    )
                    """
                )

jeffchuber commented 1 year ago

@Russell-Pollari indexing against how mongo does it is definitely a good idea!

@HammadB what do you think?

Russell-Pollari commented 1 year ago

Threw up a PR, let me know what you think!

If my solution works for y'all, happy to also update the JS client and the docs

jeffchuber commented 1 year ago

@Russell-Pollari thanks! will take a look today :)

tyatabe commented 1 year ago

Hey, I'm also interested in using this functionality, I have documents with a bunch of possible tags as metadata, for example

Document(page_content='lorem impsum ...',
metadata={
'id': '5f874c6591bc3f9a540c3722',
'title': 'hello world',
'tags': 'tag1, tag2, tag3, etc'
}
)

If I could use the $contains operator I could filter for specific tags. Right now I'm trying to turn all the tags into binary values, but I think that's breaking chroma somehow
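
The binary-flag workaround described above could look roughly like the sketch below: each tag becomes its own metadata key with the value 1, and a filter on several tags becomes an `$and` of equality checks. The helper names and the `tag_` key prefix are assumptions made for this example, and integer flags are used so that only int metadata values are required.

```python
from typing import Any, Dict, Iterable, List


def split_tags(raw: str) -> List[str]:
    """Split a comma-separated tag string such as 'tag1, tag2' into clean tags."""
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


def tags_to_flags(tags: Iterable[str], prefix: str = "tag_") -> Dict[str, int]:
    """Turn each tag into a metadata key with value 1 (absent tags get no key)."""
    return {f"{prefix}{tag}": 1 for tag in tags}


def where_all_tags(tags: Iterable[str], prefix: str = "tag_") -> Dict[str, Any]:
    """Build a where filter that requires every given tag flag to equal 1."""
    clauses = [{f"{prefix}{tag}": {"$eq": 1}} for tag in tags]
    if not clauses:
        raise ValueError("at least one tag is required")
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


metadata = {"title": "hello world", **tags_to_flags(split_tags("tag1, tag2, tag3"))}
print(metadata)
print(where_all_tags(["tag1", "tag3"]))

# Hypothetical usage, assuming `collection` is an existing Chroma collection:
# collection.get(where=where_all_tags(["tag1", "tag3"]))
```

The trade-off is one metadata key per tag on every document, which can make records large when there are many tags.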

jeffchuber commented 1 year ago

> but I think that's breaking chroma somehow

:( can you share more about what is breaking? this should work. are they true/false or 1/0?

tyatabe commented 1 year ago

Hey, I wasn't sure it could handle booleans or ints, so I ended up turning them into the strings '0'/'1'. The error I got was from clickhouse (I'm using it with a chroma server). I think it was related to the query being too big, as I also have a cloud server where I got a 413 error. I ended up looping over the documents, which solved the issue, so I'm guessing that having so many metadata fields makes the documents too big for clickhouse to handle? (not really sure how it all works though)

jeffchuber commented 1 year ago

@tyatabe gotcha. there was a max_query_size issue people had run into with clickhouse. We are removing clickhouse now and that should fix up this sort of sharp edge.

Russell-Pollari commented 1 year ago

Exploring the new SQLite implementation.

My naive approach would look something like this, with separate tables for int, str, and float list values

     def _insert_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> None:
         """Insert or update each metadata row for a single embedding record"""
-        t = Table("embedding_metadata")
+        t, str_list, int_list, float_list = Tables(
+            "embedding_metadata",
+            "embedding_metadata_string",
+            "embedding_metadata_int",
+            "embedding_metadata_float",
+        )
         q = (
             self._db.querybuilder()
             .into(t)
             .columns(t.id, t.key, t.string_value, t.int_value, t.float_value)
         )
         for key, value in metadata.items():
+            if isinstance(value, list):
+                if isinstance(value[0], str):
+                    for val in value:
+                        q_str = (
+                            self._db.querybuilder()
+                            .into(str_list)
+                            .columns(str_list.metadata_id, str_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
+                if isinstance(value[0], int):
+                    for val in value:
+                        q_int = (
+                            self._db.querybuilder()
+                            .into(int_list)
+                            .columns(int_list.metadata_id, int_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
+                if isinstance(value[0], float):
+                    for val in value:
+                        q_float = (
+                            self._db.querybuilder()
+                            .into(float_list)
+                            .columns(float_list.metadata_id, float_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
             if isinstance(value, str):
                ...
                 q = q.insert(
                     ParameterValue(id),

Does this make sense? @jeffchuber @HammadB

Russell-Pollari commented 1 year ago

Update: got a hacky prototype for list[int]. Should be straightforward to generalize to other types

(branched off of https://github.com/chroma-core/chroma/pull/781 for my working dir)

Migration for new table:

CREATE TABLE embedding_metadata_ints (
    id INTEGER REFERENCES embeddings(id),
    key TEXT REFERENCES embedding_metadata(key),
    int_value INTEGER NOT NULL
);

Inserting metadata with a list (chromadb/segment/impl/metadata/sqlite.py):

    def _insert_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> None:
        """Insert or update each metadata row for a single embedding record"""
        (
            t,
            int_t,
        ) = Tables(
            "embedding_metadata",
            "embedding_metadata_ints",
        )
        q = (
            self._db.querybuilder()
            .into(t)
            .columns(t.id, t.key, t.string_value, t.int_value, t.float_value)
        )
        for key, value in metadata.items():
            if isinstance(value, list):
                q = q.insert(
                    ParameterValue(id),
                    ParameterValue(key),
                    None,
                    None,
                    None,
                )
                if isinstance(value[0], int):
                    q_int = (
                        self._db.querybuilder()
                        .into(int_t)
                        .columns(int_t.id, int_t.key, int_t.int_value)
                    )
                    for val in value:
                        q_int = q_int.insert(
                            ParameterValue(id), ParameterValue(key), ParameterValue(val)
                        )
                    sql, params = get_sql(q_int)
                    sql = sql.replace("INSERT", "INSERT OR REPLACE")
                    if sql:
                        cur.execute(sql, params)

            if isinstance(value, str):
             ...

Querying for list of ints (SqliteMetadataSegment.get_metadata)

    def get_metadata
....
        embeddings_t, metadata_t, fulltext_t, int_t = Tables(
            "embeddings",
            "embedding_metadata",
            "embedding_fulltext",
            "embedding_metadata_ints",
        )

        q = (
            (
                self._db.querybuilder()
                .from_(embeddings_t)
                .left_join(metadata_t)
                .on(embeddings_t.id == metadata_t.id)
                .outer_join(int_t)
                .on((metadata_t.key == int_t.key) & (metadata_t.id == int_t.id))
            )
            .select(
                embeddings_t.id,
                embeddings_t.embedding_id,
                embeddings_t.seq_id,
                metadata_t.key,
                metadata_t.string_value,
                metadata_t.int_value,
                metadata_t.float_value,
                int_t.int_value,
            )

Constructing the metadata object with a list of ints:

    def _record(self, rows: Sequence[Tuple[Any, ...]]) -> MetadataEmbeddingRecord:
        """Given a list of DB rows with the same ID, construct a
        MetadataEmbeddingRecord"""
        _, embedding_id, seq_id = rows[0][:3]
        metadata = {}
        for row in rows:
            key, string_value, int_value, float_value, int_elem = row[3:]
            if string_value is not None:
                metadata[key] = string_value
            elif int_value is not None:
                metadata[key] = int_value
            elif float_value is not None:
                metadata[key] = float_value
            elif int_elem is not None:
                int_list = metadata.get(key, [])
                int_list.append(int_elem)
                metadata[key] = int_list

Also requires updating the relevant types/validators to allow for lists
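
A validator along those lines might look like the following. This is only a sketch under stated assumptions: the name `validate_metadata`, the error messages, and the decisions to reject empty and mixed-type lists are choices made for this example (the prototype above inspects `value[0]`, which would fail on an empty list), not Chroma's actual validation code.

```python
from typing import Any, Dict, List, Mapping, Union

Scalar = Union[str, int, float]
MetadataValue = Union[Scalar, List[Scalar]]

_SCALAR_TYPES = (str, int, float)


def validate_metadata(metadata: Mapping[str, Any]) -> Dict[str, MetadataValue]:
    """Accept str/int/float values and non-empty, single-type lists of them."""
    validated: Dict[str, MetadataValue] = {}
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise ValueError(f"Expected metadata key to be a str, got {key!r}")
        if isinstance(value, list):
            # Assumption: empty lists are rejected because their element type is unknown.
            if not value:
                raise ValueError(f"Expected a non-empty list for key {key!r}")
            element_type = type(value[0])
            if element_type not in _SCALAR_TYPES:
                raise ValueError(f"Unsupported list element type {element_type} for key {key!r}")
            # Assumption: every element must share the type of the first one.
            if any(type(item) is not element_type for item in value):
                raise ValueError(f"Expected all elements of {key!r} to be {element_type}")
        elif type(value) not in _SCALAR_TYPES:
            raise ValueError(f"Unsupported metadata value {value!r} of type {type(value)}")
        validated[key] = value
    return validated


print(validate_metadata({"list": [1, 2, 3], "title": "test"}))
```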

Russell-Pollari commented 1 year ago

Converging on a solution

Initially, I created tables for each allowed list type (int, str, float). It was working but was getting messy.

Ended up using another table with the same schema as embedding_metadata, which let me reuse a lot of existing functions

CREATE TABLE embedding_metadata_lists (
    id INTEGER REFERENCES embeddings(id),
    key TEXT REFERENCES embedding_metadata(key),
    string_value TEXT,
    float_value REAL,
    int_value INTEGER
);

    @override
    def get_metadata(
        self,
        where: Optional[Where] = None,
        where_document: Optional[WhereDocument] = None,
        ids: Optional[Sequence[str]] = None,
        limit: Optional[int] = None,
        offset: Optional[int] = None,
    ) -> Sequence[MetadataEmbeddingRecord]:
        """Query for embedding metadata."""

        embeddings_t, metadata_t, fulltext_t, metadata_list_t = Tables(
            "embeddings",
            "embedding_metadata",
            "embedding_fulltext",
            "embedding_metadata_lists",
        )

        q = (
            (
                self._db.querybuilder()
                .from_(embeddings_t)
                .left_join(metadata_t)
                .on(embeddings_t.id == metadata_t.id)
                .left_outer_join(metadata_list_t)
                .on(
                    (metadata_t.key == metadata_list_t.key)
                    & (embeddings_t.id == metadata_list_t.id)
                )
            )
            .select(
                embeddings_t.id,
                embeddings_t.embedding_id,
                embeddings_t.seq_id,
                metadata_t.key,
                metadata_t.string_value,
                metadata_t.int_value,
                metadata_t.float_value,
                metadata_list_t.string_value,
                metadata_list_t.int_value,
                metadata_list_t.float_value,
            )
            ...

If this approach makes sense, can you assign this issue to me, @jeffchuber? I just about have a shippable PR with tests (old and new) passing.
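
The single list-table idea described above can be tried out with Python's stdlib `sqlite3`. The sketch below is a simplified stand-in, not the actual Chroma schema or code: the `embeddings` table is reduced to two columns, the `embedding_metadata` table is left out, and the helpers `add_record` and `ids_where_list_contains` are hypothetical names used only here.

```python
import sqlite3
from typing import Dict, List, Union

Scalar = Union[str, int, float]

# Maps a Python element type to the column that stores it (one row per list element).
_COLUMN_FOR_TYPE = {str: "string_value", int: "int_value", float: "float_value"}

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE embeddings (
        id INTEGER PRIMARY KEY,
        embedding_id TEXT NOT NULL
    );
    CREATE TABLE embedding_metadata_lists (
        id INTEGER REFERENCES embeddings(id),
        key TEXT NOT NULL,
        string_value TEXT,
        float_value REAL,
        int_value INTEGER
    );
    """
)


def add_record(embedding_id: str, list_metadata: Dict[str, List[Scalar]]) -> None:
    """Insert one embedding row plus one row per list element."""
    cur = conn.execute("INSERT INTO embeddings (embedding_id) VALUES (?)", (embedding_id,))
    row_id = cur.lastrowid
    for key, values in list_metadata.items():
        for value in values:
            column = _COLUMN_FOR_TYPE[type(value)]  # column name comes from a fixed dict
            conn.execute(
                f"INSERT INTO embedding_metadata_lists (id, key, {column}) VALUES (?, ?, ?)",
                (row_id, key, value),
            )


def ids_where_list_contains(key: str, value: Scalar) -> List[str]:
    """Return embedding ids whose list stored under `key` contains `value`."""
    column = _COLUMN_FOR_TYPE[type(value)]
    rows = conn.execute(
        f"""
        SELECT DISTINCT e.embedding_id
        FROM embeddings e
        JOIN embedding_metadata_lists l ON l.id = e.id
        WHERE l.key = ? AND l.{column} = ?
        ORDER BY e.embedding_id
        """,
        (key, value),
    ).fetchall()
    return [row[0] for row in rows]


add_record("doc1", {"tags": ["iot", "business", "machine"]})
add_record("doc2", {"tags": ["iot", "business", "support"]})
add_record("doc3", {"tags": ["iot"], "weeks": [20, 21]})

print(ids_where_list_contains("tags", "business"))
print(ids_where_list_contains("weeks", 21))
```

Storing one row per element keeps membership filters as plain equality predicates, at the cost of an extra join when reading records back.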

Buckler89 commented 1 year ago

Hi @Russell-Pollari, can you explain how those changes will impact the usage of Chroma from a user's point of view?

My use case is the following: each item in the database is tagged using the appropriate key (in my case, "tags"). I would like to pre-filter the query results based also on the tags. Let's say we have 3 documents: the first has tags = [iot, business, machine], the second has tags = [iot, business, support], and the third has tags = [iot].

I would like to pre-filter the data, getting only the items that have, for example, both "iot" and "business" as tags.

Using the already present syntax (using-logical-operators) it could be something like this:

where={
       "$and": [
           {
               "tags": {
                   "$contains": "iot"
               }
           },
           {
               "tags": {
                   "$contains": "business"
               }
           }
       ]
  }

The same applies to the $or operator.
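
To spell out what such a filter would be expected to return, here is a small client-side evaluator in plain Python. It is a sketch of the requested semantics only: `$contains` on a metadata list is the proposed operator from this thread, not an existing one, and the function name `matches` is hypothetical.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Evaluate a where-style filter supporting $and, $or, $eq and list $contains."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            # A bare value is shorthand for {"$eq": value}.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            stored = metadata.get(key)
            for op, expected in condition.items():
                if op == "$contains":
                    if not (isinstance(stored, list) and expected in stored):
                        return False
                elif op == "$eq":
                    if stored != expected:
                        return False
                else:
                    raise ValueError(f"unsupported operator: {op}")
    return True


docs = {
    "first": {"tags": ["iot", "business", "machine"]},
    "second": {"tags": ["iot", "business", "support"]},
    "third": {"tags": ["iot"]},
}
where_and = {"$and": [{"tags": {"$contains": "iot"}}, {"tags": {"$contains": "business"}}]}
where_or = {"$or": [{"tags": {"$contains": "machine"}}, {"tags": {"$contains": "support"}}]}

print([name for name, meta in docs.items() if matches(meta, where_and)])
print([name for name, meta in docs.items() if matches(meta, where_or)])
```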

Russell-Pollari commented 1 year ago

@Buckler89 That's the intended use case for this feature: supporting lists in embedding metadata and allowing users to filter based on those lists. I have a working local branch implementing this.

I'll likely push a PR this week once the Chroma team merges their big SQLite refactor.

jeffchuber commented 1 year ago

needs to integrate fairly tightly with the need to create custom indices...

PeterTF656 commented 9 months ago

Dear all, this issue came back in python 0.4.20. @jeffchuber

collection.add(
    documents=[x["metadata"]["summary"] for x in data],
    embeddings=embeds_2.embeddings,
    metadatas=[x['metadata'] for x in data],
     ids=[x['uid'] for x in data]
)

where data is a list of objects, each shaped like this:

{
        "uid": string,
        "field1": string,
        "field2": string[],
        "metadata": {
            "field1": string[],
            "field2": number[],
            "field4": string,
        }
    },

The error is:

ValueError                                Traceback (most recent call last)
Cell In[107], line 1
----> 1 collection.add(
      2     documents=[x["metadata"]["summary"] for x in data],
      3     embeddings=embeds_2.embeddings,
      4     metadatas=[x['metadata'] for x in data],
      5      ids=[x['uid'] for x in data]
      6 )

File d:\dev2.0\deep-processing\.venv\Lib\site-packages\chromadb\api\models\Collection.py:146, in Collection.add(self, ids, embeddings, metadatas, documents, images, uris)
    104 def add(
    105     self,
    106     ids: OneOrMany[ID],
   (...)
    116     uris: Optional[OneOrMany[URI]] = None,
    117 ) -> None:
    118     """Add embeddings to the data store.
    119     Args:
    120         ids: The ids of the embeddings you wish to add
   (...)
    136 
    137     """
    139     (
    140         ids,
...
    277             f"Expected metadata value to be a str, int, float or bool, got {value} which is a {type(value)}"
    278         )
    279 return metadata

ValueError: Expected metadata value to be a str, int, float or bool, got ['901123200'] which is a <class 'list'>

ivanol55 commented 7 months ago

Is this still on the roadmap? I'm trying to add a collection of "keywords" for each article I am storing and this seems like it'd be needed for that (I could also be architecting this wrong myself...)
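
While list values are rejected with the ValueError shown earlier in this thread, one possible stopgap is to serialize lists to JSON strings before calling `collection.add` and decode them after reading. This is a sketch with hypothetical helper names (`encode_lists`, `decode_lists`); note that a serialized list can only be matched as a whole string by a metadata filter, so it does not provide the per-element filtering requested in this issue.

```python
import json
from typing import Any, Dict, Iterable


def encode_lists(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Replace every list value with its JSON string so only scalars remain."""
    return {
        key: json.dumps(value) if isinstance(value, list) else value
        for key, value in metadata.items()
    }


def decode_lists(metadata: Dict[str, Any], list_keys: Iterable[str]) -> Dict[str, Any]:
    """Parse the JSON strings stored under `list_keys` back into lists."""
    keys = set(list_keys)
    return {
        key: json.loads(value) if key in keys else value
        for key, value in metadata.items()
    }


original = {"field1": ["901123200"], "field2": [1, 2], "field4": "plain string"}
stored = encode_lists(original)
print(stored)
print(decode_lists(stored, ["field1", "field2"]))

# Hypothetical usage, assuming `collection` and `data` as in the report above:
# collection.add(
#     ids=[x["uid"] for x in data],
#     metadatas=[encode_lists(x["metadata"]) for x in data],
#     ...
# )
```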