chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: cohere embedding ValueError #2258

Open yusuf8834 opened 1 month ago

yusuf8834 commented 1 month ago

What happened?

Hi,

My sample code and output are below. I get the error shown in the output when using Cohere embeddings; the same code works normally with sentence transformers.

I would appreciate any help.

Thank you.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="db4")

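# Works: sentence-transformers embedding function
em = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/distiluse-base-multilingual-cased-v1")

# Fails with the ValueError below: Cohere embedding function
# (this assignment replaces the one above)
em = embedding_functions.CohereEmbeddingFunction(
    api_key="---",
    model_name="embed-multilingual-v3.0")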

collection = client.get_or_create_collection(name="abdurrahim",
                                             embedding_function=em)

# df_dict is a list of dicts built elsewhere; its construction is not shown in this report
collection.add(documents=[i["Sual_unchanged"] for i in df_dict[:100]],
               metadatas=[{
                   'Kitab': item['Kitab'],
                   'Cevab': item['Cevab']
               } for item in df_dict[:100]],
               ids=[str(i["index"]) for i in df_dict[:100]])

Versions

chroma ver. 5, Python 3.11 / 3.10, Windows 11

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 collection.add(documents=[i["Sual_unchanged"] for i in df_dict[:100]],
      2                metadatas=[{
      3                    'Kitab': item['Kitab'],
      4                    'Cevab': item['Cevab']
      5                } for item in df_dict[:100]],
      6                ids=[str(i["index"]) for i in df_dict[:100]])

File c:\Users\ysf\Desktop\Gits\Fetva_NLP\Fetva_NLP_Django\.venv\Lib\site-packages\chromadb\api\models\Collection.py:154, in Collection.add(self, ids, embeddings, metadatas, documents, images, uris)
    151 if embeddings is None:
    152     # At this point, we know that one of documents or images are provided from the validation above
    153     if documents is not None:
--> 154         embeddings = self._embed(input=documents)
    155     elif images is not None:
    156         embeddings = self._embed(input=images)

File c:\Users\ysf\Desktop\Gits\Fetva_NLP\Fetva_NLP_Django\.venv\Lib\site-packages\chromadb\api\models\Collection.py:633, in Collection._embed(self, input)
    628 if self._embedding_function is None:
    629     raise ValueError(
    630         "You must provide an embedding function to compute embeddings."
    631         "https://docs.trychroma.com/embeddings"
    632     )
--> 633 return self._embedding_function(input=input)

File c:\Users\ysf\Desktop\Gits\Fetva_NLP\Fetva_NLP_Django\.venv\Lib\site-packages\chromadb\api\types.py:194, in EmbeddingFunction.__init_subclass__.<locals>.__call__(self, input)
    192 def __call__(self: EmbeddingFunction[D], input: D) -> Embeddings:
    193     result = call(self, input)
--> 194     return validate_embeddings(maybe_cast_one_to_many_embedding(result))

File c:\Users\ysf\Desktop\Gits\Fetva_NLP\Fetva_NLP_Django\.venv\Lib\site-packages\chromadb\api\types.py:488, in validate_embeddings(embeddings)
    484     raise ValueError(
    485         f"Expected embeddings to be a list with at least one item, got {len(embeddings)} embeddings"
    486     )
    487 if not all([isinstance(e, list) for e in embeddings]):
--> 488     raise ValueError(
    489         "Expected each embedding in the embeddings to be a list, got "
    490         f"{list(set([type(e).__name__ for e in embeddings]))}"
    491     )
    492 for i, embedding in enumerate(embeddings):
    493     if len(embedding) == 0:

ValueError: Expected each embedding in the embeddings to be a list, got ['tuple']

tazarov commented 1 month ago

@yusuf8834, this appears to be an issue with the EF. I'll have a look.

Would you mind telling me which version of the cohere library you are using? You can check with: pip list | grep cohere
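
Since the report is from Windows, where grep may not be available, the version can also be read from Python. This is only a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions relevant to this issue (None means "not installed").
print("cohere:", installed_version("cohere"))
print("chromadb:", installed_version("chromadb"))
```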

yusuf8834 commented 1 month ago

cohere 5.5.3

thanks

anantguptadbl commented 1 month ago

I was able to reproduce this error and am working on a fix. The output of the Cohere embed function needs some post-processing.
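
For context, the check that fails in the traceback can be reproduced in isolation. The sketch below mirrors only the isinstance test visible in the log (lines 487 to 491 of types.py); it is not the full chromadb validate_embeddings function, the function name is hypothetical, and the tuple used here is made-up data.

```python
from typing import Any, Sequence


def check_each_embedding_is_list(embeddings: Sequence[Any]) -> None:
    """Raise ValueError unless every item is a list (same test as in the traceback)."""
    if not all(isinstance(e, list) for e in embeddings):
        type_names = list({type(e).__name__ for e in embeddings})
        raise ValueError(
            "Expected each embedding in the embeddings to be a list, got "
            f"{type_names}"
        )


# A list of lists passes silently.
check_each_embedding_is_list([[0.1, 0.2], [0.3, 0.4]])

# Any tuple in the outer list triggers the error reported above.
try:
    check_each_embedding_is_list([(0.1, 0.2)])
except ValueError as err:
    print(err)
```

So whatever the embedding function returns has to be turned into plain lists before it reaches this check.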

tazarov commented 1 month ago

@yusuf8834, indeed it does. Cohere has updated how they generate their client, and I now see that there are two possible outputs for embeddings: one with floats and another keyed by type. The latter does not work well with the Cohere EF:

https://github.com/cohere-ai/cohere-python/blob/457f5d7f2014e5ed0886e7901f8b21bcf65a6895/src/cohere/types/embed_by_type_response.py#L13

We'll fix it shortly.
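
Until a fix lands, handling the two response shapes described above might look roughly like this. It is a sketch under assumptions, not the actual chromadb fix: the classes are stand-ins so the example runs without the cohere package, and the attribute name float_ on the by-type object is an assumption that should be checked against the linked file.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ByTypeEmbeddings:
    """Stand-in for a by-type embeddings object (attribute name is an assumption)."""
    float_: Optional[List[List[float]]] = None


@dataclass
class FakeEmbedResponse:
    """Stand-in for an embed response; `embeddings` is either shape."""
    embeddings: Any


def extract_float_embeddings(response: Any) -> List[List[float]]:
    """Return embeddings as a list of float lists for either response shape."""
    embeddings = response.embeddings
    if isinstance(embeddings, list):
        # Shape 1: embeddings are already a list of float vectors.
        vectors = embeddings
    else:
        # Shape 2: embeddings are grouped by type; take the float ones.
        vectors = getattr(embeddings, "float_", None)
        if vectors is None:
            raise ValueError("response carries no float embeddings")
    # Coerce every vector to a plain list of floats.
    return [[float(x) for x in vec] for vec in vectors]


print(extract_float_embeddings(FakeEmbedResponse([[1, 2], [3, 4]])))
print(extract_float_embeddings(FakeEmbedResponse(ByTypeEmbeddings(float_=[[0.5, 0.25]]))))
```

Both calls return a list of lists, which is the form the validation in the traceback expects.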

anantguptadbl commented 1 month ago

@tazarov, the Cohere response structure changed as of major version 5.x.x. Will chromadb continue to support older versions of cohere (<= 4.x)?

tazarov commented 1 month ago

@anantguptadbl, fixed with backward compatibility.

yusuf8834 commented 1 month ago

Thank you for the fix. How can I use this version without waiting for the next chromadb release?

tazarov commented 1 month ago

@yusuf8834, you can run:

pip install git+https://github.com/chroma-core/chroma.git@main