chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Getting a Value Error when using the HuggingFace embedding function #2422

Open tomersagi opened 3 days ago

tomersagi commented 3 days ago

What happened?

Hi, I am trying to use a custom embedding model via the HuggingFace API. I am following the instructions from here

However, when I try to use the embedding function I get the following error:

Traceback (most recent call last):
  File "C:\Users\OT48ZK\AppData\Local\Programs\PyCharm Professional\plugins\python\helpers-pro\pydevd_asyncio\pydevd_asyncio_utils.py", line 117, in _exec_async_code
    result = func()
             ^^^^^^
  File "<input>", line 1, in <module>
  File "C:\Users\OT48ZK\PycharmProjects\retrieval-er\venv\Lib\site-packages\chromadb\api\types.py", line 198, in __call__
    return validate_embeddings(maybe_cast_one_to_many_embedding(result))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\OT48ZK\PycharmProjects\retrieval-er\venv\Lib\site-packages\chromadb\api\types.py", line 507, in validate_embeddings
    raise ValueError(
ValueError: Expected each value in the embedding to be a int or float, got an embedding with ['list'] - [[[0.21432682871818542, -0.11559132486581802, ...

Minimal example:

import chromadb
import chromadb.utils.embedding_functions as emb

chroma_client = chromadb.PersistentClient(path='mehdie.db')
huggingface_ef = emb.HuggingFaceEmbeddingFunction(model_name='google-bert/bert-base-multilingual-cased', api_key='hf_...')

val = huggingface_ef(['Washington'])

Versions

Chroma 0.5.3 Python 3.11

Relevant log output

(Same traceback as above.)
tomersagi commented 2 days ago

ok, I understand the problem now. The embedding model I am using returns a k x F tensor, where k is the number of tokens in the query phrase and F is the number of features. Chroma's HuggingFace embedding function expects a 1 x F vector only. To solve it, I had to subclass the embedding function and add a mean pooling step.

Perhaps the documentation and error message can be improved here to describe the types of models this embedding function supports.
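For reference, the pooling step described above can be sketched roughly like this (the class name and wrapper structure are hypothetical, not Chroma's actual API; the assumption is that the wrapped function returns one k x F token-embedding matrix per input text):

```python
def mean_pool(token_embeddings):
    """Average a k x F list of per-token vectors into one F-dimensional vector."""
    k = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / k for i in range(dim)]


class MeanPooledEmbeddingFunction:  # hypothetical wrapper, for illustration only
    """Wraps an embedding function that returns token-level (k x F) output
    and pools it down to the single vector per document that Chroma expects."""

    def __init__(self, inner_ef):
        self._inner = inner_ef

    def __call__(self, texts):
        raw = self._inner(texts)  # assumed: one k x F matrix per input text
        return [mean_pool(matrix) for matrix in raw]
```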

tazarov commented 2 days ago

@tomersagi, you are right that the naming is a bit misleading. Under the hood, we use sentence-transformers. Technically, it also works with plain transformer models, in which case it defaults to mean pooling without normalization.

We can do better by letting the user know that the model they are loading is not a sentence-transformer one, which may produce unsupported output.
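One way such a check might look (a hedged sketch, not Chroma's implementation: the function name and error wording are made up) is a shape pre-check that turns the generic validation error into an actionable message:

```python
def check_embedding_shape(embeddings):
    """Hypothetical pre-validation: detect a model that returned token-level
    (k x F) output instead of one flat vector per input, and explain why."""
    for emb in embeddings:
        # A valid embedding is a flat list of numbers; a nested list means
        # the model emitted one vector per token rather than per document.
        if emb and isinstance(emb[0], (list, tuple)):
            raise ValueError(
                "Got a 2-D (tokens x features) embedding for an input. The model "
                "is likely a plain transformer rather than a sentence-transformers "
                "model; apply a pooling step (e.g. mean pooling) or choose a "
                "sentence-transformers model."
            )
```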