chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Install issue]: Own Embedding Function #1496

Open CeeArEx opened 10 months ago

CeeArEx commented 10 months ago

What happened?

I am trying to use my own embedding function. This is what I have:

from chromadb import Documents, EmbeddingFunction, Embeddings
from typing_extensions import Literal, TypedDict, Protocol
from typing import Optional, Sequence, Union, TypeVar, List, Dict, Any, Tuple, cast

Embeddable = Union[Documents]
D = TypeVar("D", bound=Embeddable, contravariant=True)

class EmbeddingFunction(Protocol[D]):
    def __call__(self, input: D) -> Embeddings:
        embeddings = [1,2,3]
        return embeddings

collection = client.create_collection(name="testing" , embedding_function=EmbeddingFunction)

But I always get this error:

ValueError: Expected EmbeddingFunction.__call__ to have the following signature: odict_keys(['self', 'input']), got odict_keys(['self', 'args', 'kwargs'])
Please see https://docs.trychroma.com/embeddings for details of the EmbeddingFunction interface.
Please note the recent change to the EmbeddingFunction interface: https://docs.trychroma.com/migration#migration-to-0416---november-7-2023

I looked at the migration page, but it didn't help: https://docs.trychroma.com/migration

Maybe someone can help me. I searched a lot, reinstalled, and checked the version, but nothing worked for me.

Versions

chromadb = 0.4.18

Relevant log output

No response

HammadB commented 10 months ago

Can you share the code for your embedding function? Likely the signature is just wrong - we can help debug.

CeeArEx commented 10 months ago

I just put a request inside it, because I can send the text to my server and it returns the embedding vectors.

It looks like this:

url = 'localhost:1234'
myobj = {'text': input}

x = requests.post(url, json = myobj)

input is my text, and I just want to return x (my numbers).

I'm relatively new to this field, sorry about that.

HammadB commented 10 months ago

No worries. Could you share the Python code you are extending EmbeddingFunction with?

IronSpiderMan commented 10 months ago

You are missing the `()`. It should be:

`collection = client.create_collection(name="testing", embedding_function=EmbeddingFunction())`
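The error message in the original report lists parameter names, which suggests that the check looks at `__call__` on the class of whatever object is passed in. The following is a minimal sketch of that idea using only the standard `inspect` module. `MyEmbedder` and `call_params` are hypothetical names, and this is an illustration, not chromadb's actual code. It shows why passing the class instead of an instance produces `['self', 'args', 'kwargs']`: the class of a class is `type`, whose `__call__` accepts `*args, **kwargs`.

```python
import inspect
from typing import List


class MyEmbedder:
    """Hypothetical stand-in for a custom embedding function."""

    def __call__(self, input: List[str]) -> List[List[float]]:
        # Placeholder: one fixed vector per input document.
        return [[1.0, 2.0, 3.0] for _ in input]


def call_params(obj: object) -> List[str]:
    """Return the parameter names of __call__ as defined on obj's class."""
    return list(inspect.signature(type(obj).__call__).parameters)


# Passing an instance: type(obj) is MyEmbedder, so its own __call__ is inspected.
instance_params = call_params(MyEmbedder())  # ['self', 'input']

# Passing the class itself: type(obj) is `type`, whose __call__ is generic.
class_params = call_params(MyEmbedder)  # ['self', 'args', 'kwargs']
```

The same kind of check would also explain why the parameter has to be named `input` exactly: a method declared as `__call__(self, texts)` would report `['self', 'texts']` and fail the comparison.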

CeeArEx commented 10 months ago

If I do this:

... = client.create_collection(name='testing', embedding_function=EmbeddingFunction())

I get this error:

TypeError: Protocols cannot be instantiated
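This `TypeError` comes from Python's `typing` module rather than from chromadb. The class in the original report lists `Protocol[D]` directly as its base, which makes it a protocol class, and protocol classes only describe an interface. A small sketch with standard-library typing only (`Embedder` and `ConstantEmbedder` are hypothetical names) shows the difference between a protocol class and a concrete subclass of one:

```python
from typing import List, Optional, Protocol


class Embedder(Protocol):
    """A protocol class: it only describes the required interface."""

    def __call__(self, input: List[str]) -> List[List[float]]:
        ...


class ConstantEmbedder(Embedder):
    """Concrete subclass: Protocol is not listed directly among its bases."""

    def __call__(self, input: List[str]) -> List[List[float]]:
        # Placeholder: one fixed vector per input document.
        return [[1.0, 2.0, 3.0] for _ in input]


protocol_error: Optional[str] = None
try:
    Embedder()  # instantiating the protocol class itself fails
except TypeError as exc:
    protocol_error = str(exc)

# The concrete subclass can be instantiated and called normally.
vectors = ConstantEmbedder()(["a", "b"])
```

So the class passed to `create_collection` should inherit from the protocol, not redeclare it with `Protocol[...]` in its own bases.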

CeeArEx commented 10 months ago

Replying to HammadB's question about the code: it is just the line I mentioned above:

embeddings = requests.post("localhost:1234", json = input)

I run this (see the link below) in the background and just use requests to send my text to its endpoint.

https://github.com/abetlen/llama-cpp-python

IronSpiderMan commented 10 months ago

You need to extend EmbeddingFunction, like this:

import chromadb
from chromadb import Documents, Embeddings, EmbeddingFunction
from typing import TypeVar, Union

Embeddable = Union[Documents]
D = TypeVar("D", bound=Embeddable, contravariant=True)

class CustomEmbeddingFunction(EmbeddingFunction):
    # The method must be named __call__ and take a parameter named input.
    def __call__(self, input: D) -> Embeddings:
        # Placeholder values; replace with the real embeddings of input.
        embeddings = [1, 2, 3]
        return embeddings

client = chromadb.Client()
collection = client.create_collection(name="testing", embedding_function=CustomEmbeddingFunction())

Also, your custom embedding function should use a different name, so that it does not shadow the imported EmbeddingFunction.
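For the server-backed case described earlier in the thread, the body of `__call__` could look roughly like the sketch below. It is written against the standard library so it runs without a server or chromadb installed. `RemoteEmbeddingFunction`, `post_json`, and `fake_post` are hypothetical names, and the reply format (a flat JSON list of numbers per text) is an assumption, since the thread does not show what the server returns. With chromadb, the class would additionally list `EmbeddingFunction` as its base, as in the snippet above. Note that the URL needs an explicit scheme such as `http://`.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List

# Hypothetical transport type: takes a URL and a JSON payload, returns decoded JSON.
PostJson = Callable[[str, Dict[str, Any]], Any]


def post_json(url: str, payload: Dict[str, Any]) -> Any:
    """POST the payload as JSON and decode the JSON reply (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


class RemoteEmbeddingFunction:
    """Callable with the (self, input) signature that asks a server for vectors."""

    def __init__(self, url: str, post: PostJson = post_json) -> None:
        self.url = url    # must include the scheme, e.g. "http://localhost:1234"
        self.post = post  # injectable so the class can be tried without a server

    def __call__(self, input: List[str]) -> List[List[float]]:
        vectors: List[List[float]] = []
        for text in input:
            # ASSUMPTION: the server answers {"text": ...} with a flat list of numbers.
            reply = self.post(self.url, {"text": text})
            vectors.append([float(x) for x in reply])
        return vectors  # one vector per input document


def fake_post(url: str, payload: Dict[str, Any]) -> Any:
    """Fake transport for a dry run: encodes the text length as a 2-d vector."""
    return [float(len(payload["text"])), 0.0]


embed = RemoteEmbeddingFunction("http://localhost:1234", post=fake_post)
vectors = embed(["hello", "hi"])
```

The important part is the return value: a list containing one vector per input document, not the raw response object.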

CeeArEx commented 10 months ago

That seems to work. Thank you very much. :)

dinonovak commented 10 months ago

I created my own embedding function as suggested above:

# Fix for huggingface embeddings and chroma version

ImageDType = Union[np.uint, np.int, np.float]
Image = NDArray[ImageDType]
Images = List[Image]
Embeddable = Union[Documents, Images]

D = TypeVar("D", bound=Embeddable, contravariant=True)

class CustomEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: D) -> Embeddings:
        embeddings = HuggingFaceEmbeddings(model_name="bert-base-multilingual-uncased")
        return embeddings

I am assigning it to the collection:

chroma_collection = chroma_client.get_or_create_collection(name=f"BasicRag", embedding_function=CustomEmbeddingFunction())

I am adding entries to the db:

chroma_collection.add(documents=products_list, ids=ids)

I can see that they are added, but when I try to search I only get an empty result. What am I doing wrong?

query = "products for heavy duty use"

embeddings = HuggingFaceEmbeddings(model_name="bert-base-multilingual-uncased")

text_embeddings = embeddings.embed_query(query)

results = chroma_collection.query(
    query_embeddings=[text_embeddings],
    n_results=10,
    include=["documents"],
)

IronSpiderMan commented 10 months ago

Your embedding function is wrong: your __call__ method returns the embedding model itself. It should return the embeddings of the input.

By the way, you shouldn't create the embedding model inside the __call__ method. That reloads the model on every call and wastes resources.

Here is an example:

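import torch
from chromadb import Documents, EmbeddingFunction, Embeddings
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

class VitEmbeddingFunction(EmbeddingFunction):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Load the processor and model once, not on every call.
        self.processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
        self.model = ViTModel.from_pretrained("google/vit-base-patch16-224")

    def __call__(self, input: Documents) -> Embeddings:
        # Each document is treated as a path to an image file.
        images = [Image.open(path) for path in input]
        inputs = self.processor(images, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
            last_hidden_state = outputs.last_hidden_state
        # Use the [CLS] token embedding as the image vector.
        return last_hidden_state[:, 0, :].numpy().tolist()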
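The same pattern applies to the text case asked about above: build the model once, keep it on the instance, and return the vectors for `input`, not the model. Below is a minimal sketch that runs without any ML dependencies. `WrappedModelEmbeddingFunction`, `SupportsEmbedDocuments`, and `FakeModel` are hypothetical names. The sketch assumes the wrapped object exposes an `embed_documents(texts)` method returning one vector per text, as LangChain embedding classes such as `HuggingFaceEmbeddings` are expected to. With chromadb, the wrapper would also inherit from `EmbeddingFunction`.

```python
from typing import List, Protocol


class SupportsEmbedDocuments(Protocol):
    """Anything offering embed_documents(texts) -> one vector per text."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        ...


class WrappedModelEmbeddingFunction:
    """Holds an already-built model and returns its vectors for the input."""

    def __init__(self, model: SupportsEmbedDocuments) -> None:
        self.model = model  # created once by the caller, reused for every call

    def __call__(self, input: List[str]) -> List[List[float]]:
        # Return the embeddings of the input, never the model object itself.
        return self.model.embed_documents(list(input))


class FakeModel:
    """Stand-in model for a dry run; counts how often it is asked to embed."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.calls += 1
        return [[float(len(text)), 1.0] for text in texts]


model = FakeModel()
embed = WrappedModelEmbeddingFunction(model)
doc_vectors = embed(["heavy duty drill", "glove"])
```

With a real model in place of `FakeModel`, the query side could pass the same wrapper's output for the query text, so that documents and queries are embedded by the same model instance.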

IronSpiderMan commented 10 months ago

Note that in the newest version of chromadb, the parameter of the __call__ method must be named input.