[Bug]: Too strict type requirement on EmbeddingFunction

LumenYoung commented 11 months ago

What happened?

Dear team on Chroma,

Strictly speaking, this is not a bug rather the constraint chroma imposed. However, I find that chroma has a too strict EmbeddingFunction typing which is not suitable for my usecase and it prevents me from doing the proper thing.

In short, chroma only allow a single D type variable (either image or Docuement) to be pasted to the EmbeddingFunction. However, I'm using both image and text to create a joint embedding with llava model. So this restriction makes it very hard to use the chroma vector store.

The following is my definition for the embedding function, which makes total sense if you were trying to generate an embedding from multiple modalities.

class LlavaEmbedding(EmbeddingFunction):
    def __call__(
        self, texts: Documents, images: Optional[List[Dict[str, str]]] = None
    ) -> Embeddings:
        # embed the documents somehow

        if images is None:
            embeddings = [get_embedding_from_llava(text)[0] for text in texts]

        assert len(texts) == len(images), "text and image length must be equal"
        embeddings = []
        for text, imgs in zip(texts, images):
            embeddings.append(get_embedding_from_llava(text, file=imgs))

        return embeddings

But the typing is too strict for this kind of proper usage:

class EmbeddingFunction(Protocol[D]):
    def __call__(self, input: D) -> Embeddings:
        ...

def validate_embedding_function(
    embedding_function: EmbeddingFunction[Embeddable],
) -> None:
    function_signature = signature(
        embedding_function.__class__.__call__
    ).parameters.keys()
    protocol_signature = signature(EmbeddingFunction.__call__).parameters.keys()

    if not function_signature == protocol_signature:
        raise ValueError(
            f"Expected EmbeddingFunction.__call__ to have the following signature: {protocol_signature}, got {function_signature}\n"
            "Please see https://docs.trychroma.com/embeddings for details of the EmbeddingFunction interface.\n"
            "Please note the recent change to the EmbeddingFunction interface: https://docs.trychroma.com/migration#migration-to-0416---november-7-2023 \n"
        )

I would suggest allowing additional keyword arguments to be pasted to EmbeddingFunction. The signature would be

class EmbeddingFunction(Protocol[D]):
    def __call__(self, input: D, **kwargs) -> Embeddings:

Versions

Version: 0.4.17

Relevant log output

No response

atroyn commented 11 months ago

Hi @LumenYoung, thanks for pointing this out. Our initial implementation of multi-modal embeddings does indeed assume a single datatype at a time.

While the proposed change might work for the embedding function itself, it's also important to note thatCollection.add, Collection.query and other methods calling the embedding function also expect only one of documents or images, not both at the same time.

There is some complexity around what we do when both kinds of data come in, and what a DataLoader should do in this case. We would like to support this type of joint embedding - we would welcome your contribution.

LumenYoung commented 11 months ago

Hi @LumenYoung, thanks for pointing this out. Our initial implementation of multi-modal embeddings does indeed assume a single datatype at a time.

While the proposed change might work for the embedding function itself, it's also important to note thatCollection.add, Collection.query and other methods calling the embedding function also expect only one of documents or images, not both at the same time.

There is some complexity around what we do when both kinds of data come in, and what a DataLoader should do in this case. We would like to support this type of joint embedding - we would welcome your contribution.

Thanks for the reply. Yes, it would be a large modification if several signatures need to be modified. Another way to workaround this is to loosen the scope for D, allowing user to pass Dictionary into it. This is also a possible workaround. The query doesn't need to be changed in this case I think since only tricky part of multimodal embedding is only creating that single embedding, this is the only gateway should be loosen since the creation process might involve different modalities, the query could be solve with metadata attached, which is what I'm intending to do.

I wonder if this way is more suitable modification to propose?

atroyn commented 11 months ago

I think we would want to carefully consider any approach here since it would touch a lot of the API surface.

Could you expand on your proposal with some example code?

LumenYoung commented 11 months ago

I think we would want to carefully consider any approach here since it would touch a lot of the API surface.

Could you expand on your proposal with some example code?

Sure. I'll add my example later today when I got time. I'm also playing with Chroma to have better understanding on the implication of such interface change. I would surely like to contribute to the chroma project.

tazarov commented 11 months ago

@LumenYoung, what do you think about supplying text and images in a tuple? Then your D can be a tuple, and you work with a list of tuples. There are a couple of advantages of doing this:

Your text and image are consistently linked together via the tuple
We don't have to modify the signatures at all

@atroyn, what about the idea of adding embedding function modality support validators? This will be similar to how MIME types work in browsers, where each EF wrapper will provide a list of modalities it can support, e.g. text, text-image, text-audio, etc. When users try to embed using the model, we will check that provided inputs match at least one, if not all (depending on the model), of the supported modalities.

chroma-core / chroma