Open LumenYoung opened 11 months ago
Hi @LumenYoung, thanks for pointing this out. Our initial implementation of multi-modal embeddings does indeed assume a single datatype at a time.
While the proposed change might work for the embedding function itself, it's also important to note that Collection.add, Collection.query and other methods calling the embedding function also expect only one of documents or images, not both at the same time.
There is some complexity around what we do when both kinds of data come in, and what a DataLoader should do in this case. We would like to support this type of joint embedding - we would welcome your contribution.
Thanks for the reply. Yes, it would be a large modification if several signatures need to be changed. Another way to work around this is to loosen the type bound on D so that users can pass a dictionary into the embedding function. I don't think query needs to change in this case: the only tricky part of multimodal embedding is creating the single joint embedding, so the embedding function is the only entry point that needs to be loosened, since creating the embedding may involve several modalities. Querying can be handled with attached metadata, which is what I intend to do.
I wonder whether this would be a more suitable modification to propose?
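To make the dictionary idea concrete, here is a minimal sketch in plain Python. It does not use Chroma's actual types: MultiModalItem, JointEmbeddingFunction and split_for_storage are hypothetical names, the "model" is a toy placeholder standing in for something like LLaVA, and the comma-joined metadata string assumes metadata values have to be simple scalars.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

# Hypothetical loosened input type: one dict per item, keyed by modality.
MultiModalItem = Dict[str, Union[str, Sequence[float]]]
Embedding = List[float]


class JointEmbeddingFunction:
    """Toy stand-in for a joint text+image model (not a Chroma class)."""

    def __call__(self, input: List[MultiModalItem]) -> List[Embedding]:
        embeddings: List[Embedding] = []
        for item in input:
            text = item.get("text") or ""
            image = item.get("image") or []
            # Placeholder features: two derived from the text, one from the image.
            text_feats = [float(len(text)), float(sum(map(ord, text)) % 97)]
            image_feat = float(sum(image)) / len(image) if image else 0.0
            # One embedding per item, regardless of how many modalities it has.
            embeddings.append(text_feats + [image_feat])
        return embeddings


def split_for_storage(item: MultiModalItem) -> Tuple[Optional[str], Dict[str, str]]:
    """Return the text to store as the document plus metadata naming the modalities."""
    metadata = {"modalities": ",".join(sorted(item))}
    return item.get("text"), metadata


if __name__ == "__main__":
    ef = JointEmbeddingFunction()
    items = [{"text": "a red car", "image": [0.1, 0.5, 0.9]}, {"text": "text only"}]
    print(ef(items))
    print(split_for_storage(items[0]))
```

With this shape only the embedding function sees the mixed input; the stored document and metadata stay in the form the rest of the API already handles, and queries can filter on the modalities metadata.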
I think we would want to carefully consider any approach here since it would touch a lot of the API surface.
Could you expand on your proposal with some example code?
Sure. I'll add my example later today when I have time. I'm also experimenting with Chroma to better understand the implications of such an interface change. I would certainly like to contribute to the Chroma project.
@LumenYoung, what do you think about supplying text and images in a tuple? Then your D can be a tuple, and you work with a list of tuples. There are a couple of advantages of doing this:
@atroyn, what about the idea of adding embedding function modality support validators? This will be similar to how MIME types work in browsers, where each EF wrapper will provide a list of modalities it can support, e.g. text, text-image, text-audio, etc. When users try to embed using the model, we will check that provided inputs match at least one, if not all (depending on the model), of the supported modalities.
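A rough sketch of how such a validator could look, in plain Python. Nothing here is existing Chroma API: ModalityError, provided_modalities, validate_modalities and LLAVA_LIKE_SUPPORTED are hypothetical names, and the supported combinations are made up for illustration.

```python
from typing import Dict, FrozenSet, Set

Modalities = FrozenSet[str]


class ModalityError(ValueError):
    """Raised when an input's modalities are not supported by the embedding function."""


def provided_modalities(item: Dict[str, object]) -> Modalities:
    """Collect the modalities for which the item actually carries a value."""
    return frozenset(name for name, value in item.items() if value is not None)


def validate_modalities(item: Dict[str, object], supported: Set[Modalities]) -> Modalities:
    """Check that the item matches one of the combinations the model declares."""
    provided = provided_modalities(item)
    if provided not in supported:
        wanted = sorted("+".join(sorted(combo)) for combo in supported)
        got = "+".join(sorted(provided)) or "nothing"
        raise ModalityError(f"got {got}, supported combinations: {wanted}")
    return provided


# Hypothetical declaration for a model that accepts text alone or text with an image.
LLAVA_LIKE_SUPPORTED: Set[Modalities] = {
    frozenset({"text"}),
    frozenset({"text", "image"}),
}

if __name__ == "__main__":
    print(validate_modalities({"text": "a red car", "image": b"..."}, LLAVA_LIKE_SUPPORTED))
    try:
        validate_modalities({"image": b"..."}, LLAVA_LIKE_SUPPORTED)
    except ModalityError as err:
        print("rejected:", err)
```

Each embedding function wrapper would publish its own supported set, and the check would run before any model call, so an unsupported combination fails early with a readable message.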
What happened?
Dear Chroma team,
Strictly speaking, this is not a bug but rather a constraint that Chroma imposes. However, I find that the typing of Chroma's EmbeddingFunction is too strict for my use case, and it prevents me from doing what I need.
In short, Chroma only allows a single D type variable (either Image or Document) to be passed to the EmbeddingFunction. However, I'm using both image and text to create a joint embedding with a LLaVA model, so this restriction makes it very hard to use the Chroma vector store.
The following is my definition for the embedding function, which makes total sense if you were trying to generate an embedding from multiple modalities.
But the typing is too strict for this kind of proper usage:
I would suggest allowing additional keyword arguments to be passed to EmbeddingFunction. The signature would be
Versions
Version: 0.4.17
Relevant log output
No response