AnswerDotAI / byaldi

Use late-interaction multi-modal models such as ColPali in just a few lines of code.
Apache License 2.0

Document classification #13

Closed sandorkonya closed 2 days ago

sandorkonya commented 1 week ago

Thank you for this repo!

I saw the text-based RAG example, where a text query returns the page most likely to contain the answer.

Is it possible to use the system for a document classification task where the query is not text but a (page of) another document? For example, with a simple vector similarity of their embeddings? How can we access the embeddings of the pages themselves for this?

Regards

sandorkonya commented 1 week ago

The solution, in case someone else is looking for this:

from pdf2image import convert_from_path
from byaldi.colpali import ColPaliModel

# Load the ColPali checkpoint used by Byaldi.
model = ColPaliModel.from_pretrained("vidore/colpali-v1.2", device="cuda", verbose=1)

# Render the PDF pages as PIL images.
pdf_pages = convert_from_path("./docs/attention.pdf", dpi=300)

# Embed the first page; the result is a multi-vector (late-interaction) embedding.
encoded_result = model.encode_image(input_data=pdf_pages[0])

encoded_result.shape
# --> torch.Size([1, 1030, 128])
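
To compare two pages with these embeddings: each page embedding is a set of token vectors rather than a single vector, so a natural comparison is the same late-interaction (MaxSim) scoring ColPali uses for retrieval. Below is a rough, untested sketch of that comparison; maxsim_score is a hypothetical helper, not part of Byaldi, and it assumes the model and pdf_pages defined above.

import torch

def maxsim_score(a: torch.Tensor, b: torch.Tensor) -> float:
    # Late-interaction score between two multi-vector embeddings of shape (1, n_tokens, dim):
    # for each token vector in a, take the max cosine similarity over b's token vectors, then sum.
    a = torch.nn.functional.normalize(a.squeeze(0).float(), dim=-1)
    b = torch.nn.functional.normalize(b.squeeze(0).float(), dim=-1)
    return (a @ b.T).max(dim=1).values.sum().item()

page_a = model.encode_image(input_data=pdf_pages[0])
page_b = model.encode_image(input_data=pdf_pages[1])
similarity = maxsim_score(page_a, page_b)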

LarsAC commented 1 week ago

So is the 1030x128 tensor an embedding, then? Have you successfully clustered or classified documents via similarity on those?

bclavie commented 2 days ago

Hey,

I'm not sure late interaction is very strong as a classifier, or at least not sure it's any stronger than just using the base VLM (in this case, PaliGemma) to perform classification. But yes, in terms of using Byaldi, there are helper functions encode_image and encode_query (currently undocumented, as they're fairly untested) to get the raw embeddings out of the models.
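
For anyone who wants to try classification on top of those raw embeddings, here is a rough, untested sketch of a nearest-class approach using late-interaction (MaxSim) scoring, assuming encode_image returns a (1, n_tokens, 128) tensor as shown earlier in the thread; maxsim, class_pages, class_embeddings, classify_page, and the example file paths are all hypothetical, not part of Byaldi.

import torch
from pdf2image import convert_from_path
from byaldi.colpali import ColPaliModel

model = ColPaliModel.from_pretrained("vidore/colpali-v1.2", device="cuda", verbose=1)

def maxsim(a: torch.Tensor, b: torch.Tensor) -> float:
    # Late-interaction score between two multi-vector page embeddings of shape (1, n_tokens, dim).
    a = torch.nn.functional.normalize(a.squeeze(0).float(), dim=-1)
    b = torch.nn.functional.normalize(b.squeeze(0).float(), dim=-1)
    return (a @ b.T).max(dim=1).values.sum().item()

# One labelled example page per class (hypothetical paths).
class_pages = {
    "invoice": convert_from_path("./docs/example_invoice.pdf", dpi=300)[0],
    "report": convert_from_path("./docs/example_report.pdf", dpi=300)[0],
}
class_embeddings = {
    label: model.encode_image(input_data=page) for label, page in class_pages.items()
}

def classify_page(page_image) -> str:
    # Assign the label whose example page scores highest against the new page.
    emb = model.encode_image(input_data=page_image)
    return max(class_embeddings, key=lambda label: maxsim(emb, class_embeddings[label]))

new_page = convert_from_path("./docs/attention.pdf", dpi=300)[0]
print(classify_page(new_page))

Whether this beats simply classifying with the base VLM is an open question, as noted above.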