Closed · sandorkonya closed this 2 days ago
The solution, if someone finds this:

```python
from pdf2image import convert_from_path
from byaldi.colpali import ColPaliModel

# load the ColPali checkpoint onto the GPU
model = ColPaliModel.from_pretrained("vidore/colpali-v1.2", device="cuda", verbose=1)

# render each PDF page as a PIL image
pdf_pages = convert_from_path("./docs/attention.pdf", dpi=300)

# embed the first page; returns a multi-vector (late-interaction) embedding
encoded_result = ColPaliModel.encode_image(model, input_data=pdf_pages[0])
encoded_result.shape
# --> torch.Size([1, 1030, 128])
```
So is the 1030×128 tensor an embedding, then? Have you successfully clustered or classified documents via similarity on those?
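For reference, ColBERT-style models like ColPali compare such `[n_tokens, 128]` multi-vector embeddings with a late-interaction (MaxSim) score rather than a single dot product. A minimal sketch with stand-in random tensors (the shapes mirror the snippet above; real embeddings would come from `encode_image`):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    # query_emb: [n_query_tokens, dim], doc_emb: [n_doc_tokens, dim]
    # L2-normalise so the dot products are cosine similarities
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    d = torch.nn.functional.normalize(doc_emb, dim=-1)
    # [n_query_tokens, n_doc_tokens] token-level similarity matrix
    sim = q @ d.T
    # each query token keeps its best-matching document token; sum the maxima
    return sim.max(dim=1).values.sum().item()

# stand-in page embeddings with the shape from the snippet ([1030, 128] per page)
page_a = torch.randn(1030, 128)
page_b = torch.randn(1030, 128)
print(maxsim_score(page_a, page_b))
```

Note the score is not symmetric and scales with the number of query tokens, so for page-vs-page comparison you may want to normalise by token count.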
Hey,
I'm not sure late interaction is very strong as a classifier, or at least, I'm not sure it's any stronger than just using the base VLM (in this case, PaliGemma) to perform classification. But yes, in terms of using Byaldi, there are helper functions `encode_image` and `encode_query` (currently undocumented, as they're fairly untested) to get raw embeddings out of the models.
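One pragmatic way to use those raw embeddings for clustering or nearest-neighbour classification (my own suggestion, not something Byaldi provides) is to mean-pool each page's `[n_tokens, 128]` multi-vector embedding into a single 128-d vector and compare pages with cosine similarity. A sketch with stand-in tensors in place of real `encode_image` outputs:

```python
import torch

def pool_page(page_emb: torch.Tensor) -> torch.Tensor:
    # collapse a [n_tokens, dim] multi-vector embedding to one [dim] vector
    pooled = page_emb.mean(dim=0)
    # normalise so dot products below are cosine similarities
    return torch.nn.functional.normalize(pooled, dim=-1)

# stand-in embeddings for three pages (real ones would come from encode_image)
pages = [torch.randn(1030, 128) for _ in range(3)]
vecs = torch.stack([pool_page(p) for p in pages])  # [3, 128]

# pairwise cosine-similarity matrix, usable for kNN classification or clustering
sim_matrix = vecs @ vecs.T
print(sim_matrix)
```

Mean-pooling discards the token-level matching that makes late interaction strong for retrieval, but it gives you a single fixed-size vector per page that standard clustering tools can consume.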
Thank you for this repo!
I saw the text-based RAG example, where a text query returns the page most likely to contain the answer.
Is it possible to use the system for a document classification task where the query is not text but a page of another document? For example, with a simple vector similarity of their embeddings? How can we access the embeddings of the pages themselves for this?
Regards