Closed hustnn closed 2 months ago
Hey, could you give a bit more information? What document did you index (the Vidore test set or the ColPali paper), with what model (v1.2 or the original checkpoint)?
hello @bclavie,
I am indexing the ColPali paper, which can be found at https://arxiv.org/pdf/2407.01449.
This is the code for indexing and search; is it the original checkpoint? You can try it. I guess page 18 has some text similar to the question, but the images on page 15 and page 1 should contain the answer to the question.
from byaldi import RAGMultiModalModel
# Optionally, you can specify an `index_root`, which is where it'll save the index. It defaults to ".byaldi/".
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali")
RAG.index(
input_path="/content/ColPali.pdf",
index_name="image_index", # index will be saved at index_root/index_name/
store_collection_with_index=False,
overwrite=True
)
text_query = "Which hour of the day had the highest overall electricity generation in 2019?"
results = RAG.search(text_query, k=5)
Oh I see, thank you for clarifying!
The example on page 1 is very noisy (since it's overlaid with the score map), so I'm not too surprised it only scores 3rd, given page 15 has a much cleaner page answering the same question. Page 18 coming up also kind of makes sense, since the phrasing of the questions in the examples there is very similar, so it'll score highly just from the textual overlap. The scores for those 3 pages are also quite a bit higher than for the other results, so the ranking seems reasonable.
To produce Figure 1, they performed the search on the actual full document from ViDoRe, which doesn't have the added overlay, so it'd quite naturally score higher.
I think you could try the ColPali 1.2 checkpoint, which is more robust, but the behaviour seems quite okay to me. (In a production pipeline, you'd want to pipe in more than just the 1st result unless you're working in an extremely narrow domain, so the LLM would still get the relevant context here.)
Hi @bclavie ,
Yes, I like ColPali and the overall results are quite good. Providing more pages to the VLM should help it find the correct answer.
I am curious how to improve it for the case of page 18. The textual overlap should come from the query shown below. The year in my question ("Which hour of the day had the highest overall electricity generation in 2019?") is 2019, not the 2030 on page 18.
Sadly there isn't too much to be said about this -- there's enough similarity/overlap that the model scores page 18 quite highly regardless, especially as neural models tend to be pretty poor at handling digits: the embeddings for 2030 and 2019 are likely to be very similar, so the year alone won't be a huge discriminator.
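To make this concrete, here is a toy sketch of ColPali-style late-interaction (MaxSim) scoring in plain Python. The 2-d "embeddings" below are made up for illustration: the vectors standing in for the tokens "2019" and "2030" are deliberately placed close together, mimicking how neural encoders often embed digit strings near one another. The point is that swapping one near-identical token barely moves the overall page score.

```python
# Toy illustration of late-interaction (MaxSim) scoring as used by ColPali:
# the score of a page is the sum, over query token embeddings, of the best
# dot product against any of the page's token embeddings.

def maxsim_score(query_embs, page_embs):
    return sum(
        max(sum(q * p for q, p in zip(q_emb, p_emb)) for p_emb in page_embs)
        for q_emb in query_embs
    )

# Hypothetical 2-d unit-ish vectors (illustrative only, not real embeddings).
query = [
    (1.0, 0.0),   # stands in for "electricity"
    (0.6, 0.8),   # stands in for "2019"
]
page_2019 = [(1.0, 0.0), (0.6, 0.8)]    # page that actually mentions 2019
page_2030 = [(1.0, 0.0), (0.58, 0.81)]  # page mentioning 2030 instead

s_correct = maxsim_score(query, page_2019)
s_wrong = maxsim_score(query, page_2030)
# s_correct and s_wrong come out nearly identical, so the year token
# contributes almost nothing to ranking one page over the other.
```

With realistic embeddings the effect is the same in kind: if "2019" and "2030" land close together in embedding space, the MaxSim terms they contribute are nearly equal, and the ranking is dominated by the rest of the query's overlap with the page.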
Models are rapidly improving though, so this kind of confusion might well go away after a few more iterations! After all, ColPali is "just" the very first checkpoint of this new generation of multi-modal late-interaction retrievers.
@bclavie Thanks for your reply. I am doing some research on it and hope I can make some contributions during the iterations!
I directly used the question in Figure 1 of the ColPali paper.
The highest-scoring page in the results is page 18 -- any guess as to the reason?