AnswerDotAI / byaldi

Use late-interaction multi-modal models such as ColPali in just a few lines of code.
Apache License 2.0

One example of Figure 1 in ColPali paper #18

Closed · hustnn closed this 2 months ago

hustnn commented 2 months ago

I directly used the question from Figure 1 of the ColPali paper.

[screenshot]

The highest-scoring result is page 18. Any guess as to the reason?

[screenshot]
bclavie commented 2 months ago

Hey, could you give a bit more information? What document did you index (the Vidore test set or the ColPali paper), with what model (v1.2 or the original checkpoint)?

hustnn commented 2 months ago

hello @bclavie,

I am indexing the ColPali paper, which can be found at https://arxiv.org/pdf/2407.01449.

Here is the code for indexing and search. Is this the original checkpoint? You can try it yourself; my guess is that page 18 contains text similar to the question, but the figures on page 15 and page 1 should contain the answer.


```python
from byaldi import RAGMultiModalModel

# Optionally, you can specify an `index_root`, which is where it'll save the index. It defaults to ".byaldi/".
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali")

RAG.index(
    input_path="/content/ColPali.pdf",
    index_name="image_index",  # the index will be saved at index_root/index_name/
    store_collection_with_index=False,
    overwrite=True
)

text_query = "Which hour of the day had the highest overall electricity generation in 2019?"
results = RAG.search(text_query, k=5)
```
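
As a minimal sketch, the ranked pages and their scores can be inspected like this (assuming each result exposes `page_num` and `score`, as byaldi's `Result` objects do in recent versions):

```python
# Print the rank, page number, and late-interaction score of each hit.
for rank, result in enumerate(results, start=1):
    print(f"rank {rank}: page {result.page_num} (score {result.score:.2f})")
```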
bclavie commented 2 months ago

Oh I see, thank you for clarifying!

The example on page 1 is very noisy (since it's overlaid with the score map), so I'm not too surprised it only scores 3rd, given that page 15 has a much cleaner version answering the same question. Page 18 coming up also kind of makes sense, since the phrasing of the questions in the examples there is very similar, so it scores highly on textual overlap alone. The scores for those 3 pages are also quite a bit higher than for the other results, so the ranking seems reasonable overall.

To produce Figure 1, they performed the search on the actual full document from ViDoRe, which doesn't have the added overlay, so it'd quite naturally score higher.

I think you could try the ColPali v1.2 checkpoint, which is more robust, but the behaviour seems quite okay to me. (In a production pipeline, you'd want to pass more than just the 1st result to the downstream model unless you're working in an extremely narrow domain, so the LLM would still get the relevant context here; see the sketch below.)
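
As a rough sketch of that pattern (assuming `pdf2image` for page rendering and a hypothetical `answer_with_images` helper standing in for whatever VLM you call; neither is part of byaldi, and `page_num` is assumed to be 1-indexed):

```python
from pdf2image import convert_from_path

# Render the PDF once so any page image can be looked up by number.
# (With store_collection_with_index=True, byaldi would return base64
# page images directly and this step would be unnecessary.)
pages = convert_from_path("/content/ColPali.pdf")

# Feed all top-k retrieved pages to the VLM, not just the top-1 hit.
top_k_images = [pages[r.page_num - 1] for r in results]
# answer = answer_with_images(text_query, top_k_images)  # hypothetical VLM call
```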

hustnn commented 2 months ago

Hi @bclavie, yes, I like ColPali and the overall results are quite good. Providing more pages to the VLM should help it find the correct answer. I am curious how to improve the page 18 case, though. The textual overlap should come from the query shown below: the year in my question ("Which hour of the day had the highest overall electricity generation in 2019?") is 2019, whereas page 18 uses 2030.

[screenshot]
bclavie commented 2 months ago

Sadly there isn't too much to be said about this: there's enough similarity/overlap that the model ends up scoring page 18 quite highly, especially as neural models tend to be pretty poor at handling digits. The embeddings for 2030 and 2019 are likely to be very similar, so the year won't be a huge discriminator.
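
One quick way to see this effect is to embed the two queries with a generic text encoder and compare them (a sketch using sentence-transformers rather than ColPali's own encoder):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
q_2019 = "Which hour of the day had the highest overall electricity generation in 2019?"
q_2030 = "Which hour of the day had the highest overall electricity generation in 2030?"

# Swapping the year barely moves the embedding, so the cosine similarity
# will typically come out very close to 1.0: the digits are a weak signal.
emb = model.encode([q_2019, q_2030])
print(util.cos_sim(emb[0], emb[1]).item())
```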

Models are rapidly improving though, so this kind of confusion might well go away after a few more iterations! After all, ColPali is "just" the very first checkpoint of this new generation of multi-modal late-interaction retrievers.

hustnn commented 2 months ago

@bclavie Thanks for your reply. I am doing some research on this and hope I can make some contributions to those iterations!

  1. One idea is whether existing model fine-tuning approaches can be applied to the new generation of retrievers.
  2. Another idea comes from a systems view: in a distributed setting, can existing approaches scale?