AnswerDotAI / byaldi

Use late-interaction multi-modal models such as ColPali in just a few lines of code.
Apache License 2.0

Ordering of input files nondeterministic, which can assign incorrect doc id, metadata #3

Open dimroc opened 2 weeks ago

dimroc commented 2 weeks ago

for i, item in enumerate(list(input_path.iterdir())): can return the files in arbitrary order. I'm not sure what the order is on my local machine, but it isn't lexical. What ordering are you expecting?

As a result, we don't know what order to pass the doc_ids and metadata in. Let's either settle on a predictable ordering or map filenames to ids/metadata, so that nothing depends on matching 0...N indices by position.
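A minimal sketch of both options, not byaldi's actual code: sorting the directory listing gives a reproducible order, and keying metadata by filename removes the dependence on position. The helper names ordered_files and align_metadata, and the sample filenames, are hypothetical.

```python
import tempfile
from pathlib import Path


def ordered_files(input_path: Path) -> list[Path]:
    """Return the files in input_path sorted by name, so the order is reproducible."""
    # Path.iterdir() yields entries in arbitrary order; sorting makes it predictable.
    return sorted((p for p in input_path.iterdir() if p.is_file()), key=lambda p: p.name)


def align_metadata(files: list[Path], metadata_by_name: dict[str, dict]) -> list[dict]:
    """Build a metadata list aligned with `files`, looked up by filename (hypothetical helper)."""
    missing = [f.name for f in files if f.name not in metadata_by_name]
    if missing:
        # Fail loudly rather than silently attaching metadata to the wrong document.
        raise KeyError(f"no metadata for: {missing}")
    return [metadata_by_name[f.name] for f in files]


# Demo on a throwaway directory with made-up filenames.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.pdf", "a.pdf", "c.pdf"):
        (root / name).touch()

    files = ordered_files(root)
    metadata = align_metadata(
        files,
        {"a.pdf": {"year": 2021}, "b.pdf": {"year": 2022}, "c.pdf": {"year": 2023}},
    )
    file_names = [f.name for f in files]
    doc_ids = list(range(len(files)))  # ids now follow the sorted order
```

With the listing sorted, the i-th doc id always refers to the same file across runs, and the filename lookup means the caller never has to guess that order in the first place.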

I have some bandwidth, so once we have a direction, I can implement it if you'd like.

Snippet:

https://github.com/AnswerDotAI/byaldi/blob/427a859cfac3010c3e51451458506a19e203a8b7/byaldi/colpali.py#L278-L289

bclavie commented 2 weeks ago

Hey, thank you for raising, well spotted!

This has actually been on my to-do in some form, and the way I've been thinking about handling this is:

I'm not sure I'll have the bandwidth to do it today (I'd love to though), so if you're interested in taking this on and think you can do it quickly, you're more than welcome!

bclavie commented 2 weeks ago

In the meantime, I'm releasing 0.0.2, which, among other minor updates, makes index(), add_to_index() and the newly added get_doc_ids_to_file_names() return a dict mapping {doc_id: filename}. That makes it easy to pass the right context to your LLM, though it doesn't fix the metadata issue.
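As a rough illustration of using that mapping, the sketch below assumes only the {doc_id: filename} shape described above. The dict literal stands in for the returned value, and metadata_by_name and the filenames are made up for the example.

```python
# Stand-in for the {doc_id: filename} dict described above (values are invented).
doc_ids_to_file_names = {0: "report_q2.pdf", 1: "invoice.pdf", 2: "report_q1.pdf"}

# Hypothetical metadata kept by the caller, keyed by filename rather than by position.
metadata_by_name = {
    "invoice.pdf": {"type": "invoice"},
    "report_q1.pdf": {"type": "report", "quarter": 1},
    "report_q2.pdf": {"type": "report", "quarter": 2},
}

# Join on filename so each doc_id gets the right metadata,
# regardless of the order in which the files were indexed.
metadata_by_doc_id = {
    doc_id: metadata_by_name[name]
    for doc_id, name in doc_ids_to_file_names.items()
}
```

If the real values turn out to be full paths rather than bare filenames, the same join should work after reducing each value with pathlib.Path(name).name.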