bclavie / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0
2.45k stars 173 forks

Queries with multiple relevant documents #187

Open salbatarni opened 2 months ago

salbatarni commented 2 months ago

Hey, I see RAGatouille handles many different forms of pairs, but I don't see an example for a query with multiple positive documents. Is it like (query, [list of relevant documents])?

bclavie commented 2 months ago

Hey! Do you mean during training (e.g. you have multiple positives per query and want to use all of them for training)?

If so, RAGatouille goes with the most common IR pattern which is that each (query, relevant) pair should be independent, so if you have, say, 5 documents as positives for the same query, you'd create 5 pairs (or triplets), each of them containing just one of the positives and the query.
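As a rough illustration of that flattening step, here is a minimal sketch in plain Python. The `expand_to_pairs` helper and the sample queries and documents are hypothetical, made up for this example and not part of RAGatouille; the sketch only assumes your data starts out as a mapping from each query to its list of positives.

```python
from typing import Dict, List, Tuple


def expand_to_pairs(query_to_positives: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten {query: [positives]} into independent (query, positive) pairs.

    Hypothetical helper for illustration; not a RAGatouille function.
    """
    pairs = []
    for query, positives in query_to_positives.items():
        for doc in positives:
            # One pair per positive: the query is simply repeated.
            pairs.append((query, doc))
    return pairs


# Made-up example data: one query with three positives, one with a single positive.
data = {
    "What is ColBERT?": ["doc A", "doc B", "doc C"],
    "What is RAG?": ["doc D"],
}

pairs = expand_to_pairs(data)
for query, doc in pairs:
    print(query, "->", doc)
```

The resulting list has one entry per (query, positive) combination, which is the shape of pair data described above.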

salbatarni commented 2 months ago

Okay, great! I was wondering how this is handled in prepare_training_data. So far I am passing the pairs as in the second tutorial. In the tutorial, the pairs do not contain query IDs, so how is this handled? I am worried that when sampling negative documents, positive documents for the same query will be sampled. Is there anything I am missing?
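To make the concern concrete, here is a minimal sketch of one way to guard against it outside the library. It makes no claim about what prepare_training_data does internally. The `build_triplets` helper, the toy corpus, and the use of uniform random sampling (rather than hard-negative mining) are all assumptions for illustration: positives are grouped by query text, and any document that is a known positive for a query is excluded from that query's negative candidates.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def build_triplets(
    pairs: List[Tuple[str, str]],
    corpus: List[str],
    num_negatives: int = 2,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Build (query, positive, negative) triplets without sampling known positives.

    Hypothetical helper for illustration; not a RAGatouille function.
    """
    rng = random.Random(seed)

    # Group positives by query text, since the pairs carry no query IDs.
    positives_by_query = defaultdict(set)
    for query, positive in pairs:
        positives_by_query[query].add(positive)

    triplets = []
    for query, positive in pairs:
        # Exclude every known positive for this query, not just the current one.
        candidates = [doc for doc in corpus if doc not in positives_by_query[query]]
        k = min(num_negatives, len(candidates))
        for negative in rng.sample(candidates, k):
            triplets.append((query, positive, negative))
    return triplets


# Made-up example data.
corpus = ["doc A", "doc B", "doc C", "doc D", "doc E"]
pairs = [
    ("What is ColBERT?", "doc A"),
    ("What is ColBERT?", "doc B"),
    ("What is RAG?", "doc D"),
]

triplets = build_triplets(pairs, corpus, num_negatives=2)
for query, positive, negative in triplets:
    print(query, "|", positive, "|", negative)
```

Grouping by the query string only works if identical queries are written identically in every pair; with paraphrased or differently cased queries, an explicit query ID would be needed.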

salbatarni commented 2 months ago

@bclavie 👀