Closed hist0613 closed 4 years ago
Good catch! I calculated QTR on each query-document pair in MS MARCO, rather than aggregating over all queries relevant to a passage. But I do not think this has a large effect, because the majority of passages have only one relevant query in the MS MARCO passage ranking dataset.
Stopwords are removed from QTR. Terms are also stemmed before matching. As can be seen in the readme.md example:
`{"query": "what kind of animals are in grasslands", "term_recall": {"grassland": 1, "animals": 1}, "doc": {"position": "1", "id": "4083643", "title": "Tropical grassland animals (which do not all occur in the same area) include giraffes, zebras, buffaloes, kangaroos, mice, moles, gophers, ground squirrels, snakes, worms, termites, beetles, lions, leopards, hyenas, and elephants."}}`
Term recall for "in" is 0 because it is a stopword. Term recall for "grassland" is 1 because its plural form "grasslands" appears in the query.
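To make the matching concrete, here is a minimal sketch of how such term recall values could be computed. It is not the repository's script: the stopword list, the `toy_stem` plural stripper, and the function names are all hypothetical stand-ins, and it assumes QTR is the fraction of a passage's relevant queries that contain the (stemmed) passage term.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"what", "of", "are", "in", "the", "a", "an", "which",
             "do", "not", "all", "and", "is", "to"}

def toy_stem(token):
    # Crude plural stripping; a stand-in for a real stemmer.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def tokenize(text):
    # Lowercase and keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def content_stems(text):
    # Stems of all non-stopword tokens in the text.
    return {toy_stem(t) for t in tokenize(text) if t not in STOPWORDS}

def query_term_recall(queries, passage):
    # For each non-stopword passage term, the fraction of the given
    # queries whose stems include the term's stem. Terms that match
    # no query are omitted, as in the README example.
    query_stems = [content_stems(q) for q in queries]
    recall = {}
    for token in tokenize(passage):
        if token in STOPWORDS or token in recall:
            continue
        hits = sum(toy_stem(token) in stems for stems in query_stems)
        if hits:
            recall[token] = hits / len(queries)
    return recall
```

Calling `query_term_recall` with a single query reproduces the per-pair setting described above, where every matched term gets recall 1. Passing several queries for the same passage gives fractional values, which is the multi-query case raised in the question.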
UPDATE: I have uploaded the QTR generation scripts: scripts/get_training_query_term_recall.py
Thanks to that, it becomes clear!
According to your paper, the example query in the README ("what kind of animals are in grasslands") should have a term_recall of 1 for "in", i.e. QTR("in", d) = 1 or TR("in", q) = 1. Am I right, or is a stopword list applied?
As far as I can tell, there are no examples with QTR(t, d) other than 1 in the provided myalltrain.relevant.docterm_recall, even though the MS MARCO dataset contains many one-to-many matchings. If I'm correct, is the pre-computed QTR not provided in this repository?