AdeDZY / DeepCT

DeepCT and HDCT use BERT to generate novel, context-aware bag-of-words term weights for documents and queries.
BSD 3-Clause "New" or "Revised" License

Does myalltrain.relevant.docterm_recall provide QTR(t, d)? #6

Closed hist0613 closed 4 years ago

hist0613 commented 4 years ago

Following the definition in your paper, the example query in the README ("what kind of animals are in grasslands") should have a term recall of 1 for "in", i.e. QTR("in", d) = 1 or TR("in", q) = 1. Am I right, or is a stopword list applied?

As far as I can tell, the provided myalltrain.relevant.docterm_recall contains no examples where QTR(t, d) takes a value other than 1, even though there are many one-to-many matchings in the MS MARCO dataset. If that is correct, is the precomputed QTR not provided in this repository?

AdeDZY commented 4 years ago

Good catch! I calculated QTR on each query-document pair on MS MARCO, rather than on multiple queries. But I do not think it will have a huge effect, because the majority of passages only have one relevant query in the MS MARCO passage ranking dataset.
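To make the difference concrete, here is a minimal sketch of query term recall computed over all queries relevant to one passage, i.e. the fraction of those queries that contain each term. The function name `query_term_recall`, the whitespace tokenization, and the second query are assumptions made for illustration; this is not the code used to build the released file. With a single relevant query every value collapses to 1, which matches what the per-pair file contains.

```python
from collections import Counter

def query_term_recall(relevant_queries):
    """Fraction of a passage's relevant queries that contain each term.

    Hypothetical helper: whitespace tokenization only, no stopword
    removal or stemming.
    """
    if not relevant_queries:
        return {}
    counts = Counter()
    for query in relevant_queries:
        # Count each term once per query, however often it repeats.
        counts.update(set(query.lower().split()))
    total = len(relevant_queries)
    return {term: n / total for term, n in counts.items()}

# One relevant query: every query term gets recall 1.
single = query_term_recall(["grassland animals"])

# Two relevant queries (the second is made up): shared terms keep
# recall 1, terms that occur in only one query drop to 0.5.
multi = query_term_recall(["grassland animals", "animals in africa"])
```

Since most MS MARCO passages have a single relevant query, the two computations agree for the majority of the training data.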

Stopwords are removed from QTR, and the remaining terms are stemmed before matching. As can be seen in the readme.md example:

{"query": "what kind of animals are in grasslands", "term_recall": {"grassland": 1, "animals": 1}, "doc": {"position": "1", "id": "4083643", "title": "Tropical grassland animals (which do not all occur in the same area) include giraffes, zebras, buffaloes, kangaroos, mice, moles, gophers, ground squirrels, snakes, worms, termites, beetles, lions, leopards, hyenas, and elephants."}}

Term recall for "in" is 0 because it is a stopword. Term recall for "grassland" is 1 because its plural form "grasslands" appears in the query.
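The matching described here can be sketched roughly as follows. Everything in this block is an assumption made for illustration: the tiny `STOPWORDS` set, the `naive_stem` plural stripper (a stand-in for a real stemmer), the regex tokenizer, and the abridged passage text. It is not the repository's script, only a sketch that reproduces the README example by comparing stems while reporting the passage's surface forms.

```python
import re

# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"what", "of", "are", "in", "the", "do", "not", "all", "which", "and"}

def naive_stem(token):
    # Stand-in for a real stemmer: strip one trailing "s" from longer tokens.
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

def tokenize(text):
    # Lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def pair_term_recall(query, passage):
    """Mark passage terms whose stem also occurs in the query (per-pair QTR)."""
    query_stems = {naive_stem(t) for t in tokenize(query) if t not in STOPWORDS}
    recall = {}
    for token in tokenize(passage):
        # Stopwords are skipped, so they never receive a recall value.
        if token not in STOPWORDS and naive_stem(token) in query_stems:
            recall[token] = 1
    return recall

query = "what kind of animals are in grasslands"
# Abridged version of the README passage.
passage = ("Tropical grassland animals (which do not all occur in the same "
           "area) include giraffes, zebras, and elephants.")
result = pair_term_recall(query, passage)
```

Here `result` holds only "grassland" and "animals": "in" occurs in both the query and the passage but is dropped as a stopword, and "grassland" matches "grasslands" because both reduce to the same stem.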

UPDATE: I have uploaded the QTR generation scripts: scripts/get_training_query_term_recall.py

hist0613 commented 4 years ago

Thanks to that, it becomes clear!