Closed by kiucho 1 year ago
Hi, thanks for the question.
For the dev set, we run retrieval prediction from 'question' -> 'docids', so we don't want to add passages to the dev dataset. We add both the passage and the question to the train set because we do multitask training on indexing and retrieval, as described in Section 3.3 and Figure 4 of the original paper. If the data example is 'passage' -> 'docid', it is an indexing example. If the data example is 'question' -> 'docid', it is a retrieval example. Here we simply set the multitask ratio to 1, i.e., each indexing example has one retrieval example.
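To make the split concrete, here is a minimal sketch of the logic described above. It is not the actual code from process_marco.py: the function name `build_dsi_splits`, the tuple layout, and the `text` / `text_id` / `task` fields are assumptions made for illustration only.

```python
def build_dsi_splits(train_pairs, dev_pairs):
    """Build DSI train/dev examples from (question, passage, docid) tuples.

    Hypothetical helper: names and record fields are illustrative and are
    not taken from process_marco.py.
    """
    train_data, dev_data = [], []

    # Train pairs contribute one indexing and one retrieval example each,
    # which gives the 1:1 multitask ratio for these pairs.
    for question, passage, docid in train_pairs:
        train_data.append({"text": passage, "text_id": docid, "task": "indexing"})
        train_data.append({"text": question, "text_id": docid, "task": "retrieval"})

    # Dev pairs: the passage is still indexed during training so the model
    # knows the docid, but the question is held out for evaluation only.
    for question, passage, docid in dev_pairs:
        train_data.append({"text": passage, "text_id": docid, "task": "indexing"})
        dev_data.append({"text": question, "text_id": docid, "task": "retrieval"})

    return train_data, dev_data


# Toy usage with made-up data.
train_data, dev_data = build_dsi_splits(
    train_pairs=[("what is dsi", "DSI is a differentiable search index.", "0")],
    dev_pairs=[("who wrote it", "The paper was written by its authors.", "1")],
)
```

In this sketch the dev set contains only 'question' -> 'docid' examples, while every passage, including those that belong to dev questions, appears in the train set as an indexing example.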
Thank you for your quick reply!
Now I fully understand. Thank you very much for the detailed explanation.
You are welcome, I'm glad to help :)
Hello, I am a student reproducing "DSI".
I am creating the MS_MARCO train and dev data with "get_data.sh", and I have a question.
In process_marco.py, lines 73~74,
we append both the question and the positive passage to DSI_train_data,
but in lines 92~95
we append the positive passage to DSI_train_data and the question to DSI_dev_data.
Maybe I didn't understand the original paper properly, but I don't see why we don't add both the passage and the question to the dev data in this part (lines 92~95).
I'd appreciate a reply. Thank you.