Train and Dev data distribution in process_marco.py

kiucho commented 1 year ago

Hello, I am a student reproducing "DSI".

I am creating the MS_MARCO Train and Dev data with "get_data.sh" and I have a question.

In process_marco.py, lines 73-74:

DSI_train_data.append({'text_id': rand_ids[current_ind], 'text': 'Passage: ' + passage})
DSI_train_data.append({'text_id': rand_ids[current_ind], 'text': 'Question: ' + question})

we append both the Question and the positive Passage to DSI_train_data,

but in lines 92-95:

DSI_train_data.append({'text_id': rand_ids[current_ind],
                       "text": f"Passage: {passage}"})
DSI_dev_data.append({'text_id': rand_ids[current_ind],
                     "text": f"Question: {question}"})

we append the positive Passage to DSI_train_data and the Question to DSI_dev_data.

Maybe I didn't understand the original paper properly, but I don't see why we don't add both the passage and the question to the dev data in this part (lines 92-95).

I'd appreciate it if you left a reply. Thank you.

ArvinZhuang commented 1 year ago

Hi, thanks for the question.

For the dev set, we do the retrieval prediction 'question' -> 'docid', so we don't want to add passages to the dev dataset.

We add both the passage and the question to the train set because we are doing indexing and retrieval multitask training, as described in Section 3.3 and Figure 4 of the original paper. If the data example is 'passage' -> 'docid', it is an indexing example. If the data example is 'question' -> 'docid', it is a retrieval example. Here we simply set the multitask ratio to 1, i.e., each indexing example has one retrieval example.
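To make the split concrete, here is a minimal sketch of the logic discussed above. It is not the actual process_marco.py: the function name `build_dsi_splits`, the `(docid, passage, question)` tuple format, and the toy data are assumptions made for illustration. Only the dict layout and the "Passage: " / "Question: " prefixes follow the snippets quoted in the question.

```python
def build_dsi_splits(train_pairs, dev_pairs):
    """Build DSI train/dev lists from (docid, passage, question) tuples.

    The tuple input format is hypothetical; the output dicts mirror the
    {'text_id': ..., 'text': ...} entries quoted from process_marco.py.
    """
    train_data, dev_data = [], []

    # Train pairs: one indexing example (passage -> docid) plus one
    # retrieval example (question -> docid), i.e. a multitask ratio of 1.
    for docid, passage, question in train_pairs:
        train_data.append({'text_id': docid, 'text': 'Passage: ' + passage})
        train_data.append({'text_id': docid, 'text': 'Question: ' + question})

    # Dev pairs: the passage is still indexed through the train set, but the
    # question is held out so that 'question' -> 'docid' can be evaluated.
    for docid, passage, question in dev_pairs:
        train_data.append({'text_id': docid, 'text': f'Passage: {passage}'})
        dev_data.append({'text_id': docid, 'text': f'Question: {question}'})

    return train_data, dev_data


# Toy data, invented for illustration only.
train_pairs = [('101', 'Paris is the capital of France.', 'capital of france')]
dev_pairs = [('202', 'Water boils at 100 degrees Celsius.', 'boiling point of water')]

train_data, dev_data = build_dsi_splits(train_pairs, dev_pairs)
print(len(train_data), len(dev_data))
```

In this sketch the dev list contains only questions, while every passage (including the ones paired with dev questions) ends up in the train list as an indexing example.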

kiucho commented 1 year ago

Thank you for your quick reply!

Now I fully understand. Thank you very much for the detailed explanation.

ArvinZhuang commented 1 year ago

You are welcome, I'm glad to help :)

ArvinZhuang / DSI-QG