facebookresearch / multihop_dense_retrieval

Multi-hop dense retrieval for question answering

questions on corpus encoding #6

Closed jzhoubu closed 3 years ago

jzhoubu commented 3 years ago

Hi, thanks for sharing the work. I couldn't find the scripts/get_embed.py file mentioned in the README. I'd like to confirm two things: 1) Is scripts/encode_corpus.py equivalent to scripts/get_embed.py? 2) Should $MDR_HOME/data/hotpot_index/wiki_id2doc.json be used as CORPUS_PATH?

xwhan commented 3 years ago

Thanks for pointing this out! Yes, get_embed.py has been renamed to encode_corpus.py, and I have updated the README. For CORPUS_PATH, you can easily build the input corpus file yourself; wiki_id2doc.json is the output of this step, not the input. I have also added a pointer in the README to the HotpotQA corpus processing guide, which walks through building the input corpus file.
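For illustration, here is a minimal sketch of what the input corpus file could look like, assuming each passage is a record with "title" and "text" fields (as in the example later in this thread). Whether encode_corpus.py expects JSON lines or another layout is an assumption here, so please check the script's data loader; the file name and passage contents below are placeholders.

```python
import json

# Hypothetical corpus-building step: write one JSON object per line,
# each holding a passage title and its text.
passages = [
    {"title": "Alan Turing", "text": "Alan Mathison Turing was an English mathematician ..."},
    {"title": "Enigma machine", "text": "The Enigma machine is a cipher device ..."},
]

with open("hotpot_corpus.jsonl", "w") as f:
    for p in passages:
        f.write(json.dumps(p) + "\n")
```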

jzhoubu commented 3 years ago

Thanks for the reply @xwhan .

I am rebuilding the raw corpus from the Wikipedia dump following the functions in encode_datasets.py, and so far I have run into several errors during encoding. For example, I am not sure how you handle the empty text here. I get an error if I keep the empty text as {"title": "whatever", "text": " "}. Did you remove all objects that contain an empty string?

I think it would be more convenient if you could provide the processed input corpus. Would it be possible to share it?

xwhan commented 3 years ago

The processed corpus does have some empty passages. Either removing them or filling the "text" field with the document title should be fine. In encode_datasets.py, the RoBERTa encoder does not accept an empty string for the 'text_pair' argument.
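As a sketch of that workaround, the pass below either drops empty passages or back-fills them with the title. It assumes a JSON-lines corpus of {"title", "text"} records as discussed above; the function name, file names, and flag are placeholders, not part of the repo.

```python
import json

def clean_passages(in_path, out_path, drop_empty=False):
    """Either drop passages with empty text or fill them with the title.

    Assumes a JSON-lines corpus of {"title": ..., "text": ...} records,
    as in the example discussed in this thread.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            passage = json.loads(line)
            if not passage["text"].strip():
                if drop_empty:
                    continue  # option 1: remove the empty passage entirely
                passage["text"] = passage["title"]  # option 2: fill with the title
            fout.write(json.dumps(passage) + "\n")

clean_passages("hotpot_corpus.jsonl", "hotpot_corpus_clean.jsonl")
```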