Thank you for pointing this out! Yes, `get_embed.py` has been renamed to `encode_corpus.py`, and I have updated the README. For `CORPUS_PATH`, you can easily build the input corpus file yourself; `wiki_id2doc.json` is the output of this step, not the input. I have also added a pointer to the HotpotQA corpus processing guide in the README to make building the input corpus file easier.
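For reference, building the input corpus file could look roughly like this. This is only a sketch, assuming a JSON-lines format of `{"title": ..., "text": ...}` records (the record shape discussed later in this thread); the file name `wiki_corpus.jsonl` and the helper `write_corpus` are placeholders, not names from the repo:

```python
import json

def write_corpus(passages, corpus_path):
    """Write (title, text) pairs as one JSON object per line.

    `passages` would come from your own Wikipedia-dump preprocessing.
    """
    with open(corpus_path, "w", encoding="utf-8") as f:
        for title, text in passages:
            f.write(json.dumps({"title": title, "text": text}) + "\n")

# Hypothetical usage: write two toy passages to a corpus file.
write_corpus(
    [("Alan Turing", "Alan Turing was a mathematician ..."),
     ("Enigma", "The Enigma machine was a cipher device ...")],
    "wiki_corpus.jsonl",
)
```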
Thanks for the reply @xwhan. I am reproducing the raw corpus from the Wikipedia dump following the `encode_datasets.py` functions, and so far I have run into several errors during encoding. For example, I am not sure how you handle empty text here: it raises an error if I keep the empty text as `{"title": "whatever", "text": " "}`. Did you remove all objects that contain an empty string?
It would also be more convenient if you could provide the processed input corpus. Would it be possible to share it?
The processed corpus does have some empty passages. Either removing them or filling the `"text"` field with the document title should be fine. In `encode_datasets.py`, the RoBERTa encoder does not accept an empty string for the `text_pair` argument.
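In code, that workaround might look like the sketch below (file names are placeholders for your own corpus paths):

```python
import json

# Clean the corpus before encoding: either drop empty passages or
# back-fill "text" with the document title, per the suggestion above.
with open("wiki_corpus.jsonl", encoding="utf-8") as fin, \
     open("wiki_corpus.clean.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        doc = json.loads(line)
        if not doc["text"].strip():
            # Option 1: skip the empty passage entirely
            # continue
            # Option 2: use the title so the encoder's text_pair is non-empty
            doc["text"] = doc["title"]
        fout.write(json.dumps(doc) + "\n")
```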
Hi, thanks for sharing this work. I didn't find the `scripts/get_embed.py` file mentioned in the README. I want to confirm: 1) Is `scripts/encode_corpus.py` equivalent to `scripts/get_embed.py`? 2) Is `$MDR_HOME/data/hotpot_index/wiki_id2doc.json` used for `CORPUS_PATH`?