facebookresearch / DrQA

Reading Wikipedia to Answer Open-Domain Questions

Difficulty executing the distant supervision generate.py script #134

Closed rajashekar2012 closed 5 years ago

rajashekar2012 commented 6 years ago

Hi,

I have prepared 100 question-answer pairs in format A and given them as input to the distant supervision script, which outputs an empty .dstrain file.

I am stuck here. I would like to train with these questions and answers. Can I get help resolving the issue?
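
One quick check before debugging the script itself is whether the input file parses cleanly. The sketch below assumes that format A means one JSON object per line, with a `question` string and an `answer` list of strings. `check_format_a` is a hypothetical helper written for this illustration and is not part of DrQA.

```python
import json


def check_format_a(lines):
    """Return (line_number, problem) pairs for lines that do not look like format A.

    Assumption: format A is one JSON object per line, for example
    {"question": "...", "answer": ["...", "..."]}
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        question = record.get("question")
        answers = record.get("answer")
        if not isinstance(question, str) or not question.strip():
            problems.append((number, "missing or empty 'question'"))
        if (not isinstance(answers, list) or not answers
                or not all(isinstance(a, str) and a.strip() for a in answers)):
            problems.append((number, "'answer' must be a non-empty list of strings"))
    return problems
```

Passing an open file handle works, since iterating over a file yields its lines: `with open(path) as f: print(check_format_a(f))`. An empty result only means the file matches the assumed layout. It says nothing about whether the answers can be found in the retrieved passages.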

ajfisch commented 6 years ago

Can you give some examples?

rajashekar2012 commented 6 years ago

Hi, in the meantime I had a chance to go through the code and started debugging. When I asked the question "Explain melting point", we observed that even though there is an exact match of the answer in the extracted passages, it does not appear in the .dstrain file. We also observed that the token "melting point" is not identified by NER. Is there any provision to make "melting point" be identified by NER?

Q2) Regarding training with the generated .dstrain file: we observed that both a train file and a dev file need to be provided as input to train.py. Why is the dev file necessary, and what should it contain?

ajfisch commented 6 years ago

1) You can modify the distant supervision heuristics to filter out fewer documents.

2) We found that using the standard processed SQuAD dev file works well for multitask training. If you are fine-tuning, you will want to use the dsdev file instead.
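
To illustrate point 1, here is a simplified sketch of how a strict distant supervision filter can discard every passage, and how loosening it changes the outcome. It is not the DrQA implementation. `keep_passage`, the `require_entity_overlap` switch, and the regex tokenization are all assumptions made for illustration, and `question_entities` stands in for whatever an NER tagger would return for the question.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (a stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())


def contains_span(tokens, span):
    """True if `span` occurs as a contiguous run inside `tokens`."""
    n = len(span)
    if n == 0:
        return False
    return any(tokens[i:i + n] == span for i in range(len(tokens) - n + 1))


def keep_passage(passage, answer, question_entities, require_entity_overlap=True):
    """Decide whether a passage becomes a distantly supervised example.

    Hypothetical heuristic: the passage must contain the answer, and,
    when require_entity_overlap is True, at least one question entity.
    """
    tokens = tokenize(passage)
    if not contains_span(tokens, tokenize(answer)):
        return False
    if require_entity_overlap:
        # With no recognized entities, this strict variant rejects everything.
        return any(contains_span(tokens, tokenize(e)) for e in question_entities)
    return True


passage = "The melting point is the temperature at which a solid turns into a liquid."
answer = "temperature at which a solid turns into a liquid"

strict = keep_passage(passage, answer, question_entities=set())
relaxed = keep_passage(passage, answer, question_entities=set(),
                       require_entity_overlap=False)
```

In this sketch `strict` is `False` and `relaxed` is `True`: the passage contains the exact answer, but the strict variant drops it because no entity was recognized in the question. Relaxing a condition of this kind keeps more passages at the cost of noisier training examples.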