Document Reader for different Domain.

samdash commented 6 years ago

Hi Adam,

I was able to successfully complete the POC on the Document Retriever integrated with ElasticSearch. I am now moving on to the Document Reader and need help. I am planning a POC on the Document Reader for a different domain, without using SQuAD-v1.1-dev/train. What are the steps? Do I have to create JSON files similar to SQuAD-v1.1-dev.json and SQuAD-v1.1-train.json? As per the README there are two formats; which one should I produce, Format A or Format B? https://github.com/facebookresearch/DrQA/blob/master/scripts/reader/README.md

If you can quickly list the steps, that would be great. I know there is a README, but it does not mention the steps for a different domain.

ajfisch commented 6 years ago

It doesn't matter what the domain is as long as the file is in the right format.

For completeness, here are some steps you can follow:

1. Obtain a list of question/context/answer pairs and format it like SQuAD (format B).
2. Get the preprocessed train-ready file by running scripts/reader/preprocess.py.
3. You might want to augment the training data with SQuAD data: just concatenate with a preprocessed SQuAD dataset. You can also do fine-tuning by starting with a model pre-trained on SQuAD.
4. Run training with the appropriate train/dev file paths set.
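
As an illustration of the first step, here is a minimal sketch that turns question/context/answer triples into a SQuAD-style (format B) structure. The `to_squad` helper, the sample record, and the default title are hypothetical and not part of DrQA; only the field names follow the SQuAD v1.1 layout quoted later in this thread. It assumes each answer appears verbatim in its context.

```python
import json
import uuid


def to_squad(records, title="custom_domain"):
    """Build a SQuAD v1.1-style dict from (context, question, answer) triples.

    Hypothetical helper for illustration; not part of DrQA.
    """
    paragraphs = {}  # context -> list of question/answer entries
    for context, question, answer in records:
        start = context.find(answer)  # character offset of the answer
        if start < 0:
            raise ValueError(f"answer {answer!r} not found in context")
        qa = {
            "id": uuid.uuid4().hex,  # assumption: any unique string per question
            "question": question,
            "answers": [{"answer_start": start, "text": answer}],
        }
        paragraphs.setdefault(context, []).append(qa)
    return {
        "data": [{
            "title": title,
            "paragraphs": [
                {"context": c, "qas": qas} for c, qas in paragraphs.items()
            ],
        }]
    }


# Made-up example record from an imaginary maintenance manual.
records = [
    ("The pump must be serviced every 500 hours.",
     "How often must the pump be serviced?",
     "every 500 hours"),
]
squad = to_squad(records)
print(json.dumps(squad, indent=2))
```

The resulting dict can be written to a file with `json.dump` and then used as the input to the preprocessing step.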

You might only have question/answer pairs with no context. In that case you can follow the distant supervision readme.

samdash commented 6 years ago

Thanks Adam.

a) The question/context/answer pairs in SQuAD (Format B) have something like "answer_start": 177. What does this represent? Do I need to have it?

{
  "data": [
    {
      "title": "Super_Bowl_50",
      "paragraphs": [
        {
          "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
          "qas": [
            {
              "answers": [
                {
                  "answer_start": 177,
                  "text": "Denver Broncos"
                },
                {
                  "answer_start": 177,
                  "text": "Denver Broncos"
                },
                {
                  "answer_start": 177,
                  "text": "Denver Broncos"
                }
              ],
              "question": "Which NFL team represented the AFC at Super Bowl 50?",
              "id": "56be4db0acb8001400a502ec"
            }
          ]
        }
      ]
    }
  ]
}

What does "Run training with the appropriate train/dev file paths set" mean? Where do I get these files from if all I have is question/context/answer pairs?

You also mentioned that if I had only question/answer pairs with no context, I need to follow the distant supervision README. So for WebQuestions-test.txt, which contains question/answer pairs, we use distant supervision?

ajfisch commented 6 years ago

answer_start is the starting character offset of the answer string in the context.
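
To make the offset concrete, here is a small check against the Super Bowl example quoted above. The context is truncated to its first part for brevity; the variable names are only illustrative.

```python
# First part of the context from the SQuAD example above (truncated).
context = (
    "Super Bowl 50 was an American football game to determine the champion "
    "of the National Football League (NFL) for the 2015 season. The American "
    "Football Conference (AFC) champion Denver Broncos defeated the National "
    "Football Conference (NFC) champion Carolina Panthers"
)
answer = {"answer_start": 177, "text": "Denver Broncos"}

# answer_start is a 0-based character index into the context string.
start = answer["answer_start"]
end = start + len(answer["text"])
span = context[start:end]

# Slicing the context at that offset recovers the answer text.
assert span == answer["text"]
# For a new dataset, the offset can be computed with str.find.
assert context.find(answer["text"]) == start
```

When an answer string occurs more than once in a context, `str.find` returns only the first occurrence, so the offset may need to be chosen by hand in that case.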

For supplying new train/dev files, look at the reader README, specifically the arguments for "Filesystem".

You need to make your new train and dev files. Then you need to train your model using them.
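
One possible way to make the two files is to split a single SQuAD-style dataset by paragraph. The sketch below is an assumption about how one might do this, not a DrQA utility; `split_squad`, `dev_fraction`, and the toy dataset are hypothetical.

```python
import random


def split_squad(dataset, dev_fraction=0.2, seed=0):
    """Split a SQuAD-style dict into train and dev dicts by paragraph.

    Hypothetical helper; article titles are not preserved in this sketch.
    """
    paragraphs = [
        p for article in dataset["data"] for p in article["paragraphs"]
    ]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(paragraphs)
    n_dev = max(1, int(len(paragraphs) * dev_fraction))
    dev, train = paragraphs[:n_dev], paragraphs[n_dev:]

    def wrap(ps, title):
        return {"data": [{"title": title, "paragraphs": ps}]}

    return wrap(train, "train"), wrap(dev, "dev")


# Toy dataset with ten one-question paragraphs (made-up content).
toy = {"data": [{"title": "toy", "paragraphs": [
    {"context": f"Item {i} is blue.",
     "qas": [{"id": str(i),
              "question": f"What colour is item {i}?",
              "answers": [{"answer_start": len(f"Item {i} is "),
                           "text": "blue"}]}]}
    for i in range(10)
]}]}

train, dev = split_squad(toy)
n_train = len(train["data"][0]["paragraphs"])
n_dev = len(dev["data"][0]["paragraphs"])
print(n_train, n_dev)
```

Each half can then be saved as its own JSON file and preprocessed separately before training.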

samdash commented 6 years ago

I tried with just question/answer pairs, using the WebQuestions-test.txt that you provide. I ran python scripts/distant/generate.py data/datasets/ WebQuestions-test.txt /path/to/output/dir and it generated two files, WebQuestions-test.dstrain and WebQuestions-test.dsdev. However, WebQuestions-test.dsdev is zero bytes.

How can I make my new train and dev files, assuming my question/answer pairs are similar to WebQuestions-test.txt?

Sam

ajfisch commented 6 years ago

Please see the distant supervision README.

samdash commented 6 years ago

Here is what I understood after going through all the READMEs. Correct me if I am missing anything.

1) Obtain a list of question/answer pairs and format it like Format A; if you have a context, format it like SQuAD (Format B).

2) If Format B: get the preprocessed train-ready file by running scripts/reader/preprocess.py. Question: to run the above script you need SQuAD train and dev files. How can I generate them for my new domain instead of using the existing ones?

3) If Format A: you can generate the dstrain and dsdev files by executing the command below.
python scripts/distant/generate.py --dev-split 0.4 data/datasets/ WebQuestions-test.txt /Users/xxxx/Downloads
To see the results, execute the command below.
python scripts/distant/check_data.py /Users/xxx/Downloads/WebQuestions-test.dstrain

4) Once step 3 is completed, the README says you can concatenate the dstrain file with the SQuAD preprocessed train file. Once done, you can train the model.
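
A minimal sketch of the concatenation in step 4, assuming both the dstrain file and the preprocessed SQuAD file are text files with one JSON object per line. The `concat_jsonl` helper and all file names here are hypothetical; the demo uses temporary files with made-up records rather than real DrQA output.

```python
import json
import os
import tempfile


def concat_jsonl(inputs, output):
    """Concatenate line-delimited JSON files, skipping blank lines.

    Hypothetical helper; returns the number of examples written.
    """
    count = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # fail early on a malformed line
                    out.write(line + "\n")
                    count += 1
    return count


# Demo with temporary stand-in files.
tmp = tempfile.mkdtemp()
a = os.path.join(tmp, "custom.dstrain")
b = os.path.join(tmp, "squad-processed.txt")
merged = os.path.join(tmp, "merged-train.txt")
with open(a, "w") as f:
    f.write(json.dumps({"id": "a1"}) + "\n" + json.dumps({"id": "a2"}) + "\n")
with open(b, "w") as f:
    f.write(json.dumps({"id": "b1"}) + "\n")

total = concat_jsonl([a, b], merged)
print(total)
```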

Sorry for asking so many questions; I am trying to make it work.

ajfisch commented 6 years ago

There are options to that script to use different input files from the SQuAD datasets.
Yes, but you should be using your own input file, not necessarily WebQuestions.

samdash commented 6 years ago

Something like python scripts/reader/preprocess.py data/datasets data/datasets --split custom-SQuAD-File
Understood, thanks.

vaibhavgeek commented 6 years ago

Following format B,

I do not have question/answer pairs for the given contexts. Can I still train on the data and run the reader successfully?

samdash commented 6 years ago

If you have Format B, then yes.

goldenbili commented 6 years ago

I want to know what id means in Format B.

facebookresearch / DrQA

Document Reader for different Domain. #76