Closed samdash closed 6 years ago
It doesn't matter what the domain is as long as the file is in the right format.
For completeness, here are some steps you can follow:
scripts/reader/preprocess.py
You might only have question/answer pairs with no context. In that case you can follow the distant supervision readme.
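For reference, here is a minimal sketch of writing question/answer pairs to a Format A file. It assumes Format A is one JSON object per line with a "question" string and an "answer" list, as described in the reader README. The helper names, the sample pair, and the output file name are hypothetical.

```python
import json


def format_a_line(question, answers):
    """Serialize one question with its acceptable answers as a single JSON line."""
    return json.dumps({"question": question, "answer": list(answers)})


def write_format_a(pairs, path):
    """Write (question, answers) pairs to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answers in pairs:
            f.write(format_a_line(question, answers) + "\n")


# Hypothetical example pair; replace with pairs from your own domain.
pairs = [
    ("Which NFL team represented the AFC at Super Bowl 50?", ["Denver Broncos"]),
]
for question, answers in pairs:
    print(format_a_line(question, answers))

# To write a file (the name is an assumption):
# write_format_a(pairs, "my-domain-qa.txt")
```

The resulting file can then be passed to the distant supervision script in place of WebQuestions-test.txt.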
Thanks, Adam.
a) The question/context/answer pairs in SQuAD (Format B) include a field like "answer_start": 177. What does this represent, and do I need to have it?
{
  "data": [
    {
      "title": "Super_Bowl_50",
      "paragraphs": [
        {
          "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
          "qas": [
            {
              "answers": [
                {
                  "answer_start": 177,
                  "text": "Denver Broncos"
                },
                {
                  "answer_start": 177,
                  "text": "Denver Broncos"
                },
                {
                  "answer_start": 177,
                  "text": "Denver Broncos"
                }
              ],
              "question": "Which NFL team represented the AFC at Super Bowl 50?",
              "id": "56be4db0acb8001400a502ec"
            }
          ]
        }
      ]
    }
  ]
}
What does "Run training with appropriate train/dev file path set" mean? Where do I get these files from if all I have is question/context/answer pairs?
You also mentioned that if I only have question/answer pairs with no context, I need to follow the distant supervision README. So for WebQuestions-test.txt, which contains question/answer pairs, do we use distant supervision?
answer_start is the starting character offset of the answer string in the context.
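As a minimal sketch of how you could compute that offset for your own data, assuming the answer text appears verbatim in the context: Python's str.find returns the character index of the first match. The helper name is hypothetical, and the context below is shortened from the Super Bowl 50 example.

```python
def find_answer_start(context, answer_text):
    """Return the character offset of the first occurrence of answer_text in context."""
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("answer text not found verbatim in context")
    return start


# Shortened version of the SQuAD context quoted in this thread.
context = (
    "Super Bowl 50 was an American football game to determine the champion "
    "of the National Football League (NFL) for the 2015 season. "
    "The American Football Conference (AFC) champion Denver Broncos defeated "
    "the National Football Conference (NFC) champion Carolina Panthers."
)
answer = "Denver Broncos"

start = find_answer_start(context, answer)
# Slicing the context at the offset must give back the answer string.
assert context[start:start + len(answer)] == answer
print(start)  # 177
```

Note that str.find only locates the first occurrence, so if the answer string appears more than once in a context you would need to pick the intended one yourself.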
For supplying new train/dev files, look at the reader README, specifically the arguments for "Filesystem".
You need to make your new train and dev files. Then you need to train your model using them.
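One way to make the two files, sketched under the assumption that you already have a single Format B (SQuAD-style) JSON object with a top-level "data" list: split the articles into a train part and a dev part and save each as its own file. The function names, the 10% dev fraction, and the file names are all assumptions for illustration, not something the repo prescribes.

```python
import json
import random


def split_squad(dataset, dev_fraction=0.1, seed=0):
    """Split a SQuAD-style dict into (train, dev) dicts at the article level."""
    articles = list(dataset["data"])
    random.Random(seed).shuffle(articles)
    if len(articles) > 1:
        # Keep at least one article on each side of the split.
        n_dev = min(len(articles) - 1, max(1, int(len(articles) * dev_fraction)))
    else:
        n_dev = 0
    # Copy any other top-level keys (for example "version") into both parts.
    extra = {k: v for k, v in dataset.items() if k != "data"}
    dev = dict(extra, data=articles[:n_dev])
    train = dict(extra, data=articles[n_dev:])
    return train, dev


def save_json(obj, path):
    """Write a dict to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f)


# Tiny in-memory example with empty articles, just to show the split sizes.
example = {"version": "1.1",
           "data": [{"title": f"doc_{i}", "paragraphs": []} for i in range(10)]}
train, dev = split_squad(example, dev_fraction=0.2)
print(len(train["data"]), len(dev["data"]))

# Hypothetical output names:
# save_json(train, "my-domain-train.json")
# save_json(dev, "my-domain-dev.json")
```

Splitting by article rather than by question keeps all questions about one context on the same side, so the dev set is not contaminated by contexts seen in training.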
I tried with just question/answer pairs, like the ones you have in WebQuestions-test.txt.
Then I ran python scripts/distant/generate.py data/datasets/ WebQuestions-test.txt /path/to/output/dir
It generated two files, WebQuestions-test.dstrain and WebQuestions-test.dsdev; however, WebQuestions-test.dsdev is zero bytes.
How can I make my new train and dev files, assuming my question/answer pairs are similar to WebQuestions-test.txt?
Sam
Here is what I understood after going through all the READMEs. Correct me if I am missing anything.
1) Obtain a list of question/answer pairs and format it like Format A; if you also have context, format it like SQuAD (Format B).
2) If Format B: get the preprocessed, train-ready file by running scripts/reader/preprocess.py.
Question: to run the above script you need SQuAD train and dev files. How can I generate them for my new domain instead of using the existing ones?
3) If Format A: you can generate dstrain and dsdev by executing the command below.
python scripts/distant/generate.py --dev-split 0.4 data/datasets/ WebQuestions-test.txt /Users/xxxx/Downloads
To see the results, execute the command below.
python scripts/distant/check_data.py /Users/xxx/Downloads/WebQuestions-test.dstrain
4) Once step 3 is complete, the README says you can concatenate dstrain with the SQuAD preprocessed train file; once that is done, you can train the model.
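The concatenation in step 4 could look roughly like the sketch below. It assumes both the .dstrain file and the preprocessed SQuAD train file are line-oriented (one example per line), which you should verify for your own files. The function name and the demo file names are hypothetical; the demo uses throwaway temp files so it runs without any real data.

```python
import os
import tempfile


def concat_line_files(paths, out_path):
    """Append the non-empty lines of each input file to out_path; return the line count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if line:
                        out.write(line + "\n")
                        count += 1
    return count


# Demo with placeholder files standing in for the real train files.
tmp = tempfile.mkdtemp()
first = os.path.join(tmp, "first.txt")
second = os.path.join(tmp, "second.txt")
with open(first, "w", encoding="utf-8") as f:
    f.write('{"id": "a"}\n')
with open(second, "w", encoding="utf-8") as f:
    f.write('{"id": "b"}\n{"id": "c"}\n')

merged = os.path.join(tmp, "merged.txt")
print(concat_line_files([first, second], merged))
```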
Sorry for asking so many questions; I am trying to make it work.
python scripts/reader/preprocess.py data/datasets data/datasets --split custom-SQuAD-File
Following Format B: I do not have question/answer pairs for the given contexts. Can I still train the reader successfully?
If you have Format B, then yes.
I want to know what the id field in Format B means.
Hi Adam,
I was able to successfully do the POC on the Document Retriever integrated with ElasticSearch, and I am now moving on to the Document Reader and need help. I am planning to do a POC on the Document Reader for a different domain, without using SQuAD-v1.1-dev/train. What are the steps? Do I have to create JSON files similar to SQuAD-v1.1-dev.json and SQuAD-v1.1-train.json? As per the README there are two formats; which one should I develop, Format A or Format B? https://github.com/facebookresearch/DrQA/blob/master/scripts/reader/README.md
If you can quickly list the steps, that would be great. I know there is a README, but it does not mention steps for a different domain.