butsugiri / chainer-abcnn

Re-implementation of Attention-Based Convolutional Neural Network (ABCNN) by Chainer

Clarification in the jsonify.py code #4

Open kurtespinosa opened 7 years ago

kurtespinosa commented 7 years ago

Dear Butsugiri,

Thank you for sharing your code. I have a question about the input dataset, which I need to jsonify. I downloaded the dataset and used the respective data partitions, for example WikiQA-test.tsv for the test set, which has the sample entry below.

QuestionID: Q0
Question: HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US
DocumentID: D0
DocumentTitle: African immigration to the United States
SentenceID: D0-0
Sentence: African immigration to the United States refers to immigrants to the United States who are or were nationals of Africa .
Label: 0

Now I'm confused, because in the jsonify code the question would point to D0-0, which is the SentenceID. It seems that question_id and question were interchanged. Am I right, or did I miss something?

question_id = data[1]
....
question = data[-3]
answer = data[-2]
....
....
'question': question.lower().split(" "),
'answer': answer.lower().split(" "),

Should it have been the following?

question = data[1]
.....
question_id = data[-3]
answer = data[-2]
....
....
'question': question.lower().split(" "),
'answer': answer.lower().split(" "),
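
The index mismatch described here can be checked directly against the sample row. The sketch below is only an illustration: it assumes the raw file is tab-separated (suggested by the .tsv extension) and rebuilds the quoted row by hand, then prints which column each positive and negative index resolves to. The names `question_id_as_written` and `question_as_written` are hypothetical and only mirror the two assignments from the snippet.

```python
# Column names and the sample row from WikiQA-test.tsv as quoted in this thread.
header = ["QuestionID", "Question", "DocumentID", "DocumentTitle",
          "SentenceID", "Sentence", "Label"]
row = "\t".join([
    "Q0",
    "HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US",
    "D0",
    "African immigration to the United States",
    "D0-0",
    "African immigration to the United States refers to immigrants to the "
    "United States who are or were nationals of Africa .",
    "0",
])

# Assumption: fields are separated by tabs in the raw file.
data = row.split("\t")

# Show every field with both its positive and its negative index.
for i, (name, value) in enumerate(zip(header, data)):
    print(f"data[{i}] / data[{i - len(data)}]  {name}: {value}")

# The two assignments from the snippet, applied to the raw row.
question_id_as_written = data[1]   # resolves to the Question column
question_as_written = data[-3]     # resolves to the SentenceID column, "D0-0"
```

On the raw row, data[-2] is the Sentence column and the QuestionID itself sits at data[0].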

Cheers, Kurt

butsugiri commented 7 years ago

Hi,

The indexing for some variables, like question and question_id, seems interchanged because jsonify.py expects its input to have gone through some extra preprocessing beforehand (I am sorry that it is not provided in this repo). That preprocessing basically removes the questions that have no correct answer, as described in the original paper. So please fix the code if you think it is necessary.
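
Since the preprocessing script is not in the repo, here is a minimal sketch of what the filtering step could look like. It is not the original script. It assumes the raw WikiQA file is tab-separated with the header shown earlier in this thread, and that a Label of "1" marks a correct sentence. The function name `filter_answerable` is hypothetical. The sketch keeps the raw column order, so it does not reproduce whatever column layout jsonify.py actually expects.

```python
def filter_answerable(lines):
    """Return the header plus the rows whose question has at least one
    sentence with Label == "1" (assumed to mean "correct answer")."""
    header, *rows = [line.rstrip("\n") for line in lines if line.strip()]
    columns = header.split("\t")
    qid_idx = columns.index("QuestionID")
    label_idx = columns.index("Label")

    # First pass: collect the IDs of questions with a correct sentence.
    answerable = set()
    for row in rows:
        fields = row.split("\t")
        if fields[label_idx] == "1":
            answerable.add(fields[qid_idx])

    # Second pass: keep every row that belongs to one of those questions.
    kept = [row for row in rows if row.split("\t")[qid_idx] in answerable]
    return [header] + kept


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    with open("WikiQA-test.tsv", encoding="utf-8") as src:
        filtered = filter_answerable(src)
    with open("WikiQA-test.filtered.tsv", "w", encoding="utf-8") as dst:
        dst.write("\n".join(filtered) + "\n")
```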

After preprocessing, the file should look like:

{"label": "0", "sentence_id": "D11-0", "question": ["how", "big", "is", "bmc", "software", "in", "houston", ",", "tx"], "title": "BMC Software", "answer": ["bmc", "software", ",", "inc.", "is", "an", "american", "company", "specializing", "in", "business", "service", "management", "(", "bsm", ")", "software", "."], "document_id": "D11", "question_id": "Q11"}
{"label": "0", "sentence_id": "D11-1", "question": ["how", "big", "is", "bmc", "software", "in", "houston", ",", "tx"], "title": "BMC Software", "answer": ["headquartered", "in", "houston", ",", "texas", ",", "bmc", "develops", ",", "markets", "and", "sells", "software", "used", "for", "multiple", "functions", ",", "including", "it", "service", "management", ",", "data", "center", "automation", ",", "performance", "management", ",", "virtualization", "lifecycle", "management", "and", "cloud", "computing", "management", "."], "document_id": "D11", "question_id": "Q11"}

Each line contains one QA pair in JSON format.
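
A file in this one-object-per-line layout can be read back with the standard json module. The sketch below is only a sanity check, not part of the repo: the function name `read_qa_pairs` and the `EXPECTED_KEYS` set are hypothetical, and the key set is simply taken from the two example lines above.

```python
import json

# Keys observed in the example lines above (an assumption about the full file).
EXPECTED_KEYS = {"label", "sentence_id", "question", "title",
                 "answer", "document_id", "question_id"}


def read_qa_pairs(lines):
    """Parse one JSON object per non-empty line and check its keys."""
    pairs = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        pairs.append(record)
    return pairs


# The first example line from this thread, used as test input.
sample_line = (
    '{"label": "0", "sentence_id": "D11-0", "question": ["how", "big", "is", '
    '"bmc", "software", "in", "houston", ",", "tx"], "title": "BMC Software", '
    '"answer": ["bmc", "software", ",", "inc.", "is", "an", "american", '
    '"company", "specializing", "in", "business", "service", "management", '
    '"(", "bsm", ")", "software", "."], "document_id": "D11", '
    '"question_id": "Q11"}'
)

pairs = read_qa_pairs([sample_line])
print(pairs[0]["question_id"], " ".join(pairs[0]["question"]))
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.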

kurtespinosa commented 7 years ago

Thank you for taking the time to answer my question. This clarifies it.