Tan-JT / QuestionAnsweringAIP


Closed-domain Question Answering on AIP dataset

Introduction

The goal of this project is to help Air Traffic Controllers retrieve information swiftly from a corpus consisting of ATC manuals.

Documentation

Code Organization

The code is organized into several main directories, corresponding to the stages below.

Data Processing

I converted the AIP manual to a .txt file by parsing it with tika, cleaning the data by removing symbols such as ~, ^ and *. List items were concatenated onto the same line, with extra whitespace removed. I then extracted 40 questions from this passage using cdQA-annotator, in SQuAD format, as follows:

{"version": "v2.0",
 "data": [{"title": "AIP",
           "paragraphs": [{"qas": [{"question": "When is the deadline for agents of non-scheduled flights to submit their slot requests?",
                                    "id": "questionID1",
                                    "answers": [{"answer_start": 180,
                                                 "text": "no later than 24 hours prior to the operation of the flight"}]},
                                   {"question": "Who should operators of commercial flights submit their slot requests to?",
                                    "id": "questionID2",
                                    "answers": [{"answer_start": 116,
                                                 "text": "Changi Slot Coordinator"}]}],
                           "context": "Operators or agents of non-scheduled, commercial and non-commercial flights shall submit their slot requests to the Changi Slot Coordinator no earlier than 7 calendar days but no later than 24 hours prior to the operation of the flight, for which the slot will be utilized"}]}]}
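The cleaning step above can be sketched as a small helper; `clean_aip_text` is a hypothetical function name, and the tika call shown in the comment assumes the manual is available locally as a PDF:

```python
import re

def clean_aip_text(text):
    """Strip stray symbols and merge list items onto one line,
    mirroring the preprocessing described above (illustrative helper)."""
    # Remove symbols such as ~, ^ and *
    text = re.sub(r"[~^*]", "", text)
    # Concatenate list items onto the same line and collapse whitespace
    text = re.sub(r"\s*\n\s*", " ", text)
    return re.sub(r" {2,}", " ", text).strip()

# Raw text would first be extracted with tika, e.g.:
#   from tika import parser
#   raw = parser.from_file("AIP.pdf")["content"]
sample = "slot requests to the ~Changi* Slot\n  Coordinator"
print(clean_aip_text(sample))  # -> slot requests to the Changi Slot Coordinator
```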

Evaluation

Each model is evaluated on its top three predictions for each question, using the SQuAD-style F1 score: the harmonic mean of token-level precision and recall between the predicted and reference answers.
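The F1 score used here can be computed per question as follows (a minimal sketch of the standard SQuAD token-overlap metric, without the usual punctuation/article normalization):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer:
    the harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# 3 of the prediction's 3 tokens match 3 of the reference's 4 tokens
print(round(token_f1("Changi Slot Coordinator",
                     "the Changi Slot Coordinator"), 4))  # -> 0.8571
```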

Results

After evaluating each model individually, I constructed ensembles to see whether accuracy could be improved. Each model excels at different question types, so the goal is to choose pairs that best compensate for each other's weaknesses.

| F1 Score | ALBERT | ROBERTA | ELECTRA | BERT | Ensemble (ALBERT + ELECTRA) | Ensemble (ALBERT + ROBERTA) |
| --- | --- | --- | --- | --- | --- | --- |
| Overall | 0.5745 | 0.5781 | 0.5471 | 0.3472 | 0.7166 | 0.6890 |
| Single Supporting Fact | 0.5952 | 0.7437 | 0.6708 | 0.4400 | 0.7379 | 0.7640 |
| Yes/No Questions | 0.2899 | 0.2222 | 0.6667 | 0.3939 | 0.6667 | 0.2962 |
| Lists/Sets | 0.5931 | 0.4477 | 0.1333 | 0.1200 | 0.5931 | 0.6594 |
| Simple Negation | 0.8488 | 0.3140 | 0.2465 | 0.0612 | 0.8488 | 0.8721 |
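A simple way to pair two readers (an illustrative scheme under assumed data shapes, not necessarily the exact combination rule used for the ensembles above) is to keep, per question, the answer whose model reports the higher confidence score:

```python
def ensemble_predict(preds_a, preds_b):
    """Combine two models' predictions by confidence.
    Each input maps question id -> (answer_text, confidence_score);
    the higher-scoring answer wins per question (hypothetical scheme)."""
    combined = {}
    for qid in preds_a.keys() | preds_b.keys():
        a = preds_a.get(qid, ("", 0.0))
        b = preds_b.get(qid, ("", 0.0))
        combined[qid] = a[0] if a[1] >= b[1] else b[0]
    return combined

# Illustrative inputs with made-up confidence scores
albert = {"questionID1": ("24 hours prior", 0.62)}
electra = {"questionID1": ("no later than 24 hours prior to the operation of the flight", 0.81)}
print(ensemble_predict(albert, electra))
```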