Dijital-Twin / model


feat: Finetune QA Model #1

Closed emirsoyturk closed 6 months ago

emirsoyturk commented 6 months ago

Goal

Since the desired result could not be achieved by fine-tuning language models, QA models will be tried instead. In this context, models such as "roberta-base-squad2" will be tested, and the answers they produce for a given question will be combined with the MASK model to produce an output (see https://github.com/Dijital-Twin/model/issues/2).

Steps

MGurcan commented 6 months ago

Commit Url

Enhancing QA Models

Objective

The primary goal is to explore the effectiveness of different QA models in generating accurate answers to questions. This exploration includes testing models like roberta-base-squad2 and others. Given that fine-tuning language models did not yield the desired outcomes, the focus will shift towards QA models. The responses from these models will be augmented using a MASK model to produce refined outputs.
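
To illustrate the intended combination, the sketch below chains a question-answering model with a fill-mask model using the Hugging Face transformers pipelines. The masking template and the fill-mask model choice are assumptions for illustration only; the issue does not specify how the MASK model consumes the QA answer.

from transformers import pipeline

# Extractive QA model returns a span from the context as the answer
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
answer = qa(question="What does the fox jump over?",
            context="The quick brown fox jumps over the lazy dog.")["answer"]

# Fill-mask model rewrites the answer into a fuller sentence (hypothetical template)
fill_mask = pipeline("fill-mask", model="roberta-base")
sentence = fill_mask(f"The fox jumps over {answer}, a very <mask> animal.")[0]["sequence"]
print(answer, "->", sentence)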

Methodology

As a result of the research, it was determined that QA models can be fine-tuned with Haystack, and that this can both increase accuracy and decrease response times.

Fine-Tuning Code for "deepset/roberta-base-squad2"

from haystack.nodes import FARMReader

# Load the pre-trained extractive QA model
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
# Fine-tune on the SQuAD-formatted FriendsQA data for 3 epochs and save the result
reader.train(data_dir="../data", train_filename="squad_formatted_friendsqa_data.json", use_gpu=True, n_epochs=3, save_dir="my_model_3epoch")
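
Once training finishes, the fine-tuned reader can be reloaded from the save directory and queried directly. A minimal sketch (the question and document below are illustrative):

from haystack.nodes import FARMReader
from haystack.schema import Document

# Reload the model saved by the training run above
reader = FARMReader(model_name_or_path="my_model_3epoch", use_gpu=True)

# Run extractive QA over an in-memory document
result = reader.predict(
    query="What does the fox jump over?",
    documents=[Document(content="The quick brown fox jumps over the lazy dog.")],
    top_k=1,
)
print(result["answers"][0].answer)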

It was concluded that the fine-tuning process for QA models requires data in SQuAD format.

SQuAD Data Format

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answers of questions can be any sequence of tokens in the given text. [1]

Example format

{
"data": [
    {
        "paragraphs": [
            {
                "context": "The quick brown fox jumps over the lazy dog.",
                "qas": [
                    {
                        "question": "What does the fox jump over?",
                        "id": "q1",
                        "answers": [
                            {
                                "text": "the lazy dog",
                                "answer_start": 31
                            }
                        ]
                    }
                ]
            }
        ],
        "title": "Example"
    }
],
"version": "2.0"
}
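
Note that answer_start is the character offset of the answer span inside context (31 for "the lazy dog" above). A quick sanity check in Python:

context = "The quick brown fox jumps over the lazy dog."
answer = "the lazy dog"

# answer_start must point at the first character of the answer inside the context
answer_start = context.find(answer)
print(answer_start)  # 31
assert context[answer_start:answer_start + len(answer)] == answer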

The FriendsQA dataset, obtained from the emorynlp/FriendsQA project, was converted to SQuAD format, and fine-tuning experiments were performed on the models.

During this conversion phase, Chinmay Bhalerao's Medium article and Haystack [2] were used.
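
A rough sketch of the conversion logic: each FriendsQA scene's utterances are joined into a single SQuAD-style context and the answer offsets are recomputed against that context. The field names used below are assumptions for illustration and do not necessarily match the actual FriendsQA schema.

import json

def scene_to_paragraph(scene):
    # Join the scene's utterances into one context string (field names assumed)
    context = " ".join(f"{u['speaker']}: {u['text']}" for u in scene["utterances"])
    qas = []
    for qa in scene["questions"]:
        answer_text = qa["answer_text"]
        start = context.find(answer_text)  # recompute the character offset in the joined context
        if start == -1:
            continue  # skip answers that cannot be located verbatim
        qas.append({"question": qa["question"], "id": qa["id"],
                    "answers": [{"text": answer_text, "answer_start": start}]})
    return {"context": context, "qas": qas}

with open("friendsqa.json") as f:
    friendsqa = json.load(f)

squad = {"version": "2.0",
         "data": [{"title": scene.get("title", "scene"),
                   "paragraphs": [scene_to_paragraph(scene)]}
                  for scene in friendsqa["data"]]}

with open("squad_formatted_friendsqa_data.json", "w") as f:
    json.dump(squad, f, ensure_ascii=False, indent=2)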

Below you can see the parameters and response times of some of the models that were tried.

Models Used: Parameters and Response Times

Below are the analysis graphs for a Friends quiz of 142 questions: the models' outputs are compared against the expected answers with difflib.SequenceMatcher, alongside each QA model's own confidence scores.

[Analysis graphs: SequenceMatcher similarity and QA model scores for each tested model]
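
The similarity metric behind these graphs can be reproduced with difflib.SequenceMatcher. A minimal sketch comparing a model's answer to the expected quiz answer (the sample strings are illustrative):

from difflib import SequenceMatcher

def answer_similarity(predicted: str, expected: str) -> float:
    # Ratio in [0, 1]; 1.0 means the normalized strings match exactly
    return SequenceMatcher(None, predicted.lower().strip(), expected.lower().strip()).ratio()

print(answer_similarity("the lazy dog", "lazy dog"))  # 0.8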

These evaluations showed that fine-tuning reduced response time and had a positive effect on the accuracy of the produced answers. Among the models tried, the best is considered to be the deepset/roberta-base-squad2 model trained for 3 epochs, producing output with 3 retrievers and 5 readers.
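
For context, "3 retrievers and 5 readers" is read here as the retriever and reader top_k values of a Haystack extractive QA pipeline; that interpretation, as well as the in-memory document store and TF-IDF retriever below, are assumptions not confirmed by this issue.

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import FARMReader, TfidfRetriever
from haystack.pipelines import ExtractiveQAPipeline

# Index the dialogue corpus (a single illustrative document here)
document_store = InMemoryDocumentStore()
document_store.write_documents([{"content": "Joey: How you doin'?"}])

retriever = TfidfRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="my_model_3epoch", use_gpu=True)
pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# top_k=3 retrieved passages, top_k=5 candidate answers (assumed mapping of "3 retrievers / 5 readers")
result = pipe.run(query="What does Joey say?",
                  params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 5}})
print(result["answers"][0].answer)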