AMontgomerie / question_generator

An NLP system for generating reading comprehension questions
MIT License

Question on the QA Evaluator #21

Open eQX8HNMTU2 opened 1 year ago

eQX8HNMTU2 commented 1 year ago

Hello,

I'm looking at the source code to try to understand it, since I'm new to this field, but I'm stuck on the QA Evaluator.

I don't understand this method:

def _get_ranked_qa_pairs(
        self, generated_questions: List[str], qg_answers: List[str], scores, num_questions: int = 10
    ) -> List[Mapping[str, str]]:
        """Ranks generated questions according to scores, and returns the top num_questions examples.
        """
        if num_questions > len(scores):
            num_questions = len(scores)
            print((
                f"\nWas only able to generate {num_questions} questions.",
                "For more questions, please input a longer text.")
            )

        qa_list = []

        for i in range(num_questions):
            index = scores[i]
            qa = {
                "question": generated_questions[index].split("?")[0] + "?",
                "answer": qg_answers[index]
            }
            qa_list.append(qa)

        return qa_list

It says it ranks the questions based on the score, but I don't see any sorting based on the score happening. I also tried debugging the code, and it turns out the score doesn't take values between 0 and 1 as I initially anticipated, but rather some arbitrary number. The lowest I found was 2 and the highest was 120+. So I assumed that the higher the score, the better the question and answer match. However, even for questions with a score of 120, it still sometimes doesn't feel like they match much better than those with a score of ~30-40.

However, I could not find any information about this in your README or on the Hugging Face page of the model.

I hope you can maybe give me some insight into the model and its inner workings. Thank you in advance. :)

AMontgomerie commented 1 year ago

Yeah, you're right that there's no ranking going on in that method; the docstring is wrong. I think the sorting actually happens on this line inside get_scores instead:

[k for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)]
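
That returns the question indices sorted by their scores, from highest to lowest, and that ranked list is what ends up being passed to _get_ranked_qa_pairs as its scores argument, so that method only has to walk through the first num_questions entries. A rough sketch with made-up scores:

    # Hypothetical evaluator scores (raw logits), keyed by question index.
    scores = {0: 4.1, 1: -2.3, 2: 7.8}

    # What get_scores returns: the question indices sorted by score, best first.
    ranked_indices = [k for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)]
    print(ranked_indices)  # [2, 0, 1]

    # _get_ranked_qa_pairs then looks up generated_questions[2],
    # generated_questions[0], generated_questions[1] in that order.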

I just had another look at the QAEvaluator model, and it looks like it's just an instance of transformers.BertForSequenceClassification, i.e. a BERT model with a binary classification head (a linear layer with 2 outputs) on top. What I did here was take the raw logit of the positive class and use it directly as the "score" for ranking. You can see this in _evaluate_qa():

    @torch.no_grad()
    def _evaluate_qa(self, encoded_qa_pair: torch.tensor) -> float:
        """Takes an encoded QA pair and returns a score."""
        output = self.qae_model(**encoded_qa_pair)
        return output[0][0][1]  # the raw logit of the positive class

output has several values which can be accessed by index (or by named attributes these days, e.g. output.logits). output[0] gives you the logits, which look something like [[-2.3, 4.1]], where -2.3 is the score of the negative class and 4.1 is the score of the positive class.
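
If you want something in the 0-1 range like you were originally expecting, one option (not what the code above does, just an illustration using the same example logits) is to softmax the two logits and take the positive-class probability:

    import torch

    # Example logits for one QA pair: [negative-class score, positive-class score].
    logits = torch.tensor([[-2.3, 4.1]])

    # Softmax turns the two logits into probabilities that sum to 1;
    # [0, 1] picks out the positive-class probability.
    match_probability = torch.softmax(logits, dim=-1)[0, 1].item()
    print(match_probability)  # ~0.998

That's still just a rescaled classifier output rather than something trained to predict match quality directly, so I wouldn't read too much into the exact number.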

Looking back on it, this was definitely not the best approach: I appear to have trained a binary classification model and then used it as a regression/ranking model at inference time. I would probably approach this differently if I were to do it again.

eQX8HNMTU2 commented 1 year ago

Hey, thanks for your response. I'm thinking of using another tool for the question generation. It also has a question answering feature: you give it a text and it generates a response using the whole text as context. For questions where no factual statement appears in the text, i.e. questions that require a broader perspective rather than a few words taken from somewhere, it sometimes outputs gibberish. Is there some way I can use your QA evaluation tool on that? If so, how should I use it, given that you said you weren't using it quite right yourself? What I basically need is a percentage of how well the question and answer match, so that when the user asks about a location and the response is some date or some object etc., the QA evaluator outputs a low value.

I need this for my thesis.