deepset-ai / COVID-QA

API & Webapp to answer questions about COVID-19. Using NLP (Question Answering) and trusted data sources.
Apache License 2.0

Results with custom dataset #112

Open aaronbriel opened 3 years ago

aaronbriel commented 3 years ago

Hello!

First of all, thank you again for your incredible contribution with not only this dataset, but most importantly with the Haystack toolset!

I was able to closely approximate the results of your paper when running https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_crossvalidation.py, although I had to reduce batch_size to 25 to prevent a CUDA out-of-memory error (RuntimeError: CUDA out of memory. Tried to allocate 540.00 MiB (GPU 0; 15.78 GiB total capacity; 14.29 GiB already allocated; 386.75 MiB free; 14.35 GiB reserved in total by PyTorch)). This was on an Ubuntu 18.04 VM with a Tesla V100 GPU and 128 GB of disk space. As mentioned, the results were quite close: XVAL EM: 0.26151560178306094, XVAL f1: 0.5858967501101285.
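In case it helps anyone hitting the same OOM error, here is a minimal sketch of how I think about picking the batch size before launching the script. The 16 GB threshold and the 50 used for larger GPUs are my own placeholders, not values taken from the FARM example:

```python
import torch

# Quick sanity check of available GPU memory before choosing a batch_size for
# examples/question_answering_crossvalidation.py. On a 16 GB Tesla V100 the
# default overflowed; 25 fit comfortably with roberta-base at max_seq_len=384.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024 ** 3
    print(f"GPU: {props.name}, total memory: {total_gb:.1f} GB")
    # Heuristic placeholders: 25 worked for me on 16 GB; 50 is an arbitrary
    # guess for larger cards, not the script's default.
    batch_size = 25 if total_gb <= 16 else 50
else:
    batch_size = 8  # CPU fallback for smoke tests only

print(f"Using batch_size={batch_size}")
```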

I created a custom Covid-19 dataset that combines a preprocessed/cleansed subset of the dataset from the paper "Collecting Verified COVID-19 Question Answer Pairs" (Poliak et al., 2020) with a SQuADified version of your dataset, faq_covidbert.csv. For the latter, I used your annotation tool to map questions to chunks in the answers, treating the full answers as contexts.
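For reference, the merge itself is straightforward once both sources are in SQuAD-style JSON (top-level "data", paragraphs with "qas"). A rough sketch, with placeholder file names:

```python
import json

def merge_squad(paths, out_path, version="v2.0"):
    """Concatenate the "data" sections of several SQuAD-format files."""
    merged = {"version": version, "data": []}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            squad = json.load(f)
        # Each "data" entry carries its own paragraphs and "qas" annotations,
        # so simple concatenation keeps contexts and questions aligned.
        merged["data"].extend(squad["data"])
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)

# Placeholder file names for the two sources described above.
merge_squad(
    ["poliak_covid_qa_squad.json", "faq_covidbert_squad.json"],
    "covid_qa_combined_train.json",
)
```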

I trained a model on this dataset using the hyperparameters you specify here: https://huggingface.co/deepset/roberta-base-squad2-covid#hyperparameters. Informal tests on various Covid-19-related questions indicate that my model produces better responses than roberta-base-squad2-covid, which isn't surprising, as inspection of both datasets reveals that mine contains far more Covid-19-specific questions and answers.
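For completeness, this is roughly what that fine-tuning run looks like in FARM. It is only a sketch: the hyperparameter values below are placeholders to be replaced with the ones from the linked model card, the file names follow the merge step above, and exact signatures may differ slightly between FARM versions:

```python
from pathlib import Path

from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import SquadProcessor
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.optimization import initialize_optimizer
from farm.modeling.prediction_head import QuestionAnsweringHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import initialize_device_settings, set_all_seeds

set_all_seeds(seed=42)
device, n_gpu = initialize_device_settings(use_cuda=True)

# Placeholder hyperparameters -- substitute the values from the
# deepset/roberta-base-squad2-covid model card linked above.
base_model = "deepset/roberta-base-squad2"
batch_size = 25        # reduced, as in the cross-validation run
n_epochs = 3
learning_rate = 3e-5
max_seq_len = 384

tokenizer = Tokenizer.load(pretrained_model_name_or_path=base_model, do_lower_case=False)
processor = SquadProcessor(
    tokenizer=tokenizer,
    max_seq_len=max_seq_len,
    label_list=["start_token", "end_token"],
    metric="squad",
    train_filename="covid_qa_combined_train.json",  # combined dataset from the merge step
    dev_filename=None,
    dev_split=0.1,  # placeholder; the model card may specify a different split
    test_filename=None,
    data_dir=Path("data"),
)
data_silo = DataSilo(processor=processor, batch_size=batch_size)

model = AdaptiveModel(
    language_model=LanguageModel.load(base_model),
    prediction_heads=[QuestionAnsweringHead()],
    embeds_dropout_prob=0.1,
    lm_output_types=["per_token"],
    device=device,
)
model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=learning_rate,
    device=device,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=n_epochs,
)
trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=n_epochs,
    n_gpu=n_gpu,
    lr_schedule=lr_schedule,
    device=device,
)
trainer.train()
model.save("saved_models/roberta-covid-custom")
processor.save("saved_models/roberta-covid-custom")
```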

However, when running question_answering_crossvalidation.py with my dataset, the metric results are not as good as those observed with your dataset, or even with the baseline referenced in the paper. Here are the EM and f1 scores I obtained with my dataset: XVAL EM: 0.21554054054054053, XVAL f1: 0.4432141443807887.

Can you provide any insight as to why this would be the case? Thank you so much!

aaronbriel commented 3 years ago

I'll assume that the low scores overall, similar to what was noted in the paper, could be related to the complexity of the question/answer pairs combined with the large contexts and the absence of multiple annotations per question.
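For anyone wanting to check that hypothesis on their own data, a small diagnostic sketch over a SQuAD-format file (standard SQuAD field names assumed; the file name is the placeholder from the merge step above):

```python
import json
from statistics import mean, median

# Rough diagnostics for the hypothesis above: how long are the contexts,
# and how many answer annotations does each question carry?
with open("covid_qa_combined_train.json", encoding="utf-8") as f:
    data = json.load(f)["data"]

context_lens, answers_per_q = [], []
for article in data:
    for para in article["paragraphs"]:
        context_lens.append(len(para["context"].split()))
        for qa in para["qas"]:
            answers_per_q.append(len(qa.get("answers", [])))

print(f"contexts: {len(context_lens)}, "
      f"mean length {mean(context_lens):.0f} words, median {median(context_lens):.0f}")
print(f"questions: {len(answers_per_q)}, "
      f"mean annotations per question {mean(answers_per_q):.2f}")
```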