deepset-ai / COVID-QA

API & Webapp to answer questions about COVID-19. Using NLP (Question Answering) and trusted data sources.
Apache License 2.0

Data Augmentation #17

Open andra-pumnea opened 4 years ago

andra-pumnea commented 4 years ago

Experiment with different methods for data augmentation, report results and compare to baseline.

borhenryk commented 4 years ago

I will look into back-translation later.

stedomedo commented 4 years ago

We could also use PPDB to generate additional paraphrased questions: http://paraphrase.org/#/download

Timoeller commented 4 years ago

Any updates on creating more questions?

Maybe @HenrykBorzymowski can use the MS Azure translator here for back-translation? I heard they offer 2M characters per month for free :)

borhenryk commented 4 years ago

I have tried the UDA project (https://github.com/google-research/uda). It has a back-translation component that takes existing sentences, translates them into French and then back into English. Different temperature parameters produce different paraphrases, which increases the sample size of the existing dataset.

Unfortunately the repository is quite outdated, and the package versions it pins no longer work.

Please install these packages (with Python 2.7) and then follow the instructions in the UDA readme file to make it work:

pip install tensorflow-gpu==1.15.2
pip install tensor2tensor==1.15.2
pip install tensorflow-probability==0.7.0

The following commands back-translate the sample file provided in the back_translate directory of the UDA repository. The script automatically splits paragraphs into sentences, translates the English sentences into French, and then translates them back into English. Go to the back_translate directory and run:

bash download.sh
bash run.sh
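For anyone who wants to see the shape of that pipeline without setting up UDA, here is a minimal sketch in plain Python 3. It is not the UDA code: `split_sentences`, `back_translate`, and `toy_translate` are hypothetical names, the sentence splitter is a naive regex, and the translator is a tiny lookup table standing in for whatever real translation model or service you plug in.

```python
import re
from typing import Callable, List

# Hypothetical translator signature: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def back_translate(paragraph: str, translate: Translator, pivot: str = "fr") -> List[str]:
    """Translate each sentence English -> pivot -> English."""
    paraphrases = []
    for sentence in split_sentences(paragraph):
        pivot_text = translate(sentence, "en", pivot)
        paraphrases.append(translate(pivot_text, pivot, "en"))
    return paraphrases


# Toy stand-in for a real translation system (assumption, for illustration only).
_EN_FR = {
    "What is the incubation period?": "Quelle est la période d'incubation ?",
    "How does the virus spread?": "Comment le virus se propage-t-il ?",
}
_FR_EN = {
    "Quelle est la période d'incubation ?": "What is the incubation time?",
    "Comment le virus se propage-t-il ?": "How is the virus spread?",
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    table = _EN_FR if (src, tgt) == ("en", "fr") else _FR_EN
    return table[text]


if __name__ == "__main__":
    text = "What is the incubation period? How does the virus spread?"
    for paraphrase in back_translate(text, toy_translate):
        print(paraphrase)
```

Passing the translator in as a function keeps the splitting and round-trip logic independent of the backend, so the same loop would work with a local model or a hosted translation API.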

I tried several temperature settings (0.3, 0.5, 0.7, 0.9) on the eval_question_similarity_en.csv table and found that lower temperatures (0.3 or 0.5) work better for our case. With 0.7 and 0.9 we get quite a lot of random translations :D
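To illustrate why higher temperatures give more random output, here is a small sketch. It assumes the temperature acts as the usual softmax temperature during decoding (logits divided by the temperature before sampling); I have not checked exactly how UDA applies it internally. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Made-up scores for three candidate next tokens (illustration only).
    logits = [2.0, 1.0, 0.1]
    for temperature in (0.3, 0.5, 0.7, 0.9):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs])
```

At low temperatures almost all probability mass sits on the best candidate, so sampled translations stay close to the most likely one. As the temperature rises, the mass spreads to less likely tokens, which gives more varied but also more erratic sentences.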

I have attached the results in case anyone is interested :) This could help us get more variance in our sentences and make us less dependent on specific words that appear in our training set.

eval_question_similarity_back_trans.xlsx