anirbansaha96 closed this issue 4 years ago.
Hey @anirbansaha96 ,
In general, the process of training a QA model consists of the following steps:
1) Choose a pretrained language model (e.g. roberta-base or albert-base-v2)
2) Optionally: run language model adaptation (i.e. continue "pretraining" with the MLM objective on your domain corpus)
3) Fine-tune your model for QA on a large dataset (e.g. SQuAD or Natural Questions)
4) Optionally: continue fine-tuning on a smaller domain-specific QA dataset. From our own experience and a recent paper, 0.5k-2k samples are usually enough to get great performance even on very specialized domains.
From our experience, 4) works better than 2). While 2) works pretty well for other tasks, where you afterwards train on a rather small downstream dataset from the domain, it is less effective in QA because in step 3) you train "again" on mostly Wikipedia-style language.
So for your two mentioned cases this means:
1) You can try language model adaptation, but I wouldn't expect big gains there.
2) I would take a model that was already trained on SQuAD and then continue fine-tuning on your QA pairs (see Tutorial 2 for an example).
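Option 2) can be sketched roughly as follows. This is a minimal sketch in the spirit of Tutorial 2, not a definitive recipe: the data directory, filename, and save_dir are placeholders for your own files, and the FARMReader import path may differ between Haystack versions.

```python
from haystack.reader.farm import FARMReader  # import path as used in early Haystack tutorials

# Start from a model that was already fine-tuned on SQuAD ...
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

# ... and continue fine-tuning on your own domain QA pairs (SQuAD-format JSON).
reader.train(
    data_dir="data/my_domain",          # placeholder directory
    train_filename="my_qa_pairs.json",  # placeholder file with your domain labels
    use_gpu=True,
    n_epochs=2,
    save_dir="my_domain_reader",        # the fine-tuned model is saved here
)
```

The saved directory can afterwards be passed back to `FARMReader(model_name_or_path=...)` for inference.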
How many QA pairs do you have for Domain 2?
I have about 1200 pairs, and I expect that to grow to about 2000 eventually.
> From our experience 4) works better than 2). While 2) works pretty well for other tasks where you train afterwards on rather small downstream dataset from the domain, this is challenging in QA where you train in 3) "again" on mostly Wikipedia style language.
Once the model from step 2) has learned a bit more of the domain semantics, would that help performance at least marginally? I ask because I want to use a relatively small model like TinyBERT or BERT-small.
> I would take a model that was already trained on SQuAD and then continue fine-tuning on your QA pairs (see Tutorial 2 for an example)
I have already given this a try. The problem, as I described in this issue, is that I only have Question-Answer pairs: I do have a "context" for each question, but the answer is equal to the context itself. How do I turn this into a SQuAD-style JSON file for fine-tuning? Doing it manually for 2000 question-answer pairs with the annotation tool is difficult for me.
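Since each answer is its own context, one pragmatic workaround is to generate the SQuAD-style JSON programmatically, labeling the whole context as the answer span (answer_start = 0). A minimal sketch follows; `qa_pairs_to_squad` is a hypothetical helper, not part of Haystack. One caveat: when every answer covers its entire context, the model learns something closer to "pick the right context" than true span extraction, so gains may be limited.

```python
import json

def qa_pairs_to_squad(pairs, title="my_domain"):
    """Build a SQuAD v2-style dict from (question, answer) pairs.

    The answer text doubles as the context, so the labeled span is
    the entire context, starting at character 0.
    """
    paragraphs = []
    for i, (question, answer) in enumerate(pairs):
        paragraphs.append({
            "context": answer,
            "qas": [{
                "id": str(i),
                "question": question,
                "is_impossible": False,
                "answers": [{"text": answer, "answer_start": 0}],
            }],
        })
    return {"version": "v2.0", "data": [{"title": title, "paragraphs": paragraphs}]}

pairs = [("What does Article 1 cover?",
          "Article 1 defines the name and territory of the Union.")]
squad_dict = qa_pairs_to_squad(pairs)
# Dump squad_dict to e.g. train.json and point the training script at it.
print(json.dumps(squad_dict)[:60])
```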
What would your suggestion be for this?
Also, if I fine-tune a model with Masked Language Modeling following this link, how exactly do we use that model with Haystack?
How exactly do we use it with this statement: `reader = FARMReader(model_name_or_path="deepset/bert-large-uncased-whole-word-masking-squad2", use_gpu=False)`?
Thank you @sshleifer for giving permission to loop you in (reference: Twitter).
I wanted to know whether a model fine-tuned with the simpletransformers library (or an alternative Masked Language Modeling implementation on the transformers library), for example following the link here, would help improve QA performance, and what would be a good way to use such an MLM fine-tuned model for Question Answering.
@tholor, how would a language model fine-tuned with Masked Language Modeling using the simpletransformers library (@ThilinaRajapakse) be used in the Haystack implementation? This would help with domain-specific QA model fine-tuning.
As mentioned above, language model adaptation is only an optional step before training downstream on a QA dataset. So in order to use such a domain-specific language model in Haystack, you will still need to train it on a QA dataset like SQuAD, Natural Questions or similar. You could run QA training in Haystack as in Tutorial 2, or in FARM with this script, exchanging "lang_model" for your domain-specific one.
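To make that flow concrete, here is a hedged sketch of the Haystack side. All paths are hypothetical, and whether a plain MLM checkpoint loads directly into FARMReader can depend on the Haystack/FARM version; FARM's question_answering example script, where you swap the `lang_model` value, is the more direct route for this step.

```python
from haystack.reader.farm import FARMReader  # import path as in early Haystack tutorials

# Initialize from the domain-adapted language model saved after MLM training
# (hypothetical local path), then train for QA on SQuAD v2.
reader = FARMReader(model_name_or_path="path/to/domain_adapted_model", use_gpu=True)
reader.train(
    data_dir="data/squad20",
    train_filename="train-v2.0.json",
    use_gpu=True,
    n_epochs=1,
    save_dir="domain_qa_model",
)

# The saved directory is then used like any QA model name, e.g. in place of
# "deepset/bert-large-uncased-whole-word-masking-squad2" above:
reader = FARMReader(model_name_or_path="domain_qa_model", use_gpu=False)
```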
To stress this again, from our experience the effect of LM adaptation is negligible if you train (only) on SQuAD afterwards (which is Wikipedia language). Your time might be better invested in a few domain-specific QA labels.
Dear @tholor, in case I want to extend this task to QA on a German biomedical corpus, do you have any suggestions as to how to proceed with it? Thank you very much.
@sbhttchryy we are working on support for German QA and will add the solution here as well as in the issue you created in FARM directly. We should have a version working within the next 2 days.
We have trained an XLM-R large model on SQuAD v2 and evaluated it on the German portions of XQuAD and MLQA. We uploaded it to the HF model hub at: https://huggingface.co/deepset/xlm-roberta-large-squad2 (model card PR still pending)
This issue is not specific to Haystack, but I just wanted to know whether this is something that I can achieve using Haystack.
I wish to train two domain-specific models:
Domain 1: Constitution and related legal documents.
Domain 2: Technical and related documents.
For Domain 1, I have access to a text corpus with texts from the constitution and no question-context-answer tuples. For Domain 2, I have access to Question-Answer pairs.
Is it possible to fine-tune a lightweight BERT model for Question Answering using just the data mentioned above?
If yes, what are the resources to achieve this task? Can I fine-tune a model like bert_uncased_L-2_H-128_A-2/1 for question answering using the above data? Some examples from the huggingface/models library would be mrm8488/bert-tiny-5-finetuned-squadv2, sshleifer/tiny-distilbert-base-cased-distilled-squad, and twmkn9/albert-base-v2-squad2.