deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Fine-tuning the Reader on domain data #192

Closed anirbansaha96 closed 4 years ago

anirbansaha96 commented 4 years ago

This issue is not specific to Haystack, but I just wanted to know whether this is something that I can achieve using Haystack.

I wish to train two domain-specific models:

- Domain 1: Constitution and related legal documents. Here I have access to a text corpus with texts from the constitution, but no question-context-answer tuples.
- Domain 2: Technical and related documents. Here I have access to question-answer pairs.

Is it possible to fine-tune a lightweight BERT model for question answering using just the data mentioned above?

If yes, what are the resources to achieve this task? Can I fine-tune a model like bert_uncased_L-2_H-128_A-2/1 for question-answering using the above data?

Some examples from the huggingface/models library would be mrm8488/bert-tiny-5-finetuned-squadv2, sshleifer/tiny-distilbert-base-cased-distilled-squad, and twmkn9/albert-base-v2-squad2.

tholor commented 4 years ago

Hey @anirbansaha96 ,

In general, the process of training a QA model consists of the following steps:

1. Choose a pretrained language model (e.g. roberta-base or albert-base-v2).
2. Optionally: run language model adaptation (i.e. continue "pretraining" with the MLM objective on your domain corpus).
3. Fine-tune your model for QA on a large dataset (e.g. SQuAD or Natural Questions).
4. Optionally: continue fine-tuning on a smaller, domain-specific QA dataset. From our own experience and a recent paper, 0.5k-2k samples are usually enough to get great performance even on very specialized domains.

From our experience, 4) works better than 2). While 2) works pretty well for other tasks, where you afterwards train on a rather small downstream dataset from the domain, it is challenging in QA because in 3) you train "again" on mostly Wikipedia-style language.

So for your two mentioned cases this means:

1. You can try language model adaptation, but I wouldn't expect big gains there.
2. I would take a model that was already trained on SQuAD and then continue fine-tuning on your QA pairs (see Tutorial 2 for an example; a sketch follows below).
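
A rough sketch of what that continued fine-tuning looks like with Haystack's FARMReader, loosely following Tutorial 2 (the data directory, file name and hyperparameters are placeholders for your own SQuAD-style labels):

```python
from haystack.reader.farm import FARMReader

# Start from a model that is already fine-tuned on SQuAD.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

# Continue fine-tuning on your own domain-specific, SQuAD-style annotations.
# "domain_qa.json" is a placeholder for your exported label file.
reader.train(
    data_dir="data/domain_qa",
    train_filename="domain_qa.json",
    use_gpu=True,
    n_epochs=2,
    save_dir="my_domain_reader",
)
```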

How many QA pairs do you have for Domain 2?

anirbansaha96 commented 4 years ago

> How many QA pairs do you have for Domain 2?

I have about 1200 pairs, and I would expect that to increase to 2000 eventually.

> From our experience, 4) works better than 2). While 2) works pretty well for other tasks, where you afterwards train on a rather small downstream dataset from the domain, it is challenging in QA because in 3) you train "again" on mostly Wikipedia-style language.

If the model learns a bit more of the domain semantics in stage 2, would that help performance (even marginally), given that I want to use a relatively small model like TinyBERT or BERT-small?

> I would take a model that was already trained on SQuAD and then continue fine-tuning on your QA pairs (see Tutorial 2 for an example)

I have already given this a try. The problem, as I described in this issue, is that I only have access to question-answer pairs: I have a context, but the answer would be equal to the context itself. How do I turn this into a SQuAD-style JSON file for fine-tuning? Doing it manually for 2000 question-answer pairs with the annotation tool is difficult for me.
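
For reference, the SQuAD-style structure such a conversion would need to produce looks roughly like this; a minimal sketch in plain Python, assuming each pair's answer is the full context (the example pair and output file name are placeholders):

```python
import json

# Placeholder input: for each pair the "answer" is the full context itself.
qa_pairs = [
    {"question": "What does Article 1 state?", "context": "Article 1 states that ..."},
]

squad = {"version": "v2.0", "data": []}
for i, pair in enumerate(qa_pairs):
    squad["data"].append({
        "title": f"doc_{i}",
        "paragraphs": [{
            "context": pair["context"],
            "qas": [{
                "id": str(i),
                "question": pair["question"],
                "is_impossible": False,
                # The answer span is the whole context, so it starts at offset 0.
                "answers": [{"text": pair["context"], "answer_start": 0}],
            }],
        }],
    })

with open("domain_qa.json", "w", encoding="utf-8") as f:
    json.dump(squad, f, ensure_ascii=False)
```

Whether answers that span the entire context give an extractive reader a useful training signal is, of course, a separate question.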

What would your suggestion be for this?

anirbansaha96 commented 4 years ago

Also, if I fine-tune a model with masked language modeling using this link, how exactly do we use that model with Haystack?

How exactly do we use it with this statement: reader = FARMReader(model_name_or_path="deepset/bert-large-uncased-whole-word-masking-squad2", use_gpu=False)?

anirbansaha96 commented 4 years ago

Thank you @sshleifer for giving permission to loop you in (reference: Twitter).

I wanted to know whether a model fine-tuned with the simpletransformers library (or an alternative masked language modeling implementation on the transformers library), for example using the link here, would help improve QA performance, and what would be a good way to use such an MLM-fine-tuned model for question answering.
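
For reference, a minimal sketch of what such MLM adaptation could look like with the plain transformers library (not simpletransformers); the base model, corpus file and hyperparameters are placeholders:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One plain-text sentence/paragraph per line from the domain corpus (placeholder path).
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="domain_corpus.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="domain_adapted_lm",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
trainer.save_model("domain_adapted_lm")
tokenizer.save_pretrained("domain_adapted_lm")
```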

anirbansaha96 commented 4 years ago

@tholor, how would a language model fine-tuned with masked language modeling via the simpletransformers library (@ThilinaRajapakse) be used in Haystack? This would help with domain-specific QA model fine-tuning.

tholor commented 4 years ago

As mentioned above, language model adaptation is only an optional step before training downstream on a QA dataset. So in order to use such a domain-specific language model in Haystack, you will still need to train it on a QA dataset like SQuAD, Natural Questions or similar. You could run QA training in Haystack as in Tutorial 2, or in FARM with this script, exchanging "lang_model" for your domain-specific one.
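
If you go the Haystack route, swapping in the domain-adapted checkpoint could look roughly like this; a sketch only, assuming FARMReader attaches a freshly initialized QA head when given a plain MLM-adapted checkpoint (the local path and file names are placeholders):

```python
from haystack.reader.farm import FARMReader

# Placeholder path to the locally saved MLM-adapted checkpoint.
reader = FARMReader(model_name_or_path="domain_adapted_lm", use_gpu=True)

# Train the QA head on SQuAD (or Natural Questions) before any domain-specific labels.
reader.train(
    data_dir="data/squad20",
    train_filename="train-v2.0.json",
    use_gpu=True,
    n_epochs=2,
    save_dir="domain_adapted_qa_model",
)
```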

To stress this again, from our experience the effect of LM adaptation is negligible if you train (only) on SQuAD afterwards (which is Wikipedia language). Your time might be better invested in a few domain-specific QA labels.

sbhttchryy commented 4 years ago

Dear @tholor, in case I want to extend this task to QA on a German biomedical corpus, do you have any suggestions as to how to proceed with it? Thank you very much.

Timoeller commented 4 years ago

@sbhttchryy we are working on support for German QA and will add the solution here as well as in the issue you created in FARM directly. We should have a version working within the next 2 days.

Timoeller commented 4 years ago

We have trained an XLM-R large on SQuAD v2 and evaluated it on the German parts of XQuAD and MLQA. We uploaded it to the HF model hub at https://huggingface.co/deepset/xlm-roberta-large-squad2 (model card PR still pending).
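
Loading it as the Reader in Haystack should then work like any other model from the model hub; a minimal sketch:

```python
from haystack.reader.farm import FARMReader

# Multilingual reader trained on SQuAD v2; handles German questions and contexts.
reader = FARMReader(model_name_or_path="deepset/xlm-roberta-large-squad2", use_gpu=True)
```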