What kind of text file require for fine-tuning the model?

deepset-ai / FARM

Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

https://farm.deepset.ai

Apache License 2.0


What kind of text file require for fine-tuning the model? #244

Closed ankush20m closed 4 years ago

ankush20m commented 4 years ago

Hello,

I want to fine-tune the language model on domain-specific tasks. Could anyone tell me what kind of custom text file is required for fine-tuning the model? Would it be okay to put all the sentences from my domain in a file, one per line, and use that for fine-tuning?

Timoeller commented 4 years ago

Hey @ankush20m, yes, you are right: each line should contain a single sentence. You should additionally separate documents with a single blank line. Here you can find some small sample files in the right format. Hope that helps?
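For readers preparing such a file, here is a minimal sketch of the layout described above (one sentence per line, a single blank line between documents). It uses only the Python standard library; the helper names `write_lm_corpus` and `read_lm_corpus` are hypothetical and not part of FARM, and the sketch assumes the text has already been split into sentences.

```python
from pathlib import Path
from typing import Iterable, List


def write_lm_corpus(documents: Iterable[List[str]], path: str) -> None:
    """Write one sentence per line, with a single blank line between documents."""
    blocks = []
    for sentences in documents:
        # Collapse internal whitespace so that each sentence stays on one line.
        cleaned = [" ".join(s.split()) for s in sentences]
        cleaned = [s for s in cleaned if s]  # drop empty sentences
        if cleaned:
            blocks.append("\n".join(cleaned))
    Path(path).write_text("\n\n".join(blocks) + "\n", encoding="utf-8")


def read_lm_corpus(path: str) -> List[List[str]]:
    """Read the same layout back into a list of documents (lists of sentences)."""
    text = Path(path).read_text(encoding="utf-8")
    return [block.split("\n") for block in text.strip().split("\n\n") if block]
```

Reading the file back with `read_lm_corpus` is a quick way to check that document boundaries survived, for example that no sentence contained a stray line break.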

Can I shoot back a question? : ) For what use case and in which domain do you want to finetune your Language Model?

ankush20m commented 4 years ago

Hi @Timoeller, thanks for the help :) I want to fine-tune the model for a domain that deals with legal cases. Do you have any ideas or resources that could help me further with this task?

One more question: can I fine-tune the model on a Sentence Pair Classification task using a custom corpus rather than GLUE data?

Timoeller commented 4 years ago

Mhh, I am unsure if we mean the same thing here. Let me clarify just to be sure:

So with Language Models there are two types of fine-tuning:

1. Adapting/fine-tuning the LM towards domain language. This is done in a self-supervised way, where no labelled data is required, just plain text.
2. Fine-tuning the LM for a downstream task like NER, document classification or one of the other tasks in GLUE. For this you need labelled data.
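To illustrate why the first type needs no labelled data, here is a simplified sketch of one common self-supervised objective, masked language modelling: tokens are hidden at random and the hidden tokens themselves become the training targets. This is an assumption-laden toy, not FARM's code. The function name `make_masked_example` is hypothetical, whitespace tokens stand in for subword tokens, and the extra replacement rules that BERT-style training applies are left out.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # placeholder token, named after BERT-style vocabularies


def make_masked_example(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Turn plain tokens into (inputs, targets) without any human labels.

    Each token is masked with probability mask_prob. The target at a masked
    position is the original token; unmasked positions get None.
    """
    rng = random.Random(seed)  # seeded so that the example is repeatable
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)  # the "label" comes from the text itself
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets
```

The second type differs in that the targets cannot be derived from the text: someone has to supply a label for each example.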

I thought you first wanted to do 1., and that you then possibly have some labeled data and a use case for 2. Can you elaborate more on

"I want to fine-tune the model for a domain that deals with legal cases."

then I might be able to help better.

ankush20m commented 4 years ago

Okay. Let me elaborate more. I am trying to solve 2 cases here and will check which one performs better.

1. I have a plain text file with one sentence per line. Using this, I will try fine-tuning the model in a self-supervised way, just as you said. The fine-tuned model will then be used for the sentence similarity task. I have tried this many times, but the similarity results fail every time.
2. I created custom labeled data consisting of sentence pairs with a target label, i.e. whether the two sentences are similar or not, just like the Microsoft Research Paraphrase Corpus (MRPC). Using this, I would like to perform sentence pair classification, fine-tune the model in this way, and then use it for sentence similarity tasks.
3. A third case might be to create labeled data containing a single sentence and its class, and then apply the same method as explained in case 2.
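As a rough illustration of the labeled data in case 2, sentence pairs with a 0/1 label could be stored as a tab-separated file. The sketch below uses only the standard library. The column names `text_a`, `text_b` and `label` and both helper names are assumptions made for this example, not a format confirmed in this thread, so they may need to be adjusted to whatever the training code expects.

```python
import csv
from typing import Iterable, List, Tuple

Pair = Tuple[str, str, int]  # (first sentence, second sentence, 0/1 label)
COLUMNS = ["text_a", "text_b", "label"]  # hypothetical column names


def write_pairs_tsv(pairs: Iterable[Pair], path: str) -> None:
    """Write sentence pairs and binary labels to a TSV file with a header row."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(COLUMNS)
        for text_a, text_b, label in pairs:
            if label not in (0, 1):
                raise ValueError(f"label must be 0 or 1, got {label!r}")
            writer.writerow([text_a, text_b, label])


def read_pairs_tsv(path: str) -> List[Pair]:
    """Read the TSV back into (text_a, text_b, label) tuples."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [(row["text_a"], row["text_b"], int(row["label"])) for row in reader]
```

For case 3, the same idea applies with a single text column and a class label instead of a pair.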

I hope that makes my point clear. Could you help me with the above?

Timoeller commented 4 years ago

Nice, thanks for clarifying. This is something we also actively work on right now.

1. We have used the representation of Next Sentence Prediction to calculate document/sentence similarity. We observed that if the data for pretraining the LM and the actual similarity task are quite different, the similarity value coming from next sentence prediction is mostly nonsense. You adapted the LM to the specific domain, so it seems strange to me that it also fails. By the way, what do you mean by "fails": complete nonsense or just bad results?
2. Do you have a binary classification problem or a score/probability for a pair of sentences? We are currently working on binary classification of text pairs in a separate branch, using the MSMARCO dataset instead of MRPC. See an example here. You could try working on that branch or wait for it to be merged into master.
3. This sounds like vanilla text classification to me, which is covered in the master branch here.

So yes, we can totally help. Maybe start off by first adapting your model to your text domain and then use the solution from 2. in combination with your custom labeled data.

ankush20m commented 4 years ago

Thanks for sharing your knowledge, @Timoeller :) I visited your company's website and saw that you are working on the same kind of task I am interested in. I found LegalBERT there; did you train that model on legal cases?

To continue our conversation, on the 2nd point: yes, I use binary classification where the labels are 0 and 1, and yes, I looked at the snippet you shared; I think it will help me.

Further, I will follow your suggestions and try to get my hands dirty by implementing transformer-based models :) If any issues come up, I will share them with you.

ankush20m commented 4 years ago

And on the 1st point: by "fails" I mean that I am not getting the decent results I expected. This may be because I used a very small amount of data for fine-tuning.

Timoeller commented 4 years ago

About LegalBERT: we got data from a large German legal database and adjusted the model accordingly. For some tasks this improves performance, for other tasks it doesn't (especially text classification). Though I think the true interactions between fine-tuning and downstream task performance are much richer. It can depend on the training set size, the task difficulty, the data used for adapting the model to the domain (as you stated, you did not use much) and many other factors.

Always happy to help. If anything comes up please open another dedicated issue so others can better find it. Closing this now, but feel free to update me on the progress.