Closed: vr25 closed this issue 5 years ago
Hi,
For language model fine-tuning, we used this script: https://github.com/huggingface/transformers/blob/master/examples/run_lm_finetuning.py
Thank you for pointing out the script.
I am still a little confused, so let me explain my question further. I am actually looking for the script used to create the language_model that has been further pre-trained on Reuters TRC2.
I think I should be looking at one of the following, but I am not sure:
a) run_generation.py
b) create_pretraining_data.py
c) run_pretraining.py
Since further pre-training is used to create a domain-specific BERT, I was wondering whether this is a supervised or an unsupervised task (like pre-training from scratch, e.g. the original BERT pre-trained on Wikipedia). I ask because line 145 in run_lm_finetuning.py requires labels: labels = inputs.clone()
As per what you pointed out in another issue about pre-training: "The only format requirement for language model training is that sentences should be separated by one new-line and documents by two new-lines. This can, of course, be changed in the language model pre-training script."
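If I understand that correctly, a training file would look something like the sketch below (the file name, sentences, and snippet are placeholders of mine, not from the FinBERT repo):

```python
# Hypothetical example: writing a raw-text corpus in the format described
# above: one sentence per line, documents separated by a blank line.
documents = [
    ["First sentence of document one.", "Second sentence of document one."],
    ["Only sentence of document two."],
]

with open("trc2_corpus.txt", "w", encoding="utf-8") as f:  # placeholder file name
    for doc in documents:
        f.write("\n".join(doc))  # one new-line between sentences
        f.write("\n\n")          # two new-lines between documents
```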
Thanks!
Fine-tuning the language model on a financial corpus is an unsupervised task. The labels you mention are the true values of the masked tokens during training, so there is no need to provide labels for LM fine-tuning.
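To make that concrete, here is a simplified sketch of the masking step in run_lm_finetuning.py (details such as the ignore index and the random-replacement branch vary between versions): the labels are just a copy of the inputs, and the loss is computed only on the positions that get masked.

```python
import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """Simplified sketch of masked-LM label creation; no external labels needed."""
    labels = inputs.clone()

    # Randomly choose ~15% of the token positions to mask.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix).bool()

    # Ignore every non-masked position in the loss (-100 is the ignore index
    # of PyTorch's CrossEntropyLoss; older scripts used -1).
    labels[~masked_indices] = -100

    # Replace masked inputs with the [MASK] token id. The real script also
    # swaps some of them for random tokens or leaves them unchanged.
    inputs[masked_indices] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
    return inputs, labels
```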
Hi,
Thank you for making the code available.
As per the readme file, I understand that there are two models, and that finbert_training.ipynb is used to load the language_model and fine-tune it on Financial Phrasebank.
I was wondering if you could also make available the script used to further pre-train the language_model.
Thanks!