Closed: vr25 closed this issue 5 years ago
Hi,
For language model fine-tuning, we used this script: https://github.com/huggingface/transformers/blob/master/examples/run_lm_finetuning.py
Thank you for pointing out the script.
I am still a little confused, so let me explain my question further. I am actually looking for the script used to create the language_model that has been further pre-trained on Reuters TRC2.
I think I should be looking at one of the following, but I am not sure:
a) run_generation.py
b) create_pretraining_data.py
c) run_pretraining.py
Since further pre-training is used to create a domain-specific BERT, I was wondering whether this is a supervised or an unsupervised task (like pre-training from scratch, e.g. the original BERT pre-trained on Wikipedia). I ask because line 145 in run_lm_finetuning.py requires labels: labels = inputs.clone()
As per what you pointed out in another issue about pre-training: "The only format requirement for language model training is that sentences should be separated by one new-line and documents by two new-lines. This can, of course, be changed in the language model pre-training script."
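If I understand that correctly, a training file would look something like the sketch below (the file name, sentences, and snippet are placeholders of mine, not from the FinBERT repo):

```python
# Hypothetical example: writing a raw-text corpus in the format described
# above: one sentence per line, documents separated by a blank line.
documents = [
    ["First sentence of document one.", "Second sentence of document one."],
    ["Only sentence of document two."],
]

with open("trc2_corpus.txt", "w", encoding="utf-8") as f:  # placeholder file name
    for doc in documents:
        f.write("\n".join(doc))  # one new-line between sentences
        f.write("\n\n")          # two new-lines between documents
```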
Thanks!
Fine-tuning the language model on a financial corpus is an unsupervised task. The labels you mention are the true values of the masked tokens during training, so there is no need to provide labels for LM fine-tuning.
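To make that concrete, here is a simplified sketch of the masking step in run_lm_finetuning.py (details such as the ignore index and the random-replacement branch vary between versions): the labels are just a copy of the inputs, and the loss is computed only on the positions that get masked.

```python
import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """Simplified sketch of masked-LM label creation; no external labels needed."""
    labels = inputs.clone()

    # Randomly choose ~15% of the token positions to mask.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix).bool()

    # Ignore every non-masked position in the loss (-100 is the ignore index
    # of PyTorch's CrossEntropyLoss; older scripts used -1).
    labels[~masked_indices] = -100

    # Replace masked inputs with the [MASK] token id. The real script also
    # swaps some of them for random tokens or leaves them unchanged.
    inputs[masked_indices] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
    return inputs, labels
```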
Hi,
Thank you for making the code available.
As per the readme file, I understand that there are two models, and that finbert_training.ipynb is used to load the language_model and fine-tune it on Financial Phrasebank.
I was wondering if you could also make available the script used to further pre-train the language_model.
Thanks!