ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Large models don't converge while fine-tuning #199

Closed AliOsm closed 4 years ago

AliOsm commented 4 years ago

I tried to fine-tune the XLM-RoBERTa Large model in the Google Colab environment for 3 epochs using a 1e-5 learning rate, a batch size of 16, 2 gradient accumulation steps, and 120 warmup steps. But the loss didn't converge, and the model gives random predictions after fine-tuning.

I used the sentence pair minimal example as a starting point.
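
Roughly, the setup looked like this (a minimal sketch reconstructed from the hyperparameters above; the toy DataFrame is just a placeholder for my real sentence-pair data):

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Sentence-pair data in the format the minimal example uses: text_a, text_b, labels
train_df = pd.DataFrame(
    [["first sentence a", "first sentence b", 1],
     ["second sentence a", "second sentence b", 0]],
    columns=["text_a", "text_b", "labels"],
)

model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-large",
    args={
        "learning_rate": 1e-5,
        "train_batch_size": 16,
        "gradient_accumulation_steps": 2,
        "warmup_steps": 120,
        "num_train_epochs": 3,
    },
)

model.train_model(train_df)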

Do you have any idea?

kinoute commented 4 years ago

Did you try with the default values for these hyperparameters? Are you saving checkpoints every X steps? If so, how many?

AliOsm commented 4 years ago

Hello @kinoute,

I tried with the default values and with many other values. I'm saving the checkpoints after each epoch only, and I'm using the last epoch's checkpoint to predict. The problem is that the model loss stays at 0.7xxx during the entire fine-tuning run; it is not converging at all.

kinoute commented 4 years ago

I only tested XLM-RoBERTa once, but IIRC the metrics were almost 0 for the early checkpoints (I did binary text classification).

I will see tomorrow if I can find my hyperparameters for this model.

I don't know the size of your dataset, but you could enable checkpoints every few steps to see more clearly how the training loss and eval loss are doing. After more than a hundred experiments with this library and DistilBERT, what I know is that I could not get a good model if I only saved/evaluated every epoch and picked the final model. For a dataset of around 10k entries, I set the batch size to 16 and a checkpoint every 50 steps.

Because of the batch size there are some ups and downs in the eval loss, but it decreases overall. At the end I pick the best checkpoint according to the eval loss by looking at the training CSV file.

You don't need to do that anymore since early stopping was recently added.

So to sum up, I suggest you save/evaluate every X steps depending on your dataset size, and enable early stopping with a moderate patience (or pick the best model manually by looking at the training CSV file to test your model), roughly along the lines of the args sketched below.
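
A sketch of the relevant args, assuming the usual simpletransformers arg names (the model name, step count, patience, and train_df/eval_df are placeholders to adjust for your data):

from simpletransformers.classification import ClassificationModel

model_args = {
    "evaluate_during_training": True,        # log eval metrics while training
    "evaluate_during_training_steps": 50,    # save/evaluate a checkpoint every 50 steps
    "save_eval_checkpoints": True,
    "use_early_stopping": True,
    "early_stopping_patience": 3,            # stop after 3 evaluations without improvement
}

model = ClassificationModel("distilbert", "distilbert-base-uncased", args=model_args)
model.train_model(train_df, eval_df=eval_df)  # eval_df is needed for evaluation during training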


ThilinaRajapakse commented 4 years ago

You can also watch the training loss on W&B or TensorBoard to check if anything weird is going on. I recently noticed that training some models for too long suddenly makes the model predict only 0s or 1s. I haven't had the time to investigate this issue properly yet, but maybe this is the same thing.
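
For reference, pointing the training run at W&B and/or TensorBoard is just a matter of model args, roughly like this (a sketch; the project name and directory are placeholders):

from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-large",
    args={
        "wandb_project": "xlmr-debug",   # placeholder project; logs running loss and eval metrics to Weights & Biases
        "tensorboard_dir": "runs/",      # TensorBoard event files are written here
    },
)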

AliOsm commented 4 years ago

The problem is that the running loss during fine-tuning is not converging. It is not only about the test data; the model didn't perform well on the training data either.

The size of the dataset is about 45K training examples.

kinoute commented 4 years ago

I understand that. But you should add checkpoints within epochs. This won't improve your model, but it will give you more insight into what's really going on, since you will get far more metrics at different points in the process.

I'm about to re-run XLM Roberta on my binary classifier and I will report what worked for me.

ThilinaRajapakse commented 4 years ago

It would help to know whether the training loss decreases at any point before going crazy. That can be observed most easily by using one of the visualization options.

kinoute commented 4 years ago

@AliOsm I checked on my binary classifier very quickly. Here are my code/args:

from simpletransformers.classification import ClassificationModel

# `weights` is a list of per-class weights computed elsewhere (not shown)
model = ClassificationModel('xlmroberta', 'xlm-roberta-base',
                            weight=weights,
                            args={"weight_decay": 0,
                                  "train_batch_size": 8,
                                  "eval_batch_size": 8,
                                  "sliding_window": False,
                                  "max_seq_length": 512,
                                  "num_train_epochs": 3,
                                  "evaluate_during_training_steps": 200,
                                  "use_early_stopping": False,
                                  "fp16": False},
                            use_cuda=True)
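
(Training itself was presumably launched with something like model.train_model(train_df, eval_df=eval_df), with "evaluate_during_training": True set as well, since the per-checkpoint eval metrics shown below require evaluating during training.)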

I couldn't use either the large model on Colab or the base model with a batch_size of 16 because of GPU memory issues. Therefore, I increased the checkpoint interval (to 200 steps).

Here are the results:

[Screenshots: per-checkpoint training/eval metrics from this run]

We can see some weird behavior indeed. If I had trained for only 2 epochs (checkpoint 1456) and picked the final model, I would literally have ended up with a random model.

But thanks to the checkpoints feature, we can see there is at least something to use and to test. Checkpoint 1000 (after a bit more than one epoch) has the smallest eval_loss and could be used for testing.

I'm not really concerned about the training_loss bouncing around that much, especially when using mini-batch gradient descent on a fine-tuning task. But yes, this architecture seems hard to master. I had the same problem with previous XLM models.

ThilinaRajapakse commented 4 years ago

The checkpoints at steps 1200, 1400, and 1456 are curious indeed. The model seems to resort to giving pretty much the same label for all inputs. This happened to me a few times as well, but my investigations didn't really yield anything useful. However, it seems that this issue is not specific to Simple Transformers: I observed this behaviour even when using the Transformers library directly.

Do you guys have any insights on this?

antonyscerri commented 4 years ago

I was experiencing a similar issue trying to use xlm-roberta-large. I've been playing around a little and made some progress. With the base model and a small dataset, run_glue from transformers (the SST-2 task, modified slightly) worked with all the defaults (including not specifying do_lower): F1 0.74. If you turn do_lower on, it still works but not as well: F1 0.66. With the large model without do_lower I get no progress: F1 0.0. With do_lower it shows some drop in loss and an F1 of 0.08. For results comparable to the base model, I had to increase the batch size to 16 from the default of 8, and then I got an F1 of 0.68.

Can you tell me whether you were lower-casing or not, and what batch size you used? If you have a chance, can you try some of the combinations I mention and see if anything works for you? I've been looking through the code and the pretrained weights to see if anything else stands out; nothing so far.

antonyscerri commented 4 years ago

Experimenting further with larger epoch numbers, the batch size of 16 without do_lower worked, achieving an F1 of 0.7 with a steady drop in loss. I also confirmed that the vocab (SentencePiece BPE) is case sensitive, so the do_lower flag should not be required. It looks like you need to get the other parameters (and possibly your dataset) right to see results. It may also be worth checking that your data is producing a mix of labels for the training set.

antonyscerri commented 4 years ago

In my case, carefully adjusting the learning rate (with the existing scheduler) along with the number of epochs (as well as the earlier increase in batch size) allowed me to get much better results (beating those with the base model to date). So there doesn't appear to be anything fundamentally wrong with the pre-trained model or the core model code. It seems you need to do a broader sweep of the parameters in your case (assuming no data issues, etc.).

ThilinaRajapakse commented 4 years ago

Thank you for your investigations and for sharing what you found, @antonyscerri !

I suppose the bottom line is that the larger models require a little more finesse to get good results compared to the base models. I'm still curious as to why the models seem to go "screw it!" and just output the same label. But I guess it's safe to postpone that investigation for now as I am a little tied up these days.

antonyscerri commented 4 years ago

I've got a basic grid search going at the moment, and it does seem, in my case at least, that there is a degree of instability which leads to poor results. I've not gone as far as this yet, but freezing/unfreezing different layers may well help with some of this.
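
For the record, the sweep is nothing fancy, roughly along these lines (a sketch with illustrative values; train_df and eval_df stand in for my actual data):

from simpletransformers.classification import ClassificationModel

results = {}
for lr in (2e-5, 1e-5, 5e-6):
    for epochs in (3, 5, 10):
        model = ClassificationModel(
            "xlmroberta",
            "xlm-roberta-large",
            args={
                "learning_rate": lr,
                "num_train_epochs": epochs,
                "train_batch_size": 16,
                "output_dir": f"outputs/lr{lr}_ep{epochs}/",
                "overwrite_output_dir": True,
            },
        )
        model.train_model(train_df)
        result, _, _ = model.eval_model(eval_df)  # metrics dict, raw outputs, wrong predictions
        results[(lr, epochs)] = result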

bhardwaj1230 commented 4 years ago

I have a similar problem, and I'm not sure why. When I fine-tune on less data, like 10K examples, I get good results, but when I fine-tune on 500K I get the following results from three different models:

On 10k train set:

camembert --> {'mcc': 0.7406990363469542, 'tp': 457, 'tn': 412, 'fp': 87, 'fn': 44, 'eval_loss': 0.27663073867559435, 'acc': 0.869}

On 500K train set

camembert: {'mcc': 0.0, 'tp': 28629, 'tn': 0, 'fp': 28629, 'fn': 0, 'acc': 0.5, 'eval_loss': 0.6931502344755528}
xlmroberta: {'mcc': 0.0, 'tp': 28629, 'tn': 0, 'fp': 28629, 'fn': 0, 'acc': 0.5, 'eval_loss': 0.6932064594753496}
flaubert: {'mcc': 0.0, 'tp': 28629, 'tn': 0, 'fp': 28629, 'fn': 0, 'acc': 0.5, 'eval_loss': 0.6931530762797928}

ThilinaRajapakse commented 4 years ago

The larger models seem to be unstable. It's not the size of the dataset that's causing the issue but the longer training time (or training steps).

antonyscerri commented 4 years ago

It can yield results, but it seems to be a careful balancing act. Dropping the learning rate significantly can help in some cases, but longer runs with higher epoch counts could still result in complete failure, so stopping conditions would be necessary. I've had to go look at a bunch of other things, but I'd be interested in hearing if anyone looks at keeping the base layers frozen for a period (if that's applicable to their target model).
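
If anyone wants to try the freezing idea, here is a rough sketch of what I have in mind (simpletransformers exposes the underlying Hugging Face model as model.model, so requires_grad can be toggled directly; train_df is a placeholder, and frozen parameters simply receive no gradient updates):

from simpletransformers.classification import ClassificationModel

model = ClassificationModel("xlmroberta", "xlm-roberta-large")

# Phase 1: freeze the encoder (model.model.roberta for XLM-R) and train only the classification head
for param in model.model.roberta.parameters():
    param.requires_grad = False
model.train_model(train_df, args={"num_train_epochs": 1, "overwrite_output_dir": True})

# Phase 2: unfreeze everything and continue fine-tuning with a smaller learning rate
for param in model.model.roberta.parameters():
    param.requires_grad = True
model.train_model(train_df, args={"num_train_epochs": 2,
                                  "learning_rate": 5e-6,
                                  "overwrite_output_dir": True})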

bhardwaj1230 commented 4 years ago

@ThilinaRajapakse and @antonyscerri, thank you for the responses. I got your point, and based on the earlier discussion in this thread I tried a lower learning rate with early stopping and apparently got a good result. Since I have French data to classify, the only two options I have are camembert and flaubert, and my plan is to train the model on the complete data; any suggestions for tuning the hyperparameters would be a great help.

camembert ('learning_rate': 2e-5, 'use_early_stopping': True, 500K train set): {'mcc': 0.783252254138539, 'tp': 2465, 'tn': 1945, 'fp': 566, 'fn': 23, 'acc': 0.8821764352870574, 'eval_loss': 0.2191259129577951}. The training stopped partway through due to the stopping criterion "early_stopping_patience": 3.
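
For reference, the args behind that run were roughly the following (a sketch; "camembert-base" is an assumed checkpoint name, and train_df/eval_df stand in for the 500K set):

from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "camembert",
    "camembert-base",                       # assumed checkpoint; use the model you actually load
    args={
        "learning_rate": 2e-5,
        "evaluate_during_training": True,   # required so early stopping can trigger
        "use_early_stopping": True,
        "early_stopping_patience": 3,
    },
)
model.train_model(train_df, eval_df=eval_df)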

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

dwang888 commented 1 year ago

In my use case, 20 epochs (not 20 steps) with a lower learning rate does converge very well. I know this has an overfitting risk, but that's the only parameter combination that works. It's quite different from other models such as RoBERTa, where fine-tuning for only a few epochs produces decent results.

omgwenxx commented 1 month ago

> In my use case, 20 epochs (not 20 steps) with a lower learning rate does converge very well. I know this has an overfitting risk, but that's the only parameter combination that works. It's quite different from other models such as RoBERTa, where fine-tuning for only a few epochs produces decent results.

Out of curiosity, what learning rate did you use?