Closed — AliOsm closed this issue 4 years ago
Did you try with the default values for these hyper parameters? Are you saving checkpoints every X steps? If so, how many?
Hello @kinoute,
I tried with the default values and with many other values. I'm saving a checkpoint after each epoch only and using the last epoch to predict. The problem is that the model loss stays at 0.7xxx throughout fine-tuning; it is not converging at all.
I only tested XLM Roberta once but IIRC, at the beginning, metrics were almost 0 (I did binary text classification) for early checkpoints.
I will see tomorrow if I can find my hyperparameters for this model.
I don't know the size of your dataset, but you could enable checkpoints every few steps to see how the train loss and eval loss evolve. After more than a hundred experiments with this library and distilbert, what I know is that I could not get a good model if I only save/evaluate every epoch and pick the final model. For a dataset of around 10k entries, I set the batch size to 16 and a checkpoint every 50 steps.
Because of the batch size there are some ups and downs in the eval loss, but it decreases. At the end I pick the best model checkpoint according to the eval loss by looking into the training csv file.
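Picking the best checkpoint from the training csv can be done programmatically. Here is a minimal sketch; the column names (`global_step`, `eval_loss`) are assumed to match the progress csv that simpletransformers writes during training, and the log contents below are purely illustrative:

```python
import csv
import io

def best_checkpoint(csv_text):
    """Return the (global_step, eval_loss) pair with the lowest eval_loss."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = min(rows, key=lambda r: float(r["eval_loss"]))
    return int(best["global_step"]), float(best["eval_loss"])

# Illustrative training log (in practice, read the csv from the output dir):
log = """global_step,train_loss,eval_loss
200,0.69,0.66
400,0.55,0.48
600,0.41,0.39
800,0.52,0.44
"""

step, loss = best_checkpoint(log)
print(step, loss)  # 600 0.39
```

You would then load the checkpoint directory corresponding to that step for testing.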
You don't need to do that anymore since early stopping was recently added.
So to sum up, I suggest you save/evaluate every X steps depending on your dataset, and enable early stopping with an average patience (or pick the best model manually by looking at the training csv file and test it).
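For reference, a sketch of what such a configuration might look like with simpletransformers' args dict; the key names follow the library's classification-model arguments, and the values are illustrative, not recommendations:

```python
# Illustrative simpletransformers args: evaluate every 50 steps and stop
# once eval_loss stops improving for a few evaluations in a row.
args = {
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 50,  # checkpoint/eval every 50 steps
    "save_eval_checkpoints": True,
    "use_early_stopping": True,
    "early_stopping_patience": 3,    # stop after 3 evals without improvement
    "early_stopping_delta": 0.0,     # minimum change that counts as improvement
    "early_stopping_metric": "eval_loss",
    "early_stopping_metric_minimize": True,
}
print(args["evaluate_during_training_steps"])  # 50
```

These args would then be passed to `ClassificationModel(..., args=args)`.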
You can also watch the training loss on W&B or tensorboard to check if anything weird is going on. I recently noticed that training some models for too long suddenly makes the model predict only 0s or 1s. I haven't had the time to investigate this issue properly yet, but maybe this is the same thing.
The problem is that the running loss of the model during fine-tuning is not converging. It is not only about the test data; the model didn't perform well on the training data either.
The size of the dataset is about 45K training examples.
I understand that, but you should add checkpoints within epochs. This won't improve your model, but it will give you more insight into what's really going on, since you will get many more metrics at different points in the process.
I'm about to re-run XLM Roberta on my binary classifier and I will report what worked for me.
It would help to know whether the training loss decreases at any point before going crazy. That can be most easily observed by using one of the visualization options.
@AliOsm I checked on my binary classifier very quickly. Here are my code/args:
```python
model = ClassificationModel('xlmroberta', 'xlm-roberta-base',
                            weight=weights,
                            args={
                                "weight_decay": 0,
                                "train_batch_size": 8,
                                "eval_batch_size": 8,
                                "sliding_window": False,
                                "max_seq_length": 512,
                                "num_train_epochs": 3,
                                "evaluate_during_training_steps": 200,
                                "use_early_stopping": False,
                                "fp16": False},
                            use_cuda=True)
```
I couldn't use the large model on Colab, nor the base model with a batch size of 16, because of GPU memory issues. Therefore, I increased the checkpoint steps.
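One common workaround for this kind of GPU memory limit is gradient accumulation: keep the per-step batch small but accumulate gradients over several steps before updating. A minimal sketch of the arithmetic, using the argument names simpletransformers exposes (values illustrative):

```python
# Simulate an effective batch size of 16 on a GPU that only fits 8 examples,
# by accumulating gradients over 2 forward/backward passes per optimizer step.
train_batch_size = 8
gradient_accumulation_steps = 2
effective_batch_size = train_batch_size * gradient_accumulation_steps

args = {
    "train_batch_size": train_batch_size,
    "gradient_accumulation_steps": gradient_accumulation_steps,
}
print(effective_batch_size)  # 16
```

Note that accumulation trades memory for wall-clock time; the loss dynamics should be close to those of the larger batch, but not identical (e.g. batch-level statistics differ).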
Here are the results:
We can see some weird behavior indeed. If I had trained for only 2 epochs (checkpoint 1456) and picked the final model, I would literally have a random model.
But thanks to the checkpoints feature, we can see there is at least something to use and to test. Checkpoint 1000 (after more than one epoch) has the smallest eval_loss and could be used for testing.
I'm not really concerned about the training_loss bouncing around that much, especially when using mini-batch gradient descent on a fine-tuning task. But yes, this architecture seems hard to master. I had the same problem with previous XLM models.
The checkpoints at steps 1200, 1400, and 1456 are curious indeed. The model seems to resort to giving pretty much the same label for all inputs. This has happened to me a few times as well, but my investigations didn't really yield anything useful. However, it seems that this issue is not specific to Simple Transformers; I observed this behaviour even when using the Transformers package directly.
Do you guys have any insights on this?
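A cheap sanity check for this "same label for everything" failure mode is to look at the distribution of the model's predictions on the eval set. A minimal sketch (the threshold and function name are just for illustration):

```python
from collections import Counter

def is_degenerate(predictions, threshold=0.99):
    """Flag a classifier that outputs (almost) the same label for every input."""
    counts = Counter(predictions)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(predictions) >= threshold

print(is_degenerate([1] * 100))    # True: an all-ones model
print(is_degenerate([0, 1] * 50))  # False: a balanced output
```

Running this on each checkpoint's predictions would immediately flag checkpoints like 1200/1400/1456 above without eyeballing the confusion matrix.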
I was experiencing a similar issue trying to use xlm-roberta-large. I've been playing around a little and made some progress. With the base model and a small dataset, run_glue (sst-2 task, modified slightly) from transformers worked with all the defaults (including not specifying do_lower): f1 0.74. If you turn do_lower on it still works, but not as well: f1 0.66. With the large model without do_lower I get no progress: f1 0.0. With do_lower it shows some drop in loss and an f1 of 0.08. For results comparable to the base model I had to raise the batch size to 16 from the default of 8, and then I got an f1 of 0.68.
Can you tell me whether you were lower casing or not, and what batch size you used? If you have a chance, can you try some of the combinations I mention and see if anything works for you? I've been looking through the code and pretrained weights to see if anything else stands out; nothing so far.
Experimenting further with larger epoch numbers, the batch size of 16 without do_lower worked, achieving an f1 of 0.7 with a steady drop in loss. I also confirmed that the vocab (sentence bpe) is case sensitive, so it should not require the do_lower flag. It looks like you need to get the other parameters (and possibly your dataset) right to see results. It may also be worth checking that your data produces a mix of labels for the training set.
In my case, carefully adjusting the learning rate (with the existing scheduler) along with the number of epochs (as well as the previous increase in batch size) allowed me to get much better results (beating those with the base model to date). So there doesn't appear to be anything fundamentally wrong with the pre-trained model or the core model code. It seems you need to do a broader sweep of the parameters in your case (assuming no data issues, etc.).
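A broader parameter sweep can be set up with a simple cartesian product over the candidate values. A minimal sketch; the specific values below are hypothetical, not the ones used in this thread:

```python
from itertools import product

# Hypothetical search grid over the parameters discussed above.
learning_rates = [5e-6, 1e-5, 2e-5]
epochs = [3, 5, 10]
batch_sizes = [8, 16]

grid = list(product(learning_rates, epochs, batch_sizes))
print(len(grid))  # 18 configurations

# Each (lr, n_epochs, bs) triple would then be passed into the model args,
# e.g. args={"learning_rate": lr, "num_train_epochs": n_epochs,
#            "train_batch_size": bs}, training once per configuration and
# keeping the run with the best eval_loss.
```

With early stopping enabled, the expensive long-epoch configurations abort cheaply when they diverge, which keeps the sweep affordable.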
Thank you for your investigations and for sharing what you found, @antonyscerri !
I suppose the bottom line is that the larger models require a little more finesse to get good results compared to the base models. I'm still curious as to why the models seem to go "screw it!" and just output the same label. But I guess it's safe to postpone that investigation for now as I am a little tied up these days.
I've got a basic grid search going at the moment, and it does seem, in my case at least, that there is a degree of instability which leads to poor results. I've not gone as far as this yet, but freezing/unfreezing different layers may well help with some of this.
I have a similar problem, not sure why. When I fine-tune on less data, like 10K, I get good results, but when I fine-tune on 500K I get the following results from three different models:
camembert --> {'mcc': 0.7406990363469542, 'tp': 457, 'tn': 412, 'fp': 87, 'fn': 44, 'eval_loss': 0.27663073867559435, 'acc': 0.869}
camembert: {'mcc': 0.0, 'tp': 28629, 'tn': 0, 'fp': 28629, 'fn': 0, 'acc': 0.5, 'eval_loss': 0.6931502344755528}
xlmroberta: {'mcc': 0.0, 'tp': 28629, 'tn': 0, 'fp': 28629, 'fn': 0, 'acc': 0.5, 'eval_loss': 0.6932064594753496}
flaubert: {'mcc': 0.0, 'tp': 28629, 'tn': 0, 'fp': 28629, 'fn': 0, 'acc': 0.5, 'eval_loss': 0.6931530762797928}
The larger models seem to be unstable. It's not the size of the dataset that's causing the issue but the longer training time (or training steps).
It can yield results, but it seems to be a careful balancing act. Dropping the learning rate significantly can help in some cases, but longer runs with more epochs could still result in complete failure, so stopping conditions would be necessary. I had to go look at a bunch of other things, but I'd be interested in hearing if anyone looks at keeping the base layers frozen for a period (if that's applicable to their target model).
@ThilinaRajapakse and @antonyscerri, thank you for the response. I got your point, and based on the previous discussion in the thread I tried a lower learning rate with early stopping and apparently got good results. Since I have French data to classify, the only two options I have here are camembert and flaubert, and my plan is to train the model on the complete data; any suggestions for tuning the hyperparameters would be a great help.
camembert ('learning_rate': 2e-5, 'use_early_stopping': True, 500K train set): {'mcc': 0.783252254138539, 'tp': 2465, 'tn': 1945, 'fp': 566, 'fn': 23, 'acc': 0.8821764352870574, 'eval_loss': 0.2191259129577951}. Training stopped partway through due to the stopping criterion "early_stopping_patience": 3.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In my use case, 20 epochs (not 20 steps) with a lower learning rate does converge very well. I know this carries an overfitting risk, but that's the only parameter combination that works. It's quite different from other models such as RoBERTa, where fine-tuning for only a few epochs will generate decent results.
Out of curiosity, what learning rate did you use?
I tried to fine-tune the XLM-RoBERTa large model in the Google Colab environment for 3 epochs using a `1e-5` learning rate, a batch size of 16, 2 gradient accumulation steps, and 120 warmup steps. But the loss didn't converge and the model gives random predictions after fine-tuning. I used the sentence pair minimal example as a starting point.
Do you have any idea?