NoviScl / XLNet_DREAM


Model doesn't learn #1

Closed. Mayer123 closed this issue 5 years ago.

Mayer123 commented 5 years ago

Thanks for releasing the code, good work! However, when I tried running your code (Python 3.6, PyTorch 1.1) with the exact same command, the training loss does not decrease and accuracy stays at random. (I tried both the base and large models, with the same results.) Do you have any insights on this issue?

NoviScl commented 5 years ago

Devlin et al. (2018) report that, on small datasets, BERT sometimes fails to train, yielding degenerate results. I have encountered similar situations when training XLNet as well. A workaround is to rerun the training a few more times, change the random seed, or slightly adjust the learning rate.
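For illustration, "changing the random seed" here means reseeding every RNG source before training starts; a minimal sketch (the released script may wire this up differently, or not expose a seed at all):

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG that affects weight init, dropout, and data shuffling.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op without a GPU

set_seed(42)  # rerun with 43, 44, ... if a run degenerates
```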

Mayer123 commented 5 years ago

> Devlin et al. (2018) report that, on small datasets, BERT sometimes fails to train, yielding degenerate results. I have encountered similar situations when training XLNet as well. A workaround is to rerun the training a few more times, change the random seed, or slightly adjust the learning rate.

Thanks for the advice, but I have tried 4 different seeds and a larger learning rate, and none of these seems to help.

NoviScl commented 5 years ago

Can you check that you have downloaded the right pre-trained XLNet model? You could also try a smaller learning rate (e.g. 6e-6), and adjust the warmup steps and batch size as well. Let me know if that helps. Sometimes even the same set of hyper-parameters generates different results :<

Mayer123 commented 5 years ago

Yeah, the models are downloaded from https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-pytorch_model.bin (I didn't change any part of your code). I tried a 6e-6 learning rate and per-forward batch sizes of 1/2/3 (batch size 32 with gradient accumulation 32, batch size 24 with gradient accumulation 12, and batch size 24 with gradient accumulation 8). It still doesn't work.
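(For readers following the numbers: in scripts of this generation, train_batch_size divided by gradient_accumulation_steps gives the micro-batch per forward pass, so 32/32, 24/12, and 24/8 correspond to the forward batch sizes 1, 2, and 3 above. A minimal, self-contained sketch of the accumulation pattern, with illustrative names and a dummy model rather than the repo's code:)

```python
import torch
from torch import nn

# Gradient accumulation: the optimizer steps once every `accum_steps`
# micro-batches, so the effective batch per update is
# micro_batch_size * accum_steps.
model = nn.Linear(10, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=6e-6)
accum_steps = 12
micro_batches = [(torch.randn(2, 10), torch.randint(0, 3, (2,)))
                 for _ in range(48)]  # micro-batch size 2

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()    # average over the accumulation window
    if (step + 1) % accum_steps == 0:  # one update per 24 examples
        optimizer.step()
        optimizer.zero_grad()
```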

NoviScl commented 5 years ago

Could you show me the training log (i.e. the logger output, the training losses and accuracy after each epoch, and the model config)? And did you check whether max_length 128 works? Meanwhile, I will run it again on my machine to double-check.

NoviScl commented 5 years ago

Hi, I just tried again and found that it's possible to get degenerate runs even with the same set of hyper-parameters. However, by restarting the training a few times, the code can indeed reproduce the results. If you need it, I can send you my trained checkpoint. One tip: if the accuracy after the first epoch is still around random performance, just restart the training.

Mayer123 commented 5 years ago

07/23/2019 22:34:41 - INFO - __main__ - device cuda n_gpu 1 distributed training False
07/23/2019 22:34:41 - INFO - tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-spiece.model from cache at /home/kaixinm/.cache/torch/pytorch_transformers/dad589d582573df0293448af5109cb6981ca77239ed314e15ca63b7b8a318ddd.8b10bd978b5d01c21303cc761fc9ecd464419b3bf921864a355ba807cfbfafa8
07/23/2019 22:34:42 - INFO - modeling_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-config.json from cache at /home/kaixinm/.cache/torch/pytorch_transformers/c9cc6e53904f7f3679a31ec4af244f4419e25ebc8e71ebf8c558a31cbcf07fc8.ef1824921bc0786e97dc88d55eb17aabf18aac90f24bd34c0650529e7ba27d6f
07/23/2019 22:34:42 - INFO - modeling_utils - Model config {
  "attn_type": "bi",
  "bi_data": false,
  "clamp_len": -1,
  "d_head": 64,
  "d_inner": 3072,
  "d_model": 768,
  "dropout": 0.1,
  "end_n_top": 5,
  "ff_activation": "gelu",
  "finetuning_task": null,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "mem_len": null,
  "n_head": 12,
  "n_layer": 12,
  "n_token": 32000,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "reuse_len": null,
  "same_length": false,
  "start_n_top": 5,
  "summary_activation": "tanh",
  "summary_last_dropout": 0.1,
  "summary_type": "last",
  "summary_use_proj": true,
  "torchscript": false,
  "untie_r": true
}

07/23/2019 22:34:42 - INFO - modeling_utils - loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-pytorch_model.bin from cache at /home/kaixinm/.cache/torch/pytorch_transformers/distributed-1/24197ba0ce5dbfe23924431610704c88e2c0371afa49149360e4c823219ab474.7eac4fe898a021204e63c88c00ea68c60443c57f94b4bc3c02adbde6465745ac
07/23/2019 22:34:45 - INFO - modeling_utils - Weights of XLNetForSequenceClassification not initialized from pretrained model: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias']
07/23/2019 22:34:45 - INFO - modeling_utils - Weights from pretrained model not used in XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
07/23/2019 22:35:08 - INFO - __main__ - Running training
07/23/2019 22:35:08 - INFO - __main__ - Num examples = 18348
07/23/2019 22:35:08 - INFO - __main__ - Batch size = 2
07/23/2019 22:35:08 - INFO - __main__ - Num steps = 764
07/23/2019 22:35:09 - INFO - __main__ - Training loss: 1.3281731605529785, global step: 0
07/23/2019 22:39:49 - INFO - __main__ - Training loss: 1.1105041134074685, global step: 50
07/23/2019 22:44:33 - INFO - __main__ - Training loss: 1.1107451395270678, global step: 100
07/23/2019 22:49:18 - INFO - __main__ - Training loss: 1.1104911775551256, global step: 150
07/23/2019 22:55:23 - INFO - __main__ - Epoch: 1
07/23/2019 22:55:23 - INFO - __main__ - Eval results
07/23/2019 22:55:23 - INFO - __main__ - eval_accuracy = 0.2954434100930916
07/23/2019 22:55:23 - INFO - __main__ - eval_loss = 1.100573520641774
07/23/2019 22:55:23 - INFO - __main__ - global_step = 191
07/23/2019 22:55:23 - INFO - __main__ - loss = 0.06936791063115205
07/23/2019 22:55:24 - INFO - __main__ - Training loss: 1.0778791904449463, global step: 191
07/23/2019 23:00:02 - INFO - __main__ - Training loss: 1.0940000135056238, global step: 241
07/23/2019 23:04:39 - INFO - __main__ - Training loss: 1.0935547150797131, global step: 291
07/23/2019 23:09:17 - INFO - __main__ - Training loss: 1.09369897916882, global step: 341

Above is the training log from the epoch I just started. I'm wondering how many times you had to try before getting a working trial?

NoviScl commented 5 years ago

My experience is that training is more stable with max length 128 (the final performance will be ~69). It was less stable with max length 256, and I had to keep adjusting learning rates and restarting training to get a working model. Due to the randomness, it's hard to say how many times you need to train (in my own experience, about 5-6 times). I'd recommend writing a script to automate the retrying process. TBH I don't have a very good explanation of why this happens or how to prevent it. You may need to ask the original authors about that. Such cases could potentially reveal some limitations of XLNet/BERT.
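As a sketch of what such an automated retry script could look like, assuming the training run prints `eval_accuracy = ...` as in the log above and accepts a hypothetical `--seed` flag (the released script may need patching to expose one):

```python
import re
import subprocess

# Hypothetical retry wrapper: rerun training with a fresh seed until the
# eval accuracy clears chance level (~1/3 for three answer options).
for seed in range(1, 7):  # ~5-6 attempts, per the advice above
    proc = subprocess.run(
        ["python", "run_xlnet_dream.py", "--data_dir=data", "--do_train",
         "--do_eval", f"--seed={seed}"],  # --seed is an assumed flag
        capture_output=True, text=True,
    )
    log = proc.stdout + proc.stderr  # logger output may go to either stream
    match = re.search(r"eval_accuracy = ([0-9.]+)", log)
    if match and float(match.group(1)) > 0.5:  # well above ~0.33 chance
        print(f"seed {seed} succeeded with accuracy {match.group(1)}")
        break
    print(f"seed {seed} degenerated; retrying")
```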

Mayer123 commented 5 years ago

Thanks for your quick response. I've been trying max length 128, batch size/gradient accumulation of 24/6, 24/4, 24/3, 32/16, and 32/8, and learning rates between 5e-6 and 1e-5. I still haven't seen a successful trial. I'm wondering whether these hyper-parameters are within the workable range? Also, what warm-up ratio would you recommend?

NoviScl commented 5 years ago

Hi, I have updated the code to support more hyper-parameter tuning. You may want to fork/clone the latest version and try it. I have run the new code again and can reproduce the results successfully. For example, this set of hyper-parameters works well on my machine:

python run_xlnet_dream.py --data_dir=data --xlnet_model=xlnet-large-cased --output_dir=xlnet_dream --max_seq_length=128 --do_train --do_eval --train_batch_size=32 --eval_batch_size=2 --learning_rate=2e-5 --num_train_epochs=3 --warmup_steps=120 --weight_decay=0.0 --adam_epsilon=1e-6 --gradient_accumulation_steps=16

Mayer123 commented 5 years ago

> Hi, I have updated the code to support more hyper-parameter tuning. You may want to fork/clone the latest version and try it. I have run the new code again and can reproduce the results successfully. For example, this set of hyper-parameters works well on my machine:
>
> python run_xlnet_dream.py --data_dir=data --xlnet_model=xlnet-large-cased --output_dir=xlnet_dream --max_seq_length=128 --do_train --do_eval --train_batch_size=32 --eval_batch_size=2 --learning_rate=2e-5 --num_train_epochs=3 --warmup_steps=120 --weight_decay=0.0 --adam_epsilon=1e-6 --gradient_accumulation_steps=16

Thanks for your help! I just tried this set of hyper-parameters, the model finally starts to learn, and I was able to get 67.4 accuracy. It just seems like magic to me; I don't see any substantial difference between these hyper-parameters and the ones I tried before. I'm wondering whether there are any other magical tricks to choosing these hyper-parameters besides max_seq_length=128?

NoviScl commented 5 years ago

I've compared the code in my previous repo with the code I run on my own machine and realised that I had accidentally added one line to the repo's training code: random.shuffle(train_examples). This is wrong because each example is one (passage, question, option) triple, and every three such triples should be grouped together as one complete InputExample (corresponding to the three options of one question, which is implemented in the convert_examples_to_features function). So the correct way is to shuffle the grouped features instead of the ungrouped examples. Now that it's fixed, the degenerate-run problem shouldn't be so bad, and it should be easy to reproduce the results. You can try max length 256/512 and tune the learning rates accordingly to achieve even better performance.
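A minimal sketch of the fix described above (illustrative names, not the repo's exact code): shuffle whole questions, i.e. groups of three (passage, question, option) triples, rather than the flat triple list:

```python
import random

# Each DREAM question expands into 3 (passage, question, option) triples
# that must stay adjacent. Shuffling the flat list (the bug) separates a
# question's options; shuffling groups of 3 (the fix) keeps them together.
def shuffle_grouped(examples, group_size=3, seed=42):
    assert len(examples) % group_size == 0
    groups = [examples[i:i + group_size]
              for i in range(0, len(examples), group_size)]
    random.Random(seed).shuffle(groups)  # shuffle questions, not options
    return [ex for group in groups for ex in group]
```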