8GB of GPU memory is not a lot to train BERT with. It looks like you're using a BERT-Base model, which helps, but it'll still be tight. How big is the vocab for that model? The vocab takes up a lot of space.
You say this used to work before?
Hi @dirkgr , thanks for the reply. The training config below currently works with AllenNLP 0.9.1: I can train it without any issues with a batch size of 64, if I'm not mistaken, and batch size 32 certainly works. This issue arose after upgrading to the current 1.0.0-rc, and even with a batch size of 4 it still throws a memory error.
The vocab size is 29,794.
{
"dataset_reader": {
"type": "conll2003",
"coding_scheme": "BIOUL",
"tag_label": "ner",
"token_indexers": {
"bert": {
"type": "bert-pretrained",
"do_lowercase": false,
"pretrained_model": "https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/vocab.txt",
"truncate_long_sequences": false,
"use_starting_offsets": true
}
}
},
"iterator": {
"type": "bucket",
"batch_size": 128,
"cache_instances": true,
"sorting_keys": [
[
"tokens",
"num_tokens"
]
]
},
"model": {
"type": "crf_tagger",
"calculate_span_f1": true,
"constrain_crf_decoding": true,
"dropout": 0.5,
"encoder": {
"type": "lstm",
"bidirectional": true,
"dropout": 0.5,
"hidden_size": 200,
"input_size": 768,
"num_layers": 2
},
"include_start_end_transitions": false,
"label_encoding": "BIOUL",
"text_field_embedder": {
"allow_unmatched_keys": true,
"embedder_to_indexer_map": {
"bert": [
"bert",
"bert-offsets"
]
},
"token_embedders": {
"bert": {
"type": "bert-pretrained",
"pretrained_model": "https://datalawyer-models.s3.amazonaws.com/bert/bert-base-portuguese-cased.tar.gz"
}
}
}
},
"train_data_path": "train.conll",
"validation_data_path": "dev.conll",
"test_data_path": "test.conll",
"trainer": {
"cuda_device": 0,
"num_epochs": 50,
"num_serialized_models_to_keep": 1,
"optimizer": {
"type": "bert_adam",
"lr": 0.0005
},
"patience": 15,
"should_log_learning_rate": true,
"validation_metric": "+f1-measure-overall"
},
"evaluate_on_test": true
}
The difference is that the old one would not train BERT, it would just run it. In other words, the BERT layers were frozen. That way, it doesn't have to store gradients or activations for those layers, but of course the performance will be worse.
The new version does not have this capability, but it will be quick to add. I'll try to get it done next week.
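For comparison, the 0.9 bert-pretrained embedder exposes a requires_grad flag (false by default, if I remember correctly), which is why BERT stayed frozen in your old config even though the rest of the model trained. A minimal sketch of that 0.9-style embedder block:
// 0.9.x-style token embedder (sketch): the BERT weights stay frozen unless
// "requires_grad" is set to true, which is what keeps memory usage low.
"token_embedders": {
  "bert": {
    "type": "bert-pretrained",
    "pretrained_model": "https://datalawyer-models.s3.amazonaws.com/bert/bert-base-portuguese-cased.tar.gz",
    "requires_grad": false
  }
}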
I see. So @dirkgr , suppose I run this one with an appropriate batch size on a V100. Does this mean the model with trainable BERT would have a lot more parameters and would thus be a lot slower and less scalable than my current 0.9.1 setting?
Yes, that's what it means.
Technically speaking, the number of parameters is the same, but the ones from BERT are not trained and receive no updates.
If the parameters of the final trained model are the same as in the current version, only now they are actually trained, doesn't that mean that, at least for inference and predictions, the model would have the same performance as in 0.9.1?
It should, yes, but I'm not sure how easy it is to load the model correctly in the new version.
@dirkgr I managed to train with a batch size of 8 on a V100, but with the following outcome: just switching the config to the new token indexer and embedder, the model didn't manage to learn anything. Is there anything else I should be setting in the config?
Metrics: {
"best_epoch": 0,
"peak_cpu_memory_MB": 3553.98,
"peak_gpu_0_memory_MB": 31044,
"peak_gpu_1_memory_MB": 12609,
"peak_gpu_2_memory_MB": 31042,
"peak_gpu_3_memory_MB": 16120,
"peak_gpu_4_memory_MB": 10572,
"peak_gpu_5_memory_MB": 25712,
"peak_gpu_6_memory_MB": 12884,
"peak_gpu_7_memory_MB": 1865,
"training_duration": "0:23:44.686729",
"training_start_epoch": 0,
"training_epochs": 4,
"epoch": 4,
"training_accuracy": 0.8868650078312007,
"training_accuracy3": 0.9365931321847573,
"training_precision-overall": 0.0,
"training_recall-overall": 0.0,
"training_f1-measure-overall": 0.0,
"training_loss": 67.34692806199419,
"training_reg_loss": 0.0,
"training_cpu_memory_MB": 3553.98,
"training_gpu_0_memory_MB": 31044,
"training_gpu_1_memory_MB": 12489,
"training_gpu_2_memory_MB": 31042,
"training_gpu_3_memory_MB": 16120,
"training_gpu_4_memory_MB": 10572,
"training_gpu_5_memory_MB": 25712,
"training_gpu_6_memory_MB": 12884,
"training_gpu_7_memory_MB": 1865,
"validation_accuracy": 0.8833699184916705,
"validation_accuracy3": 0.9352187452141508,
"validation_precision-overall": 0.0,
"validation_recall-overall": 0.0,
"validation_f1-measure-overall": 0.0,
"validation_loss": 65.62932859045087,
"validation_reg_loss": 0.0,
"best_validation_accuracy": 0.8833699184916705,
"best_validation_accuracy3": 0.9364779553150577,
"best_validation_precision-overall": 0.0,
"best_validation_recall-overall": 0.0,
"best_validation_f1-measure-overall": 0.0,
"best_validation_loss": 138.4557712102177,
"best_validation_reg_loss": 0.0,
"test_accuracy": 0.8963938666802811,
"test_accuracy3": 0.943142560541856,
"test_precision-overall": 0.0,
"test_recall-overall": 0.0,
"test_f1-measure-overall": 0.0,
"test_loss": 124.76361329386933
}
To train BERT, 1e-4 is a pretty high learning rate, and you are using neither an LR scheduler nor gradient norm clipping. Look at one of the transformer-based training configs (I recommend TransformerQA) to see the setup I usually use to train transformers.
Thanks @dirkgr ! With this setting I'm able to train the model with a good performance. I'll use this as a base for allentune, as suggested in #4225 .
"trainer": {
"optimizer": {
"type": "huggingface_adamw",
"weight_decay": 0.0,
"parameter_groups": [[
["text_field_embedder", "encoder", "tag_projection_layer", "crf"],
{"weight_decay": 0}
]],
"lr": 2e-5,
"eps": 1e-8
},
"learning_rate_scheduler": {
"type": "slanted_triangular",
//"num_epochs": epochs,
"cut_frac": 0.1,
},
"grad_norm": 1.0,
"grad_clipping": 1.0,
"cuda_device": 0,
"num_epochs": epochs,
"checkpointer": {
"num_serialized_models_to_keep": 1
},
"patience": 5,
"validation_metric": "+f1-measure-overall"
},
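(Note for anyone copying this snippet: epochs is a jsonnet variable, so the full file needs a binding at the top along the lines of the line below; 50 is just a placeholder value.)
local epochs = 50;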
Describe the bug I'm trying to adapt my 0.9.0 NER training config using BERT to the new version, so I can run some tests, as suggested by @schmmd here. However, I'm not able to do so: I'm getting the following error during the first epoch of training. The error happens regardless of the batch size I choose. On my GTX 1070 GPU I was able to run a BERT training with batch size 32, but now I went all the way down to 4 and still kept getting the error.
To Reproduce This is the training config I used:
Command line I used:
allennlp train bert-base-pt.jsonnet --serialization-dir test --include-package allennlp_models
Additional context
If I use this exact same training config but just switch to using ELMo tokens/embedders, there is no error and training is conducted normally. Since the previous BERT embedder is not supported in 1.0.0rc anymore, I can't say whether it's a problem with transformers or with the new pretrained_transformer_mismatched indexer/embedder.
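For reference, this is roughly how I'm wiring the replacement indexer/embedder pair in the 1.0 config; treat it as a sketch, with the HuggingFace model name neuralmind/bert-base-portuguese-cased standing in for the same Portuguese BERT I linked above:
// dataset_reader side (sketch)
"token_indexers": {
  "tokens": {
    "type": "pretrained_transformer_mismatched",
    "model_name": "neuralmind/bert-base-portuguese-cased"
  }
},
// model side (sketch)
"text_field_embedder": {
  "token_embedders": {
    "tokens": {
      "type": "pretrained_transformer_mismatched",
      "model_name": "neuralmind/bert-base-portuguese-cased"
    }
  }
}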