allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Problem adapting NER training with BERT to pretrained_transformer_mismatched #4273

Closed pvcastro closed 4 years ago

pvcastro commented 4 years ago

Describe the bug
I'm trying to adapt my 0.9.0 version of NER training using BERT to the new version, so I can run some tests, as suggested by @schmmd here. However, I'm not able to do so: I'm getting the following error during the first epoch of training. The error happens regardless of the batch size I choose. On my GTX 1070 GPU I was able to run BERT training with batch size 32, but now I've gone all the way down to 4 and still get the error.

2020-05-21 18:08:24,624 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 1982
2020-05-21 18:08:24,625 - INFO - allennlp.training.trainer - Training
  0%|          | 0/922 [00:00<?, ?it/s]/opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of add_ is deprecated:
        add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
        add_(Tensor other, *, Number alpha)
accuracy: 0.8882, accuracy3: 0.9361, precision-overall: 0.0000, recall-overall: 0.0000, f1-measure-overall: 0.0000, loss: 209.7350, reg_loss: 0.0000 ||:  47%|####6     | 430/922 [02:51<03:16,  2.50it/s]
Traceback (most recent call last):
  File "/media/discoD/anaconda3/envs/allennlp/bin/allennlp", line 11, in <module>
    load_entry_point('allennlp', 'console_scripts', 'allennlp')()
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 19, in run
    main(prog="allennlp")
  File "/media/discoD/repositorios/allennlp/allennlp/commands/__init__.py", line 92, in main
    args.func(args)
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 112, in train_model_from_args
    dry_run=args.dry_run,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 171, in train_model_from_file
    dry_run=dry_run,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 234, in train_model
    dry_run=dry_run,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 431, in _train_worker
    metrics = train_loop.run()
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 493, in run
    return self.trainer.train()
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 739, in train
    train_metrics = self._train_epoch(epoch)
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 507, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 413, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp_models/ner/crf_tagger.py", line 207, in forward
    embedded_text_input = self.text_field_embedder(tokens)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 88, in forward
    token_vectors = embedder(**tensors, **forward_params_values)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/token_embedders/pretrained_transformer_mismatched_embedder.py", line 75, in forward
    token_ids, wordpiece_mask, type_ids=type_ids, segment_concat_mask=segment_concat_mask
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 120, in forward
    embeddings = self.transformer_model(**parameters)[0]
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 736, in forward
    encoder_attention_mask=encoder_extended_attention_mask,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 407, in forward
    hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 368, in forward
    self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 314, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 234, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 94.00 MiB (GPU 0; 7.92 GiB total capacity; 6.18 GiB already allocated; 93.62 MiB free; 6.36 GiB reserved in total by PyTorch)

To Reproduce
This is the training config I used:

{
    "dataset_reader": {
        "type": "conll2003",
        "coding_scheme": "BIOUL",
        "tag_label": "ner",
        "token_indexers": {
            "tokens": {
                "type": "pretrained_transformer_mismatched",
                "model_name": "neuralmind/bert-base-portuguese-cased",
                "max_length": 512
            }
        }
    },
    "data_loader": {
        "batch_sampler": {
            "type": "bucket",
            "batch_size" : 8,
            "sorting_keys": [
                "tokens"
            ]
        }
    },
    "model": {
        "type": "crf_tagger",
        "dropout": 0.5,
        "encoder": {
            "type": "lstm",
            "bidirectional": true,
            "dropout": 0.5,
            "hidden_size": 200,
            "input_size": 768,
            "num_layers": 2
        },
        "include_start_end_transitions": false,
        "label_encoding": "BIOUL",
        "regularizer": {
            "regexes": [
                [
                    "scalar_parameters",
                    {
                        "alpha": 0.1,
                        "type": "l2"
                    }
                ]
            ]
        },
        "text_field_embedder": {
            "token_embedders": {
                "tokens": {
                    "type": "pretrained_transformer_mismatched",
                    "model_name": "neuralmind/bert-base-portuguese-cased",
                    "max_length": 512
                }
            }
        }
    },
    "train_data_path": "train.conll",
    "validation_data_path": "dev.conll",
    "test_data_path": "test.conll",
    "trainer": {
        "cuda_device": 0,
        "grad_norm": 5,
        "num_epochs": 50,
        "checkpointer": {
            "num_serialized_models_to_keep": 1
        },
        "optimizer": {
            "type": "adam",
            "lr": 0.001
        },
        "patience": 5,
        "validation_metric": "+f1-measure-overall"
    },
    "evaluate_on_test": true
}

Command line I used:

allennlp train bert-base-pt.jsonnet --serialization-dir test --include-package allennlp_models

System (please complete the following information):

Additional context

If I use this exact same training config but just switch to ELMo tokens/embedders, everything works fine and training proceeds normally. Since the previous BERT embedder is no longer supported in 1.0.0rc, I can't say whether it's a problem with transformers or with the new pretrained_transformer_mismatched.

dirkgr commented 4 years ago

8GB of GPU memory is not a lot to train BERT with. It looks like you're using a BERT-Base model, which helps, but it'll still be tight. How big is the vocab for that model? The vocab takes up a lot of space.

You say this used to work before?

pvcastro commented 4 years ago

Hi @dirkgr, thanks for the reply. The training config below currently works with AllenNLP 0.9.1. I can train with it without any issues at a batch size of 64, if I'm not mistaken; batch size 32 certainly works. This issue appeared after upgrading to the current 1.0.0-rc, and even with batch size 4 it still throws a memory error.

The vocab size is 29,794.

{
    "dataset_reader": {
        "type": "conll2003",
        "coding_scheme": "BIOUL",
        "tag_label": "ner",
        "token_indexers": {
            "bert": {
                "type": "bert-pretrained",
                "do_lowercase": false,
                "pretrained_model": "https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/vocab.txt",
                "truncate_long_sequences": false,
                "use_starting_offsets": true
            }
        }
    },
    "iterator": {
        "type": "bucket",
        "batch_size": 128,
        "cache_instances": true,
        "sorting_keys": [
            [
                "tokens",
                "num_tokens"
            ]
        ]
    },
    "model": {
        "type": "crf_tagger",
        "calculate_span_f1": true,
        "constrain_crf_decoding": true,
        "dropout": 0.5,
        "encoder": {
            "type": "lstm",
            "bidirectional": true,
            "dropout": 0.5,
            "hidden_size": 200,
            "input_size": 768,
            "num_layers": 2
        },
        "include_start_end_transitions": false,
        "label_encoding": "BIOUL",
        "text_field_embedder": {
            "allow_unmatched_keys": true,
            "embedder_to_indexer_map": {
                "bert": [
                    "bert",
                    "bert-offsets"
                ]
            },
            "token_embedders": {
                "bert": {
                    "type": "bert-pretrained",
                    "pretrained_model": "https://datalawyer-models.s3.amazonaws.com/bert/bert-base-portuguese-cased.tar.gz"
                }
            }
        }
    },
    "train_data_path": "train.conll",
    "validation_data_path": "dev.conll",
    "test_data_path": "test.conll",
    "trainer": {
        "cuda_device": 0,
        "num_epochs": 50,
        "num_serialized_models_to_keep": 1,
        "optimizer": {
            "type": "bert_adam",
            "lr": 0.0005
        },
        "patience": 15,
        "should_log_learning_rate": true,
        "validation_metric": "+f1-measure-overall"
    },
    "evaluate_on_test": true
}
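To put that vocab size in perspective, here is a rough back-of-the-envelope sketch of what the wordpiece embedding table alone costs (assuming BERT-base's 768-dimensional hidden size and fp32 weights; actual training memory is dominated by activations and optimizer state, so this is only a lower bound):

# Rough size of the wordpiece embedding matrix for this model (a sketch;
# vocab size taken from the comment above, dimensions from BERT-base).
vocab_size = 29_794
hidden_size = 768
bytes_per_param = 4  # fp32

embedding_mb = vocab_size * hidden_size * bytes_per_param / 1024 ** 2
print(f"Embedding table alone: ~{embedding_mb:.0f} MB")  # ~87 MB

# Adam keeps two extra fp32 buffers per trainable parameter, so once BERT is
# being fine-tuned this roughly triples before counting any activations.
print(f"With Adam optimizer state: ~{3 * embedding_mb:.0f} MB")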

dirkgr commented 4 years ago

The difference is that the old one would not train BERT, it would just run it. In other words, the BERT layers were frozen. That way, it doesn't have to store gradients or activations for those layers, but of course the performance will be worse.

The new version does not have this capability, but it will be quick to add. I'll try to get it done next week.
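
For context, "frozen" here just means the transformer's weights receive no gradients. A minimal sketch in plain PyTorch/transformers of what the old behaviour amounted to (this is not the AllenNLP option being discussed; the model name is the one from the config above):

from transformers import AutoModel

# Load the transformer and turn off gradients for all of its parameters.
# It still runs in the forward pass, but no parameter gradients or optimizer
# state are kept for it, which is where the memory savings come from.
bert = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased")
for param in bert.parameters():
    param.requires_grad = False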

pvcastro commented 4 years ago

I see. So @dirkgr, suppose I run this one with an appropriate batch size on a V100. Does this mean the model with trainable BERT would have a lot more parameters and would thus be a lot slower and less scalable than my current 0.9.1 setting?

dirkgr commented 4 years ago

Yes, that's what it means.

Technically speaking, the number of parameters is the same, but the ones from BERT are not trained and receive no updates.
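
A quick way to see that distinction in any PyTorch model (generic code, not AllenNLP-specific):

import torch.nn as nn

def count_parameters(model: nn.Module):
    # Total parameters vs. parameters that will actually receive updates:
    # a frozen BERT contributes to the first number but not the second.
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable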

pvcastro commented 4 years ago

If the parameters of the trained final model are the same as in the current version, only actually trained, doesn't this mean that, at least for inference and prediction, the model would have the same performance as in 0.9.1?

dirkgr commented 4 years ago

It should, yes, but I'm not sure how easy it is to load the model correctly in the new version.

pvcastro commented 4 years ago

@dirkgr I managed to train with a batch size of 8 on a V100, but with the following outcome. Just switching the config to the new token indexer and embedder, the model didn't learn anything. Is there anything else I should be setting in the config?

 Metrics: {
  "best_epoch": 0,
  "peak_cpu_memory_MB": 3553.98,
  "peak_gpu_0_memory_MB": 31044,
  "peak_gpu_1_memory_MB": 12609,
  "peak_gpu_2_memory_MB": 31042,
  "peak_gpu_3_memory_MB": 16120,
  "peak_gpu_4_memory_MB": 10572,
  "peak_gpu_5_memory_MB": 25712,
  "peak_gpu_6_memory_MB": 12884,
  "peak_gpu_7_memory_MB": 1865,
  "training_duration": "0:23:44.686729",
  "training_start_epoch": 0,
  "training_epochs": 4,
  "epoch": 4,
  "training_accuracy": 0.8868650078312007,
  "training_accuracy3": 0.9365931321847573,
  "training_precision-overall": 0.0,
  "training_recall-overall": 0.0,
  "training_f1-measure-overall": 0.0,
  "training_loss": 67.34692806199419,
  "training_reg_loss": 0.0,
  "training_cpu_memory_MB": 3553.98,
  "training_gpu_0_memory_MB": 31044,
  "training_gpu_1_memory_MB": 12489,
  "training_gpu_2_memory_MB": 31042,
  "training_gpu_3_memory_MB": 16120,
  "training_gpu_4_memory_MB": 10572,
  "training_gpu_5_memory_MB": 25712,
  "training_gpu_6_memory_MB": 12884,
  "training_gpu_7_memory_MB": 1865,
  "validation_accuracy": 0.8833699184916705,
  "validation_accuracy3": 0.9352187452141508,
  "validation_precision-overall": 0.0,
  "validation_recall-overall": 0.0,
  "validation_f1-measure-overall": 0.0,
  "validation_loss": 65.62932859045087,
  "validation_reg_loss": 0.0,
  "best_validation_accuracy": 0.8833699184916705,
  "best_validation_accuracy3": 0.9364779553150577,
  "best_validation_precision-overall": 0.0,
  "best_validation_recall-overall": 0.0,
  "best_validation_f1-measure-overall": 0.0,
  "best_validation_loss": 138.4557712102177,
  "best_validation_reg_loss": 0.0,
  "test_accuracy": 0.8963938666802811,
  "test_accuracy3": 0.943142560541856,
  "test_precision-overall": 0.0,
  "test_recall-overall": 0.0,
  "test_f1-measure-overall": 0.0,
  "test_loss": 124.76361329386933
}

dirkgr commented 4 years ago

To train BERT, 1e-4 is a pretty high learning rate, and you are not using an LR scheduler or grad norm. Look at one of the transformer-based training configs (I recommend TransformerQA) to see the setup I usually use to train transformers.

pvcastro commented 4 years ago

Thanks @dirkgr! With this setting I'm able to train the model with good performance. I'll use this as a base for allentune, as suggested in #4225.

"trainer": {
        "optimizer": {
            "type": "huggingface_adamw",
            "weight_decay": 0.0,
            "parameter_groups": [[
                ["text_field_embedder", "encoder", "tag_projection_layer", "crf"],
                {"weight_decay": 0}
            ]],
            "lr": 2e-5,
            "eps": 1e-8
        },
        "learning_rate_scheduler": {
            "type": "slanted_triangular",
            //"num_epochs": epochs,
            "cut_frac": 0.1,
        },
        "grad_norm": 1.0,
        "grad_clipping": 1.0,
        "cuda_device": 0,
        "num_epochs": epochs,
        "checkpointer": {
            "num_serialized_models_to_keep": 1
        },
        "patience": 5,
        "validation_metric": "+f1-measure-overall"
    },