ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Inability to reproduce results of simpletransformers article using ELECTRA on Esperanto data #983

Closed Sharing-Sam-Work closed 3 years ago

Sharing-Sam-Work commented 3 years ago

Describe the bug
I cannot reproduce the results of the Simple Transformers article on training ELECTRA from scratch on the Esperanto language (https://towardsdatascience.com/understanding-electra-and-training-an-electra-language-model-3d33e3a9660d).

To Reproduce

Launch the attached script in a screen session with:
CUDA_VISIBLE_DEVICES=0 python run_mlm.py

Environment: the packages in the running environment are attached in env.txt. The most important ones are:

cudatoolkit 11.0.221 h6bb024c_0
simpletransformers 0.60.4 pypi_0 pypi
transformers 4.2.2 pypi_0 pypi
pytorch 1.7.0 py3.8_cuda11.0.221_cudnn8.0.3_0 pytorch
tqdm 4.49.0 pypi_0 pypi
tokenizers 0.9.4 py38_0 huggingface

Expected behavior
I expected the code to work, but it throws the errors attached in the image files error1_2.jpg and error2_2.jpg.

Note that there is an NVIDIA error:

usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.

/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.

There is also a user warning:

/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

Screenshots
The script used is attached.
Note that I removed evaluation during training to speed up training.
Note that the following line was also added to avoid a warning:

os.environ["TOKENIZERS_PARALLELISM"] = "true"

FYI, the wandb link for this training run is here:
https://wandb.ai/sam_enac/Esperanto%20-%20ELECTRA/runs/xzrqcl7g?workspace=user-sam_enac

Note that around global step 436 there is a sudden increase in training loss.

Note this warning between the end of epoch 1 and the start of epoch 2:

WARNING:root:NaN or Inf found in input tensor.

Finally, the full error is below:

Traceback (most recent call last):
  File "run_mlm.py", line 73, in <module>
    main()
  File "run_mlm.py", line 65, in main
    model.train_model(
  File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 376, in train_model
    global_step, training_details = self.train(
  File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 641, in train
    outputs = model(inputs, labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/custom_models/models.py", line 533, in forward
    sampled_tokens = torch.multinomial(sample_probs, 1).view(-1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
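
For reference, the same kind of error can be triggered in isolation when the sampling distribution contains NaN, for example after an overflow in the logits (a minimal sketch, not taken from my run):

import torch

# torch.multinomial rejects distributions containing NaN/Inf, e.g. when an
# overflow in the logits turns the softmax output into NaN.
logits = torch.tensor([[1.0, 2.0, float("inf")]])
probs = torch.softmax(logits, dim=-1)  # contains NaN
torch.multinomial(probs, 1)  # raises a RuntimeError like the one above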

Desktop (please complete the following information):
system='Linux'
node='dormammu'
release='5.4.0-62-generic'
version='#70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021'
machine='x86_64'
processor='x86_64'
gpus: "GeForce RTX 2080 Ti"

N.B.:
There are 8 GPUs on the server, but I used only one.
In another run, I tried using 6 GPUs (n_gpu=6), but the computation was unexpectedly slower and there was also the same odd increase in training loss, so I killed the process.

Below are the attached files:

error1_2.jpg, error2_2.jpg, env.txt, run_mlm.txt

This is my first time reporting an issue, so please don't hesitate to tell me if my report is missing something.

ThilinaRajapakse commented 3 years ago
usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.

/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.

This seems to indicate an issue with the NVIDIA drivers. It's highly unlikely to be related to Simple Transformers specifically.

/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

This should be safe to ignore. I'm not entirely sure where it's coming from (but it always happens), since lr_scheduler.step() isn't called before optimizer.step().

os.environ["TOKENIZERS_PARALLELISM"] = "true"

This warning/suggestion is given because both Simple Transformers and Tokenizers use multiprocessing. But it also seems to be harmless so far.

WARNING:root:NaN or Inf found in input tensor.

This is the issue; it suggests that there is some bad input data. Does it happen exactly between the 1st and 2nd epochs? That would be a little strange.
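
If you want to rule that out, a quick pass over the raw training file should surface obviously bad lines (a rough sketch, assuming one sentence per line as in the article; the path is a placeholder):

# Quick sanity check on the raw training text (path is a placeholder).
# Flags undecodable bytes, empty lines, and unusually long lines.
with open("data/esperanto_train.txt", "rb") as f:
    for i, raw in enumerate(f, start=1):
        try:
            line = raw.decode("utf-8").strip()
        except UnicodeDecodeError:
            print(f"line {i}: not valid UTF-8")
            continue
        if not line:
            print(f"line {i}: empty")
        elif len(line) > 10000:
            print(f"line {i}: suspiciously long ({len(line)} chars)")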

I'm running this right now and will update whether or not I see the same behaviour.

And, good job on the detailed issue!

Sharing-Sam-Work commented 3 years ago

Thanks for the answer. I don't know if it happened exactly between the 1st and 2nd epochs...

ThilinaRajapakse commented 3 years ago

I ran the code for 3 epochs and didn't run into this issue. Is it possible that the data was somehow corrupted?

ThilinaRajapakse commented 3 years ago

You could also try setting fp16=False, just in case this is being caused by some driver/GPU issue.
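
That's just a flag in the training args, e.g. (a rough sketch following the article's setup; the paths and vocab size are placeholders):

from simpletransformers.language_modeling import LanguageModelingModel

train_args = {
    "fp16": False,  # disable mixed precision to rule out a driver/GPU issue
    "num_train_epochs": 3,
    "train_batch_size": 32,
    "vocab_size": 52000,  # placeholder, match your tokenizer
}

# ELECTRA from scratch: model_name=None, tokenizer trained from train_files.
model = LanguageModelingModel(
    "electra",
    None,
    args=train_args,
    train_files="data/esperanto_train.txt",  # placeholder path
)
model.train_model("data/esperanto_train.txt")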

Sharing-Sam-Work commented 3 years ago

Trying fp16=False, thank you! When I looked at your wandb for ELECTRA, I noticed that you have a "wordpieces_prefix" parameter set to "##". However, it does not appear in my parameters in the wandb run overview, even when I put it in the train_args:

train_args = {
    "reprocess_input_data": False,
    "overwrite_output_dir": True,
    "num_train_epochs": 3,
    "save_eval_checkpoints": True,
    "save_model_every_epoch": False,
    "learning_rate": 5e-4,
    "weight_decay": 1e-2,
    "warmup_steps": 10000,
    "fp16": False,
    "train_batch_size": 32,
    "eval_batch_size": 32,
    "n_gpu": 8,  # added
    "wordpieces_prefix": "##",
    ...

Is this normal behaviour?

ThilinaRajapakse commented 3 years ago

"wordpieces_prefix" is not part of the Simple Transformers args so adding it to train_args won't do anything. It's from the Huggingface tokenizer, but it shouldn't really affect anything. I think the reason it's there on my wandb project but isn't on yours is because of changes to the Huggingface Transformers library. It shouldn't affect anything since I didn't have any issues when I ran the code earlier today.

flaviussn commented 3 years ago

Same error at sampled_tokens = torch.multinomial(sample_probs, 1).view(-1), but with different data. @Sharing-Sam-Work did you solve the problem by setting fp16=False?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

flaviussn commented 3 years ago

Problem solved by setting fp16=False.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.