Closed Sharing-Sam-Work closed 3 years ago
/usr/bin/nvidia-modprobe: unrecognized option: "-s"
ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.
/usr/bin/nvidia-modprobe: unrecognized option: "-s"
ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.
This seems to indicate an issue with the NVIDIA drivers. It's highly unlikely to be related to Simple Transformers specifically.
/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
This should be safe to ignore. I'm not entirely sure where it's coming from (but it always happens), since lr_scheduler.step() isn't called before optimizer.step().
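For context, this is the call ordering the warning refers to; a minimal, self-contained sketch (not the Simple Transformers internals, which handle this on their own):

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)

for step in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()    # optimizer first (PyTorch 1.1.0 and later) ...
    scheduler.step()    # ... then the scheduler, which avoids this warning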
os.environ["TOKENIZERS_PARALLELISM"] = "true"
This warning/suggestion is given because both Simple Transformers and Tokenizers use multiprocessing. But it also seems to be harmless so far.
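If you want to set it yourself, the variable just needs to be set before the tokenizer is used; a minimal sketch (the import path is the standard Simple Transformers one, everything else is illustrative):

import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"   # set before tokenizers run; "false" would disable parallelism instead

from simpletransformers.language_modeling import LanguageModelingModel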
WARNING:root:NaN or Inf found in input tensor.
This is the issue; it suggests that there is some bad input data. Does it happen exactly between the 1st and 2nd epochs? That would be a little strange.
I'm running this right now and will update whether or not I see the same behaviour.
And, good job on the detailed issue!
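If you want to rule out bad inputs on your side, a quick sanity check over the training text could look like this (the helper and the file path are purely illustrative, not part of Simple Transformers):

def find_suspicious_lines(path):
    # Report empty or whitespace-only lines, which can produce degenerate batches.
    bad = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                bad.append(i)
    return bad

print(find_suspicious_lines("data/train.txt")[:20])  # path is an assumption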
Thanks for the answer, I don't know if it happened exactly between the 1st and 2nd epochs...
I ran the code for 3 epochs and didn't run into this issue. Is it possible that the data was somehow corrupted?
You could also try setting fp16=False, just in case this is being caused by some driver/GPU issue.
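For reference, a hedged sketch of passing that through the model args (class and argument names follow the Simple Transformers language modeling API; the training file path is an assumption):

from simpletransformers.language_modeling import LanguageModelingModel

train_args = {"fp16": False}   # disable mixed precision

model = LanguageModelingModel(
    "electra",
    None,                          # training from scratch, so no pretrained model name
    args=train_args,
    train_files="data/train.txt",  # assumed path
)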
Trying fp16=False, thank you! When I looked at your wandb for Electra, I noticed that you have a "wordpieces_prefix" parameter set to "##". However, it does not appear in my parameters in the wandb run overview, even when I put it in the train_args:
train_args = {
    "reprocess_input_data": False,
    "overwrite_output_dir": True,
    "num_train_epochs": 3,
    "save_eval_checkpoints": True,
    "save_model_every_epoch": False,
    "learning_rate": 5e-4,
    "weight_decay": 1e-2,
    "warmup_steps": 10000,
    "fp16": False,
    "train_batch_size": 32,
    "eval_batch_size": 32,
    "n_gpu": 8,  # added
    "wordpieces_prefix": "##",
    ...
Is this normal behaviour?
"wordpieces_prefix" is not part of the Simple Transformers args so adding it to train_args
won't do anything. It's from the Huggingface tokenizer, but it shouldn't really affect anything. I think the reason it's there on my wandb project but isn't on yours is because of changes to the Huggingface Transformers library. It shouldn't affect anything since I didn't have any issues when I ran the code earlier today.
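For reference, "wordpieces_prefix" is a tokenizer-level setting; a hedged sketch of where it actually lives, using the Huggingface tokenizers library (the training file path is an assumption):

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(wordpieces_prefix="##")   # "##" marks word-continuation pieces
tokenizer.train(files=["data/train.txt"], vocab_size=52000, wordpieces_prefix="##")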
Same error in sampled_tokens = torch.multinomial(sample_probs, 1).view(-1)
but with other data. @Sharing-Sam-Work did you solve the problem by setting fp16=False?
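For what it's worth, that call fails as soon as the probability tensor contains NaN/Inf (e.g. after an fp16 overflow earlier in the forward pass); a minimal illustration (exact behaviour may differ between PyTorch versions and CPU/CUDA):

import torch

sample_probs = torch.tensor([0.2, float("nan"), 0.5])
try:
    sampled_tokens = torch.multinomial(sample_probs, 1).view(-1)
except RuntimeError as e:
    print("multinomial failed:", e)   # invalid (NaN/Inf) probability distribution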
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Problem solved by setting fp16=False.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Describe the bug
I cannot reproduce the results of Simple Transformers training ELECTRA from scratch on the Esperanto language (https://towardsdatascience.com/understanding-electra-and-training-an-electra-language-model-3d33e3a9660d).
To Reproduce
Launch attached script in a screen with:
CUDA_VISIBLE_DEVICES=0 python run_mlm.py
env: the packages in the running environment are attached in env.txt.
FYI, these are the most important packages:
cudatoolkit 11.0.221 h6bb024c_0
simpletransformers 0.60.4 pypi_0 pypi
transformers 4.2.2 pypi_0 pypi
pytorch 1.7.0 py3.8_cuda11.0.221_cudnn8.0.3_0 pytorch
tqdm 4.49.0 pypi_0 pypi
tokenizers 0.9.4 py38_0 huggingface
Expected behavior
I expected the code to work, but it throws the errors attached in the image files error1_2.jpg and error2_2.jpg.
You can notice that there is an nvidia error:
You can notice that there is a user warning:
Screenshots
The script used is attached.
You can notice that I removed evaluation during training to speed it up.
You can also notice the following line, added to avoid a warning:
os.environ["TOKENIZERS_PARALLELISM"] = "true"
FYI, Wandb link of training is here:
https://wandb.ai/sam_enac/Esperanto%20-%20ELECTRA/runs/xzrqcl7g?workspace=user-sam_enac
You can notice that around global step 436 there is a sudden increase in training loss.
You can also notice this warning between the end of epoch 1 and the start of epoch 2.
Finally, the full error is below:
Desktop (please complete the following information):
system='Linux'
node='dormammu'
release='5.4.0-62-generic'
version='#70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021'
machine='x86_64'
processor='x86_64'
gpus: "GeForce RTX 2080 Ti"
N.B.:
There are 8 GPUs on the server, but I used only one.
In another run, I tried using 6 GPUs (n_gpu=6), but the computation was unexpectedly slower and there was also the weird increase in training loss, so I killed the process.
Below are the attached files.
env.txt run_mlm.txt
This is my first time reporting an issue, so please don't hesitate to tell me if my report is missing something.