XuezheMax / NeuroNLP2

Deep neural models for core NLP tasks (Pytorch version)
GNU General Public License v3.0

RuntimeError: CUDA error: an illegal instruction was encountered #41

Closed by yzhangcs 4 years ago

yzhangcs commented 4 years ago

Hi, thanks for your nice project. I encountered this error when I tried to train the NeuroMST model. Below are some error logs:

./scripts/run_neuromst.sh 
Namespace(amsgrad=False, batch_size=32, beam=1, beta1=0.9, beta2=0.9, char_embedding='random', char_path=None, config='configs/parsing/neuromst.json', cuda=True, dev='data/ptb/dev.conllx', eps=0.0001, freeze=False, grad_clip=5.0, learning_rate=0.001, loss_type='token', lr_decay=0.999995, mode='train', model_path='models/parsing/neuromst/', num_epochs=400, optim='adam', punctuation=['.', '``', "''", ':', ','], reset=20, test='data/ptb/test.conllx', train='data/ptb/train.conllx', unk_replace=0.5, warmup_steps=40, weight_decay=0.0, word_embedding='sskip', word_path='data/glove.6B.100d.gz')
loading embedding: sskip from data/glove.6B.100d.gz
2020-05-23 17:16:35,234 - Parsing - INFO - Creating Alphabets
2020-05-23 17:16:35,265 - Create Alphabets - INFO - Word Alphabet Size (Singleton): 37377 (14112)
2020-05-23 17:16:35,265 - Create Alphabets - INFO - Character Alphabet Size: 83
2020-05-23 17:16:35,265 - Create Alphabets - INFO - POS Alphabet Size: 48
2020-05-23 17:16:35,265 - Create Alphabets - INFO - Type Alphabet Size: 48
2020-05-23 17:16:35,266 - Parsing - INFO - Word Alphabet Size: 37377
2020-05-23 17:16:35,266 - Parsing - INFO - Character Alphabet Size: 83
2020-05-23 17:16:35,266 - Parsing - INFO - POS Alphabet Size: 48
2020-05-23 17:16:35,266 - Parsing - INFO - Type Alphabet Size: 48
2020-05-23 17:16:35,266 - Parsing - INFO - punctuations(5): . `` '' , :
word OOV: 1017
2020-05-23 17:16:35,356 - Parsing - INFO - constructing network...
2020-05-23 17:16:42,994 - Parsing - INFO - Network: NeuroMST-FastLSTM, num_layer=3, hidden=512, act=elu
2020-05-23 17:16:42,994 - Parsing - INFO - dropout(in, out, rnn): variational(0.33, 0.33, [0.33, 0.33])
2020-05-23 17:16:42,994 - Parsing - INFO - # of Parameters: 22298677
2020-05-23 17:16:42,994 - Parsing - INFO - Reading Data
Reading data from data/ptb/train.conllx
reading data: 10000
reading data: 20000
reading data: 30000
Total number of data: 39832
Reading data from data/ptb/dev.conllx
Total number of data: 1700
Reading data from data/ptb/test.conllx
Total number of data: 2416
2020-05-23 17:16:55,055 - Parsing - INFO - training: #training data: 39831, batch: 32, unk replace: 0.50
Epoch 1 (adam, betas=(0.9, 0.900), eps=1.0e-04, amsgrad=False, lr=0.000000, lr decay=0.999995, grad clip=5.0, l2=0.0e+00): 
CUDA runtime error: an illegal instruction was encountered (73) in magma_dgetrf_batched at /opt/conda/conda-bld/magma-cuda92_1564975048006/work/src/dgetrf_batched.cpp:213
CUDA runtime error: an illegal instruction was encountered (73) in magma_dgetrf_batched at /opt/conda/conda-bld/magma-cuda92_1564975048006/work/src/dgetrf_batched.cpp:214
CUDA runtime error: an illegal instruction was encountered (73) in magma_dgetrf_batched at /opt/conda/conda-bld/magma-cuda92_1564975048006/work/src/dgetrf_batched.cpp:215
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda92_1564975048006/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda92_1564975048006/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda92_1564975048006/work/interface_cuda/interface.cpp:946
Traceback (most recent call last):
  File "parsing.py", line 651, in <module>
    train(args)
  File "parsing.py", line 360, in train
    loss_arc, loss_type = network.loss(words, chars, postags, heads, types, mask=masks)
  File "neuronlp2/models/parsing.py", line 281, in loss
    loss_arc = self.treecrf.loss(arc[0], arc[1], heads, mask=mask)
  File "neuronlp2/nn/crf.py", line 269, in loss
    z = torch.logdet(L)
RuntimeError: CUDA error: an illegal instruction was encountered

I am not sure what went wrong. Looking at the outputs, I found that during the first epoch the Laplacian matrix has the following form:

[ x -1 -1
 -1  x -1
 -1 -1  x]

where x corresponds to a sum of exponentials. Is this form prone to numerical overflow when computing the determinant? I would appreciate any suggestions.

XuezheMax commented 4 years ago

I am not sure why you encountered this error. This implementation uses two strategies to prevent numerical overflow. First, when calculating logdet(L), we convert L to double precision. Second, during training, we skip mini-batches in which NaN gradients occur.
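
For readers who want to see what these two strategies might look like, here is a minimal sketch. It is not the repository's actual code: the helper names `stable_logdet` and `grads_are_finite`, the toy parameter `w`, and the toy matrix are all assumptions made for illustration.

```python
import torch

def stable_logdet(L: torch.Tensor) -> torch.Tensor:
    """Compute logdet in float64, then cast back to the input dtype (hypothetical helper)."""
    return torch.logdet(L.double()).to(L.dtype)

def grads_are_finite(parameters) -> bool:
    """Return True if every existing gradient holds only finite values (hypothetical helper)."""
    return all(
        torch.isfinite(p.grad).all().item()
        for p in parameters
        if p.grad is not None
    )

# Toy stand-in for a model: one parameter vector feeding a 3x3 matrix that has
# large positive values on the diagonal and -1 everywhere else.
w = torch.nn.Parameter(torch.tensor([1.0, 2.0, 3.0]))
optimizer = torch.optim.Adam([w], lr=1e-3)

off_diagonal = torch.ones(3, 3) - torch.eye(3)
L = torch.diag(w.exp() + 2.0) - off_diagonal

optimizer.zero_grad()
loss = stable_logdet(L)
loss.backward()

# Skip the update when any gradient is NaN or inf.
stepped = False
if grads_are_finite([w]):
    optimizer.step()
    stepped = True
else:
    optimizer.zero_grad()
```

The cast back to the input dtype keeps the rest of the loss in single precision; only the determinant itself is evaluated in double precision.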

As for the CUDA error in your experiments, please print out the L matrix to check whether it contains invalid values such as inf or NaN.
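
One way to run that check is sketched below. The function name `report_nonfinite` and the two example matrices are made up for illustration; in practice the call would go right before `torch.logdet(L)`.

```python
import torch

def report_nonfinite(name: str, t: torch.Tensor) -> bool:
    """Print a short summary of a tensor and return True if it holds NaN or inf."""
    n_nan = int(torch.isnan(t).sum().item())
    n_inf = int(torch.isinf(t).sum().item())
    if n_nan or n_inf:
        print(f"{name}: {n_nan} NaN, {n_inf} inf, shape={tuple(t.shape)}")
        return True
    print(f"{name}: all finite, min={t.min().item():.4g}, max={t.max().item():.4g}")
    return False

# Hypothetical matrices: one clean, one containing invalid entries.
L_ok = torch.tensor([[3.0, -1.0], [-1.0, 3.0]])
L_bad = torch.tensor([[float("inf"), -1.0], [-1.0, float("nan")]])

ok_has_invalid = report_nonfinite("L_ok", L_ok)
bad_has_invalid = report_nonfinite("L_bad", L_bad)
```

Printing the min and max for a clean matrix also shows how large the diagonal entries have grown, which is useful when overflow is the suspected cause.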

yzhangcs commented 4 years ago

Thanks for your reply. It seems to be a problem with the device (Titan V) itself. I tried running the model on an RTX 2080 Ti and it works.

yzhangcs commented 4 years ago

Hi, just a tip: the overflow/underflow issues can be handled well with torch.slogdet (see this link).
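
A small sketch of why `torch.slogdet` helps, using a matrix of the same shape as the one described earlier in the thread (a large value on the diagonal, -1 elsewhere). The size `n` and the value `x` are arbitrary assumptions chosen so that the plain determinant exceeds the float64 range.

```python
import torch

n, x = 50, 1e10  # assumed values, picked so that det(L) is roughly 1e500

# Matrix with x on the diagonal and -1 everywhere else.
L = torch.full((n, n), -1.0, dtype=torch.float64)
L.fill_diagonal_(x)

# The determinant itself is far beyond the largest float64 (about 1.8e308),
# so it is expected to overflow.
det = torch.det(L)

# slogdet returns the sign and log|det| separately and never forms det itself.
sign, logabsdet = torch.slogdet(L)

print("det:", det.item())
print("sign:", sign.item(), "log|det|:", logabsdet.item())
```

Because the sign is returned separately, a non-positive determinant can be detected explicitly before the log value is used in a loss.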