facebookresearch / SymbolicMathematics

Deep Learning for Symbolic Mathematics

HELP ! RuntimeError: CUDA error: device-side assert triggered #2

Open wanngweiwei opened 4 years ago

wanngweiwei commented 4 years ago

I downloaded this repository, containing the code, datasets, and trained models, and tried to run the commands in the IPython notebook provided by Dr. Lample, but I hit a bug that I cannot solve. The first 10 inputs in the notebook run well, but In [11] ("Decode with beam search") throws an error:

File "", line 109, in , _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 544, in generate_beam cache[k] = (cache[k][0][beam_idx], cache[k][1][beam_idx])

RuntimeError: CUDA error: device-side assert triggered_

My environment is Windows 10, Anaconda3, Python 3.7.5, PyTorch (GPU), torch.cuda.is_available() == True, and two NVIDIA Quadro P4000 GPUs; they work well in other programs.

wanngweiwei commented 4 years ago

I emailed this problem to Dr. Lample. He was kind enough to reply quickly, so I will paste his answer here:

Hi Weiwei,

I'm not sure what is happening, but this is the kind of issue that usually happens when one indexes an array with a value larger than what is available (for instance, the lookup table has 100 embeddings, but you query word 105 or something). The problem with CUDA is that it's not clear where the issue is happening, because it runs asynchronously.
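As a rough illustration of the failure mode described here (the sizes are made up for illustration, not taken from the model), an out-of-range lookup on the GPU surfaces as the same opaque assert:

```python
import torch

emb = torch.nn.Embedding(100, 16).cuda()   # lookup table with 100 embeddings
idx = torch.tensor([105]).cuda()           # 105 is outside [0, 100)
out = emb(idx)                             # RuntimeError: CUDA error: device-side assert triggered
```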

Did you modify the code? What is the command you ran? Can you try the same command with the CUDA_LAUNCH_BLOCKING prefix? "CUDA_LAUNCH_BLOCKING=1 python ....." and see what happens? This should give a better error message about where the issue is exactly happening.
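Since the machine here runs Windows, the `CUDA_LAUNCH_BLOCKING=1 python ...` prefix is a POSIX-shell idiom and will not work as-is in cmd; a rough equivalent (and what is done later in this thread) is to set the variable in Python before anything touches CUDA:

```python
# first notebook cell, before any CUDA work is done
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'   # make CUDA errors synchronous so the traceback points at the real line
```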

Also, would you mind posting the issue on GitHub, in case someone else faces the same problem?

Thank you, Guillaume

wanngweiwei commented 4 years ago

I am so excited to get his reply. Thank you very, very much!

I tried adding the prefix "CUDA_LAUNCH_BLOCKING=1"; the error is now:

File "", line 110, in , , beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 540, in generate_beam generated = generated[:, beam_idx]

RuntimeError: CUDA error: device-side assert triggered

When the prefix is changed to "CUDA_LAUNCH_BLOCKING=0", the error is the same as without any CUDA_LAUNCH_BLOCKING prefix:

File "", line 110, in , , beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)

File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 544, in generate_beam cache[k] = (cache[k][0][beam_idx], cache[k][1][beam_idx])

RuntimeError: CUDA error: device-side assert triggered

wanngweiwei commented 4 years ago

Is there anybody who can help me?

glample commented 4 years ago

Hi @wanngweiwei, sorry for the delay. The CUDA_LAUNCH_BLOCKING=1 output is helpful; the error seems to come from this line: generated = generated[:, beam_idx]

I don't understand how this error can happen, though. Do you have the full command you used to get this error, so I can try to reproduce it?

Also, did you make modifications in the code? Could you try to print the shape of generated and the beam_idx value, with print(generated.shape, beam_idx) just before it fails?
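A sketch of where that print would go, just above the failing line in src/model/transformer.py (around line 540 per the traceback; the comment about the expected range is an assumption, not quoted from the source):

```python
# inside generate_beam(), just before the line reported in the traceback
print(generated.shape, beam_idx)       # beam_idx entries should lie in [0, bs * beam_size)
generated = generated[:, beam_idx]     # the line that triggers the device-side assert
```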

Best, Guillaume

wanngweiwei commented 4 years ago
[screenshot of the printed output, 2020-04-19]

Thank you, Dr. Lample. I tried to print the values as you advised (screenshot above). But please forgive me: I am new to seq2seq and beam search, and not very familiar with Python. Can you give more guidance here? Thank you so much.

glample commented 4 years ago

Okay, so generated has the right shape. Not sure what is going on with beam_idx, though; huge values like 794946954264578 look like a bug. What version of PyTorch are you using?

Could you try print(sent_id, beam_size, beam_id) just before next_sent_beam.append((value, word_id, sent_id * beam_size + beam_id)) and look at the output? This is the value that ends up converted into something weird.
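For readers following along: the third element of each appended tuple is what later becomes beam_idx, so a corrupted value here would explain the crash. A paraphrased sketch of that data flow (names approximated, not the exact source):

```python
print(sent_id, beam_size, beam_id)      # all three should be small non-negative integers
next_sent_beam.append((value, word_id, sent_id * beam_size + beam_id))
# ... candidates from all sentences are gathered into next_batch_beam, then roughly:
beam_idx = torch.tensor([x[2] for x in next_batch_beam], device=generated.device)
generated = generated[:, beam_idx]      # a huge x[2] shows up here as the device-side assert
```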

Again, that would be helpful if you could provide me with the command you use to have this issue. I could try to debug and fix it on my side.

wanngweiwei commented 4 years ago

Thank you, Dr. Lample. My PyTorch version is 1.3.0; the print output is shown below.

[screenshot of the print output, 2020-04-19]

The commands I used are just the cells of the IPython notebook provided with this code; they are:

In [1]:

```python
import os
import numpy as np
import sympy as sp
import torch

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

from src.utils import AttrDict
from src.envs import build_env
from src.model import build_modules

from src.utils import to_cuda
from src.envs.sympy_utils import simplify
```

In [2]:

```python
assert os.path.isfile(model_path)
```

In [3]:

```python
params = AttrDict({
    'env_name': 'char_sp',
    'int_base': 10,
    'balanced': False,
    'positive': True,
    'precision': 10,
    'n_variables': 1,
    'n_coefficients': 0,
    'leaf_probs': '0.75,0,0.25,0',
    'max_len': 512,
    'max_int': 5,
    'max_ops': 15,
    'max_ops_G': 15,
    'clean_prefix_expr': True,
    'rewrite_functions': '',
    'tasks': 'prim_fwd',
    'operators': 'add:10,sub:3,mul:10,div:5,sqrt:4,pow2:4,pow3:2,pow4:1,pow5:1,ln:4,exp:4,sin:4,cos:4,tan:4,asin:1,acos:1,atan:1,sinh:1,cosh:1,tanh:1,asinh:1,acosh:1,atanh:1',
    'cpu': False,
    'emb_dim': 1024,
    'n_enc_layers': 6,
    'n_dec_layers': 6,
    'n_heads': 8,
    'dropout': 0,
    'attention_dropout': 0,
    'sinusoidal_embeddings': False,
    'share_inout_emb': True,
    'reload_model': model_path,
})
```

In [4]:

```python
env = build_env(params)
x = env.local_dict['x']
```

In [5]:

```python
modules = build_modules(env, params)
encoder = modules['encoder']
decoder = modules['decoder']
```

In [6]:

```python
F_infix = 'x * tan(exp(x)/x)'
F_infix = 'x * cos(x**2) * tan(x)'
F_infix = 'cos(x**2 * exp(x * cos(x)))'
F_infix = 'ln(cos(x + exp(x)) * sin(x**2 + 2) * exp(x) / x)'
```

In [7]:

```python
F = sp.S(F_infix, locals=env.local_dict)
F
```

In [8]:

```python
f = F.diff(x)
f
```

In [9]:

```python
F_prefix = env.sympy_to_prefix(F)
f_prefix = env.sympy_to_prefix(f)
print(f"F prefix: {F_prefix}")
print(f"f prefix: {f_prefix}")
```

In [10]:

```python
x1_prefix = env.clean_prefix(['sub', 'derivative', 'f', 'x', 'x'] + f_prefix)
x1 = torch.LongTensor(
    [env.eos_index] +
    [env.word2id[w] for w in x1_prefix] +
    [env.eos_index]
).view(-1, 1)
len1 = torch.LongTensor([len(x1)])
x1, len1 = to_cuda(x1, len1)

with torch.no_grad():
    encoded = encoder('fwd', x=x1, lengths=len1, causal=False).transpose(0, 1)
```

In [11]:

```python
beam_size = 10
with torch.no_grad():
    _, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size,
                                       length_penalty=1.0, early_stopping=1, max_len=200)
    assert len(beam) == 1
hypotheses = beam[0].hyp
assert len(hypotheses) == beam_size
```

Then the error comes at In [11]...

glample commented 4 years ago

Can you try to do print(idx, n_words) just before the beam_id = idx // n_words line? Basically, I want to find the first line where a gigantic value appears.
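Some context on that line: the per-sentence scores are flattened over (beam, vocabulary) before the top-k selection, so each index idx encodes both a beam and a word. A rough sketch of the decomposition with the requested print added (paraphrased, not the exact source):

```python
print(idx, n_words)          # a sane idx is smaller than beam_size * n_words
beam_id = idx // n_words     # which beam the candidate came from
word_id = idx % n_words      # which vocabulary token it is
```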

wanngweiwei commented 4 years ago

Dear Dr. Lample, the printing shows the following (two screenshots, 2020-04-21):

I hope this gives you some useful information.

glample commented 4 years ago

I see. So it is the next_words variable that contains the huge values. The problem must come from this line: next_scores, next_words = torch.topk(_scores, 2 * beam_size, dim=1, largest=True, sorted=True)

Can you try to inspect whether there is anything wrong with the _scores variable? Maybe try to print it, if the printed matrix is not too large. I suspect there are some NaNs in _scores, cf. https://github.com/allenai/allennlp/issues/2028
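A quick way to test that suspicion, as a sketch (the print message is illustrative, not from the repo): check _scores for NaN/Inf right before the topk call.

```python
# just before the torch.topk call in generate_beam
if torch.isnan(_scores).any() or torch.isinf(_scores).any():
    print("bad values in _scores:", _scores.min().item(), _scores.max().item())
next_scores, next_words = torch.topk(_scores, 2 * beam_size, dim=1, largest=True, sorted=True)
```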

It's very difficult for me to help like this; I really need to investigate on my computer. Can you tell me the command you ran / how I can reproduce this error?