wanngweiwei opened this issue 4 years ago
I asked Dr. Lample about this problem. He was so kind as to reply to me quickly. I will paste his answer here:
Hi Weiwei,
I'm not sure what is happening, but this is the kind of issue that usually happens when one indexes an array with a higher value than what is available (for instance, the lookup table has 100 embeddings, but you query word 105 or something). The problem with CUDA is that it's not clear where the issue is happening, because it runs asynchronously.
Did you modify the code? What is the command you ran? Can you try the same command with the CUDA_LAUNCH_BLOCKING prefix, "CUDA_LAUNCH_BLOCKING=1 python .....", and see what happens? This should give a better error message about where exactly the issue is happening.
Also, would you mind posting the issue on GitHub, in case someone else faces the same problem?
Thank you, Guillaume
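To illustrate the kind of out-of-range lookup he describes, here is a minimal, hypothetical sketch (not code from this repository; the sizes and indices are made up):

```python
import torch
import torch.nn as nn

# Hypothetical lookup table with 100 embeddings (valid indices are 0..99).
embedding = nn.Embedding(num_embeddings=100, embedding_dim=16)

ids = torch.tensor([3, 105])   # 105 is out of range
out = embedding(ids)           # on CPU: "IndexError: index out of range in self"
                               # on CUDA (embedding.cuda(), ids.cuda()): "RuntimeError: CUDA error:
                               # device-side assert triggered", often reported at a later, unrelated
                               # line because CUDA kernels run asynchronously
```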
I am so excited to get his reply. Merci beaucoup!
I tried adding the prefix `CUDA_LAUNCH_BLOCKING=1`, and then the error is:
File "
File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 540, in generate_beam generated = generated[:, beam_idx]
RuntimeError: CUDA error: device-side assert triggered
When the prefix is changed to `CUDA_LAUNCH_BLOCKING=0`, the error is the same as with no `CUDA_LAUNCH_BLOCKING` prefix at all:
File "
File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 544, in generate_beam cache[k] = (cache[k][0][beam_idx], cache[k][1][beam_idx])
RuntimeError: CUDA error: device-side assert triggered
Can anybody help me?
Hi @wanngweiwei, sorry for the delay. The `CUDA_LAUNCH_BLOCKING=1` is helpful; the error seems to come from this line: `generated = generated[:, beam_idx]`
I don't understand how this error can happen, though. Do you have the full command that you used to get this error, so I can try to reproduce it? Also, did you make modifications to the code? Could you try to print the shape of `generated` and the `beam_idx` value, with `print(generated.shape, beam_idx)` just before it fails?
Best, Guillaume
Thank you, Dr. Lample. I tried to print things as you advised. But please forgive me: I am new to seq2seq and beam search, and not very familiar with Python. Can you give more guidance here? Thank you so much.
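For readers in the same position, the suggestion amounts to adding a single line in `src/model/transformer.py`, directly above the statement that fails inside `generate_beam` (a sketch only; the exact line number may differ between versions):

```python
# In src/model/transformer.py, inside generate_beam(), just above the failing line:
print(generated.shape, beam_idx)    # inspect the shape of `generated` and the values in `beam_idx`
generated = generated[:, beam_idx]  # the line reported in the traceback
```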
Okay, so `generated` has the right shape. Not sure what is going on with `beam_idx`, though. These huge values like `794946954264578` look like a bug. What version of PyTorch are you using?
Could you try to add `print(sent_id, beam_size, beam_id)` just before `next_sent_beam.append((value, word_id, sent_id * beam_size + beam_id))` and see the output? This is what is converted to something weird.
Again, it would be helpful if you could provide me with the command you use to get this issue. I could then try to debug and fix it on my side.
Thank you, Dr. Lample. My PyTorch version is 1.3.0; the print output is shown below.
The commands I used are just the ones from the IPython notebook provided with this code. They are:
```python
import os
import numpy as np
import sympy as sp
import torch

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

from src.utils import AttrDict
from src.envs import build_env
from src.model import build_modules
from src.utils import to_cuda
from src.envs.sympy_utils import simplify

# model_path is set in an earlier notebook cell (path to the downloaded pre-trained model)
assert os.path.isfile(model_path)

params = AttrDict({
    'env_name': 'char_sp', 'int_base': 10, 'balanced': False, 'positive': True,
    'precision': 10, 'n_variables': 1, 'n_coefficients': 0, 'leaf_probs': '0.75,0,0.25,0',
    'max_len': 512, 'max_int': 5, 'max_ops': 15, 'max_ops_G': 15,
    'clean_prefix_expr': True, 'rewrite_functions': '', 'tasks': 'prim_fwd',
    'operators': 'add:10,sub:3,mul:10,div:5,sqrt:4,pow2:4,pow3:2,pow4:1,pow5:1,ln:4,exp:4,sin:4,cos:4,tan:4,asin:1,acos:1,atan:1,sinh:1,cosh:1,tanh:1,asinh:1,acosh:1,atanh:1',
    'cpu': False, 'emb_dim': 1024, 'n_enc_layers': 6, 'n_dec_layers': 6, 'n_heads': 8,
    'dropout': 0, 'attention_dropout': 0, 'sinusoidal_embeddings': False,
    'share_inout_emb': True, 'reload_model': model_path,
})

env = build_env(params)
x = env.local_dict['x']

modules = build_modules(env, params)
encoder = modules['encoder']
decoder = modules['decoder']

F_infix = 'x * tan(exp(x)/x)'
F_infix = 'x * cos(x**2) * tan(x)'
F_infix = 'cos(x**2 * exp(x * cos(x)))'
F_infix = 'ln(cos(x + exp(x)) * sin(x**2 + 2) * exp(x) / x)'

F = sp.S(F_infix, locals=env.local_dict)
F

f = F.diff(x)
f

F_prefix = env.sympy_to_prefix(F)
f_prefix = env.sympy_to_prefix(f)
print(f"F prefix: {F_prefix}")
print(f"f prefix: {f_prefix}")

x1_prefix = env.clean_prefix(['sub', 'derivative', 'f', 'x', 'x'] + f_prefix)
x1 = torch.LongTensor(
    [env.eos_index] + [env.word2id[w] for w in x1_prefix] + [env.eos_index]
).view(-1, 1)
len1 = torch.LongTensor([len(x1)])
x1, len1 = to_cuda(x1, len1)
with torch.no_grad():
    encoded = encoder('fwd', x=x1, lengths=len1, causal=False).transpose(0, 1)

# In [11]: decode with beam search
beam_size = 10
with torch.no_grad():
    _, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size,
                                       length_penalty=1.0, early_stopping=1, max_len=200)
assert len(beam) == 1
hypotheses = beam[0].hyp
assert len(hypotheses) == beam_size
```
Then the error appears at In [11].
Can you try to add `print(idx, n_words)` just before the `beam_id = idx // n_words` line?
Basically, I want to find the first line where a gigantic value appears.
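For context, here is a self-contained toy version of the decomposition being instrumented; the variable names (`next_words`, `n_words`, `beam_id`, `word_id`) follow the messages above, and the sizes are made up:

```python
import torch

# Toy illustration: each candidate index returned by topk over a flattened
# (beam, word) score matrix decomposes into a beam id and a word id.
n_words, beam_size = 8, 2
scores = torch.randn(1, beam_size * n_words)   # flattened scores for one sentence
next_scores, next_words = torch.topk(scores, 2 * beam_size, dim=1, largest=True, sorted=True)
for idx in next_words[0]:
    beam_id = idx // n_words    # which beam the candidate came from (should be < beam_size)
    word_id = idx % n_words     # which word in the vocabulary (should be < n_words)
    print(int(idx), int(beam_id), int(word_id))

# A corrupted idx like 794946954264578 (as reported above) produces a huge beam_id, so
# sent_id * beam_size + beam_id points far outside the tensors being indexed, which is
# exactly the kind of out-of-range access that triggers the device-side assert later on.
```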
Dear Dr. Lample, the printing shows this:
I hope this gives you some useful information.
I see. So it is the `next_words` variable which contains the huge values. The problem must come from this line:
`next_scores, next_words = torch.topk(_scores, 2 * beam_size, dim=1, largest=True, sorted=True)`
Can you try to inspect whether there is anything wrong with the `_scores` variable? Maybe try to print it, even if the printed matrix is large.
I suspect there are some `NaN` values in `_scores`, c.f. https://github.com/allenai/allennlp/issues/2028
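A quick way to test that hypothesis would be a check just above the `topk` call (a sketch; `_scores` and `beam_size` come from the surrounding `generate_beam` code):

```python
# Just above the topk call in generate_beam(): count any NaN/inf values in _scores
n_nan = torch.isnan(_scores).sum().item()
n_inf = torch.isinf(_scores).sum().item()
if n_nan or n_inf:
    print(f"_scores contains {n_nan} NaN and {n_inf} inf values")
next_scores, next_words = torch.topk(_scores, 2 * beam_size, dim=1, largest=True, sorted=True)
```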
It's very difficult for me to help like this; I really need to investigate on my computer. Can you tell me the command you ran / how I can reproduce this error?
I downloaded this repository, with its code, datasets, and trained models, and tried to run the commands in the IPython notebook provided by Dr. Lample, but I get an error that I cannot solve. The first 10 inputs in the notebook run well, but In [11], "Decode with beam search", throws this error:
File "", line 109, in
, _, beam = decoder.generate_beam(encoded, len1, beam_size=beam_size, length_penalty=1.0, early_stopping=1, max_len=200)
File "D:\LampleCharton2019\SymbolicMathematics-master\src\model\transformer.py", line 544, in generate_beam cache[k] = (cache[k][0][beam_idx], cache[k][1][beam_idx])
RuntimeError: CUDA error: device-side assert triggered_
My environment is Windows 10, Anaconda 3, Python 3.7.5, and PyTorch (GPU) with torch.cuda.is_available() == True, on two NVIDIA Quadro P4000 GPUs; they work well in other programs.