Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License

Reproduced crash on Opus-mt-en-de model using string "J" and "J-10" #71

Closed Qubitium closed 1 year ago

Qubitium commented 1 year ago

Try either of the two inputs below. After about 16 seconds the web UI returns "J..........." or "J-10............", but in fact the request caused a server crash.

https://huggingface.co/Helsinki-NLP/opus-mt-en-de?text=J-10
https://huggingface.co/Helsinki-NLP/opus-mt-en-de?text=J

Env: Conda, PyTorch 1.13, CUDA 11.7, transformers on GPU

The crash also happens on a CPU-only device.

An error occurred, model: en->de, translating: ['J']
Stacktrace:
Traceback (most recent call last):
  File "/raid0/translate/app.py", line 202, in trans
    translated.extend(translator.translate(sents))
  File "/raid0/translate/translator.py", line 60, in translate
    return self.translator.translate(input_text)
  File "/raid0/translate/model.py", line 114, in translate
    return self._translate(input_text)
  File "/raid0/translate/model.py", line 96, in _translate
    translated = self.model.generate(**tokens, max_new_tokens=50000)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1577, in generate
    return self.beam_search(
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 2747, in beam_search
    outputs = self(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1440, in forward
    outputs = self.model(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1240, in forward
    decoder_outputs = self.decoder(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1042, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 195, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

On the opus-mt-es-fr model we saw another GPU crash with the very same stack trace. The UI link below (CPU) shows a failed translation with gibberish output at the end. On GPU it should produce a stack trace like the previous one.

https://huggingface.co/Helsinki-NLP/opus-mt-es-fr

- ¿Porqué crees que renté una habitación con una poza privada… Akane? – Le mordió el lóbulo. – Ella se encogió de hombros. – Para quitarte todas esas dudas de la cabeza.

Qubitium commented 1 year ago

Here is the full output for the input string "J-10":

/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [1,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
An error occurred, model: en->de, translating: ['J-10']
Stacktrace:
Traceback (most recent call last):
  File "/raid0/translate/app.py", line 202, in trans
    translated.extend(translator.translate(sents))
  File "/raid0/translate/translator.py", line 60, in translate
    return self.translator.translate(input_text)
  File "/raid0/translate/model.py", line 114, in translate
    return self._translate(input_text)
  File "/raid0/translate/model.py", line 96, in _translate
    translated = self.model.generate(**tokens, max_new_tokens=50000)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1577, in generate
    return self.beam_search(
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 2747, in beam_search
    outputs = self(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1440, in forward
    outputs = self.model(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1240, in forward
    decoder_outputs = self.decoder(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1042, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 195, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`
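
One possible reading of the log above, not confirmed anywhere in this thread: the `srcIndex < srcSelectDimSize` assertion comes from an index lookup whose index lies past the end of a table, and `generate` is called with `max_new_tokens=50000`. If the model degenerates into repeating a token (the "J-10............" output in the web UI hints at that), the decoder may run past the number of positions the checkpoint supports. The sketch below assumes that limit is exposed as `model.config.max_position_embeddings` and clamps the budget to it. `clamp_max_new_tokens` and `generate_bounded` are hypothetical helper names, not part of transformers or of the reporter's `model.py`.

```python
def clamp_max_new_tokens(requested: int, max_positions: int,
                         decoder_prompt_len: int = 1) -> int:
    """Cap the generation budget so decoder positions stay inside the table.

    decoder_prompt_len counts tokens already in the decoder input
    (assumed here to be just the start token of a translation model).
    """
    if max_positions <= decoder_prompt_len:
        raise ValueError("max_positions must exceed decoder_prompt_len")
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return min(requested, max_positions - decoder_prompt_len)


def generate_bounded(model, tokens, requested: int = 512):
    """Call model.generate with a clamped max_new_tokens.

    Assumes `model` is a Hugging Face seq2seq model whose config carries
    max_position_embeddings, and `tokens` is the tokenizer output mapping.
    """
    limit = model.config.max_position_embeddings
    budget = clamp_max_new_tokens(requested, limit)
    return model.generate(**tokens, max_new_tokens=budget)
```

With a 512-position checkpoint, a request for 50000 new tokens would be reduced to 511. Whether this removes the crash for these particular inputs has not been verified here.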

Qubitium commented 1 year ago

Another model crash: Helsinki-NLP/opus-mt-en-ar, input: ['Freaky Friday']

/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [0,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [0,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [0,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1666642975993/work/aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [0,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
An error occurred, model: Helsinki-NLP/opus-mt-en->ar, translating: ['Freaky Friday']
Stacktrace:
Traceback (most recent call last):
  File "/raid0/translate/app.py", line 202, in trans
    translated.extend(translator.translate(sents))
  File "/raid0/translate/translator.py", line 95, in translate
    t1_translated = self.t1.translate(input_text)
  File "/raid0/translate/translator.py", line 60, in translate
    return self.translator.translate(input_text)
  File "/raid0/translate/model.py", line 117, in translate
    return self._translate(input_text)
  File "/raid0/translate/model.py", line 100, in _translate
    translated = self.model.generate(**tokens, max_new_tokens=2048)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1577, in generate
    return self.beam_search(
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 2766, in beam_search
    next_token_scores_processed = logits_processor(input_ids, next_token_scores)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_logits_process.py", line 92, in __call__
    scores = processor(input_ids, scores)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_logits_process.py", line 435, in __call__
    dynamic_banned_tokens = self._calc_banned_bad_words_ids(input_ids.tolist())
RuntimeError: CUDA error: device-side assert triggered

Qubitium commented 1 year ago

Closing issue. Using the transformers pipeline("translate".....Opus) avoids all the crashes.
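
For reference, a minimal sketch of the pipeline route mentioned above. The original call is elided, so everything concrete here is an assumption: that the comment refers to the transformers `pipeline` factory, that the task string is `"translation"`, and that the model is a Helsinki-NLP checkpoint such as `Helsinki-NLP/opus-mt-en-de`. `build_translator` and `translate_all` are hypothetical helper names. The thread does not say why the pipeline avoids the crashes; one guess is that it keeps the checkpoint's default generation length rather than a very large `max_new_tokens`.

```python
def build_translator(model_name: str = "Helsinki-NLP/opus-mt-en-de", device: int = -1):
    """Create a translation pipeline (device=-1 means CPU, 0 the first GPU).

    The import is done lazily so this module loads even where
    transformers is not installed.
    """
    from transformers import pipeline
    return pipeline("translation", model=model_name, device=device)


def translate_all(translator, texts):
    """Run the pipeline on a list of strings and return plain strings.

    A translation pipeline returns one dict per input with the key
    "translation_text".
    """
    return [item["translation_text"] for item in translator(texts)]


# Example use (downloads the checkpoint on first call):
# translator = build_translator()
# print(translate_all(translator, ["J", "J-10"]))
```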