NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] ModuleNotFoundError: No module named 'scaled_softmax_cuda' #749

Open liuliuliu0605 opened 7 months ago

liuliuliu0605 commented 7 months ago

Describe the bug

When I try to run single-GPU T5 pretraining with the script examples/pretrain_t5.sh, it fails with the following error:

ModuleNotFoundError: No module named 'scaled_softmax_cuda'

Is the scaled_softmax_cuda module missing from the codebase, or do I need to install it from a separate Python package?

Stack trace/logs

Traceback (most recent call last):
  File "/home/ubuntu/projects/Megatron-LM/pretrain_t5.py", line 239, in 
    pretrain(train_valid_test_datasets_provider, model_provider, ModelType.encoder_and_decoder,
  File "/home/ubuntu/projects/Megatron-LM/megatron/training.py", line 261, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/home/ubuntu/projects/Megatron-LM/megatron/training.py", line 967, in train
    train_step(forward_step_func,
  File "/home/ubuntu/projects/Megatron-LM/megatron/training.py", line 532, in train_step
    losses_reduced = forward_backward_func(
  File "/home/ubuntu/projects/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 372, in forward_backward_no_pipelining
    output_tensor = forward_step(
  File "/home/ubuntu/projects/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 192, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/home/ubuntu/projects/Megatron-LM/pretrain_t5.py", line 176, in forward_step
    output_tensor = model(tokens_enc,
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 179, in forward
    return self.module(*inputs, **kwargs)
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/module.py", line 190, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/t5_model.py", line 118, in forward
    lm_output = self.language_model(encoder_input_ids,
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/language_model.py", line 527, in forward
    decoder_output = self.decoder(
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 1776, in forward
    hidden_states = layer(
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 1210, in forward
    self.default_decoder_cross_attention(
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 943, in default_decoder_cross_attention
    self.inter_attention(norm_output,
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 798, in forward
    context_layer = self.core_attention(
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 384, in forward
    attention_probs = self.scale_mask_softmax(attention_scores,
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/fused_softmax.py", line 148, in forward
    return self.forward_fused_softmax(input, mask)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/fused_softmax.py", line 190, in forward_fused_softmax
    return ScaledSoftmax.apply(input, scale)
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/fused_softmax.py", line 80, in forward
    import scaled_softmax_cuda
ModuleNotFoundError: No module named 'scaled_softmax_cuda'
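
For context, the failing import happens lazily inside the autograd Function's forward pass, which is why the error only appears once training reaches the first attention softmax rather than at startup. A simplified sketch of the pattern in megatron/model/fused_softmax.py (the forward body here is an approximation, not the exact source):

```python
import torch

class ScaledSoftmax(torch.autograd.Function):
    """Fused softmax that defers importing the CUDA extension."""

    @staticmethod
    def forward(ctx, inputs, scale):
        # The apex-built extension is imported only when the first
        # forward pass runs; if the kernel was never built, this is
        # where ModuleNotFoundError is raised.
        import scaled_softmax_cuda

        scale_t = torch.tensor([scale])
        softmax_results = scaled_softmax_cuda.forward(inputs, scale_t[0])
        ctx.save_for_backward(softmax_results, scale_t)
        return softmax_results
```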

Environment (please complete the following information):

yuantailing commented 7 months ago

You may use either of the following solutions:

  1. The scaled_softmax_cuda library is part of apex. You may install it from https://github.com/NVIDIA/apex.
  2. Add --no-masked-softmax-fusion to avoid using the fused kernel.
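
To confirm which of the fused softmax kernels are actually importable after installing apex, a minimal check like the following can help (a sketch; the third kernel name is taken from Megatron's fused_softmax.py, which also uses scaled_upper_triang_masked_softmax_cuda for causal masking):

```python
import importlib

# Probe each fused-softmax extension that apex is expected to provide.
for name in (
    "scaled_softmax_cuda",
    "scaled_masked_softmax_cuda",
    "scaled_upper_triang_masked_softmax_cuda",
):
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError as exc:
        print(f"{name}: missing ({exc})")
```
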
liuliuliu0605 commented 7 months ago

You may use either of the following solutions:

  1. The scaled_softmax_cuda library is part of apex. You may install it from https://github.com/NVIDIA/apex.
  2. Add --no-masked-softmax-fusion to avoid using the fused kernel.

Thank you for your reply. Solution 2 fixed the problem. However, even after installing apex, the ModuleNotFoundError still occurs. The install command was as follows:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

pip list | grep apex shows apex at version 0.1, and I can find scaled_masked_softmax_cuda.cpython-310-x86_64-linux-gnu.so under torch2.0.0-cu118-cp310/lib/python3.10/site-packages, but there is no scaled_softmax_cuda. Did scaled_softmax_cuda fail to install?
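
One way to check this without importing anything is to list the compiled extension files that actually landed in site-packages (a minimal sketch; the glob pattern is an assumption about how the extensions are named):

```python
import pathlib
import sysconfig

# Compiled extensions are installed into the platform library directory.
site_packages = pathlib.Path(sysconfig.get_paths()["platlib"])
for so in sorted(site_packages.glob("*softmax_cuda*.so")):
    print(so.name)
# If scaled_masked_softmax_cuda.*.so is listed but scaled_softmax_cuda.*.so
# is not, the scaled_softmax_cuda extension was never built.
```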

yuantailing commented 7 months ago

@liuliuliu0605 I installed apex in the same way; scaled_softmax_cuda should also be included in apex.

[screenshot]

liuliuliu0605 commented 7 months ago

@yuantailing Thanks for providing the details. I remember that when I installed the apex master branch, the build failed; the log is in install.log. Could it be caused by an incompatible CUDA version?

So I chose to install the apex 22.04-dev branch instead, which does not actually include the scaled_softmax_cuda.cu file. That is why the scaled_softmax_cuda module cannot be found.
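
Before building, one can verify that a given apex checkout actually ships the kernel source (a sketch; the csrc/megatron path is an assumption about apex's source layout and may differ between branches):

```python
import pathlib

apex_root = pathlib.Path("apex")  # path to the local apex clone
kernel = apex_root / "csrc" / "megatron" / "scaled_softmax_cuda.cu"
print(f"{kernel}: {'present' if kernel.exists() else 'missing'}")
```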

github-actions[bot] commented 5 months ago

Marking as stale. No activity in 60 days.