databricks / megablocks

Apache License 2.0

Has anyone encountered this CUDA error? #62

bozheng-hit closed this issue 9 months ago

bozheng-hit commented 9 months ago

File "/home/workspace/megablocks/megatron/training.py", line 455, in train_step losses_reduced = forward_backward_func( File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 331, in forward_backward_no_pipelining backward_step(grad_scaler, input_tensor, output_tensor, File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 257, in backward_step custom_backward(output_tensor[0], output_tensor_grad[0]) File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 154, in custom_backward Variable._execution_engine.run_backward( File "/anaconda3/envs/moe/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply return user_fn(self, args) File "/anaconda3/envs/moe/lib/python3.10/site-packages/stk/backend/autocast.py", line 36, in decorate_bwd return bwd(args, **kwargs) File "/anaconda3/envs/moe/lib/python3.10/site-packages/megablocks/ops/padded_scatter.py", line 40, in backward dgrad = kernels.padded_gather( File "/anaconda3/envs/moe/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 118, in padded_gather output_rows = padded_bins[-1].cpu().item() RuntimeError: CUDA error: an illegal memory access was encountered

I tried to run dMoE on 8x8 A100 GPUs and this error occurred frequently.

tgale96 commented 9 months ago

Hi, you're running on 64 GPUs? Can you share how you've configured the run?

bozheng-hit commented 9 months ago

> Hi, you're running on 64 GPUs? Can you share how you've configured the run?

Part of the arguments are shown below, thanks for the fast reply!

DISTRIBUTED_ARGS="
    --nproc_per_node 8 \
    --nnodes 8 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

MOE_ARGS="\
    --moe-num-experts=64 \
    --moe-loss-weight=0.1 \
    --moe-top-k=2 \
    --moe-capacity-factor=0 \
    --moe-expert-model-parallelism \
    --no-async-tensor-model-parallel-allreduce
"

torchrun $DISTRIBUTED_ARGS ../pretrain_gpt_moe.py \
    $GPT_ARGS \
    $MOE_ARGS \
    $DATA_ARGS \
    --distributed-backend nccl

tgale96 commented 9 months ago

If the error truly is in the access to the `padded_bins` tensor, then it must have zero elements. The size of that tensor equals the number of experts owned by the local rank, so that could only happen if no experts were assigned to the rank.
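
(For context, a rough illustration of what the failing line computes; this is a simplification, not the actual megablocks kernel code.)

```python
import torch

# Simplified illustration: padded_bins holds cumulative, padded row offsets for
# the experts owned by this rank, so its last element is the total number of
# rows the gathered output needs.
padded_bins = torch.tensor([128, 256, 384, 512], device="cuda")  # 4 local experts
output_rows = padded_bins[-1].cpu().item()  # -> 512

# If a rank owned no experts, padded_bins would be empty and its last element
# could not be read. The .cpu() copy also forces a device synchronization, so a
# pending asynchronous error from an earlier kernel would surface here as well.
```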

CUDA errors are asynchronous, so the real error could be somewhere else. Would you mind trying to reproduce the error with `CUDA_LAUNCH_BLOCKING=1` so we can verify its source? I don't currently have access to 64 GPUs, and I haven't seen this on a smaller machine :/
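
(For reproduction: the variable has to be in the environment before the CUDA context is created on every rank, e.g. exported ahead of torchrun or set at the very top of the entry script, as in the sketch below.)

```python
# Minimal sketch: CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized,
# so set it before torch is imported (or export it in the launching shell on
# every node).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make kernel launches synchronous

import torch  # noqa: E402  (deliberately imported after setting the variable)

# With blocking launches, the traceback points at the kernel that actually
# faulted instead of at a later synchronization point.
```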

bozheng-hit commented 9 months ago

> If the error truly is in the access to the `padded_bins` tensor, then it must have zero elements. The size of that tensor equals the number of experts owned by the local rank, so that could only happen if no experts were assigned to the rank.
>
> CUDA errors are asynchronous, so the real error could be somewhere else. Would you mind trying to reproduce the error with `CUDA_LAUNCH_BLOCKING=1` so we can verify its source? I don't currently have access to 64 GPUs, and I haven't seen this on a smaller machine :/

I haven't been able to reproduce the error over the past few days. I guess it was caused by a broken node...

tgale96 commented 9 months ago

Great, glad that is sorted out!

jramapuram commented 7 months ago

Unfortunately I am still seeing this error, @tgale96 @bozheng-hit. It only happens with dMoE, not MoE, on the latest release of megablocks, and it happens on different nodes:

  File "/miniconda/lib/python3.10/site-packages/megablocks/layers/moe.py", line 425, in forward
    x, tokens_per_expert = self.forward_fn(
  File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 268, in forward_once
    return self.sparse_forward_once(
  File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 138, in sparse_forward_once
    x = ops.padded_gather(
  File "/miniconda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/miniconda/lib/python3.10/site-packages/stk/backend/autocast.py", line 28, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/miniconda/lib/python3.10/site-packages/megablocks/ops/padded_gather.py", line 14, in forward
    return kernels.padded_gather(
  File "/miniconda/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 118, in padded_gather
    output_rows = padded_bins[-1].cpu().item()
RuntimeError: CUDA error: an illegal memory access was encountered

tgale96 commented 7 months ago

Hi! Can you run with CUDA_LAUNCH_BLOCKING=1 so we can verify the source of the error?

jramapuram commented 7 months ago

Attached -- not sure how helpful this is though.

File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 327, in forward
    return self.experts(x, scores, expert_weights, top_experts)
  File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/lib/python3.10/site-packages/megablocks/layers/moe.py", line 425, in forward
    x, tokens_per_expert = self.forward_fn(
  File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 268, in forward_once
    return self.sparse_forward_once(
  File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 151, in sparse_forward_once
    x = self.mlp(x, topo)
  File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/lib/python3.10/site-packages/megablocks/layers/mlp.py", line 399, in forward
    return memory_optimized_mlp(
  File "/miniconda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/miniconda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 115, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/miniconda/lib/python3.10/site-packages/megablocks/layers/mlp.py", line 207, in forward
    dsd_out = stk.ops.dsd(activation_fn_out, w2)
  File "/miniconda/lib/python3.10/site-packages/stk/ops/linear_ops.py", line 10, in dsd
    return sputnik.dsd(
  File "/miniconda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/miniconda/lib/python3.10/site-packages/stk/backend/autocast.py", line 28, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/miniconda/lib/python3.10/site-packages/stk/backend/sputnik.py", line 116, in forward
    backend.dsd(shape,
  File "/miniconda/lib/python3.10/site-packages/stk/backend/triton_kernels.py", line 235, in dsd
    _dsd_kernel[grid](
  File "/miniconda/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 156, in run
    ret = self.fn.run(
  File "/miniconda/lib/python3.10/site-packages/triton/runtime/jit.py", line 550, in run
    bin.c_wrapper(
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /tmp/tmp.uxd39ue5d5/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f7f1333cc9c in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f7f132e6a5c in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7f7f133f4c8c in /miniconda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xff2324 (0x7f7f14418324 in /miniconda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x5808c4 (0x7f7f3f2708c4 in /miniconda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x63151 (0x7f7f13320151 in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x223 (0x7f7f13318593 in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0xd (0x7f7f1331872d in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x85b448 (0x7f7f3f54b448 in /miniconda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x316 (0x7f7f3f54b7e6 in /miniconda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1243f3 (0x556a4945d3f3 in /miniconda/bin/python)
frame #11: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
[... frames #12 through #57 are identical to frame #11 ...]
frame #58: <unknown function> + 0x14db76 (0x556a49486b76 in /miniconda/bin/python)
frame #59: _PyTrash_thread_destroy_chain + 0x29 (0x556a49552979 in /miniconda/bin/python)
frame #60: <unknown function> + 0x125a6f (0x556a4945ea6f in /miniconda/bin/python)
frame #61: PyDict_SetItemString + 0x51 (0x556a49462261 in /miniconda/bin/python)
frame #62: <unknown function> + 0x208316 (0x556a49541316 in /miniconda/bin/python)
frame #63: Py_FinalizeEx + 0x146 (0x556a49540526 in /miniconda/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

tgale96 commented 7 months ago

It's very helpful! The error is in one of the kernels rather than in the function where it originally surfaced.

Could you share a script to reproduce the error?

jramapuram commented 7 months ago

I can try to put together a minimal repro script, but it might take a bit. Appreciate you taking a look here 🙏
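
(In case it helps as a starting point, here is a rough single-GPU sketch that just pushes random data through a standalone dMoE layer; the import paths, Arguments fields, and return value below are assumptions about the megablocks layers API and may need adjusting for a specific version.)

```python
# Hypothetical repro sketch -- the Arguments fields and the dMoE return value
# are assumptions about the megablocks layers API and may differ by version.
import torch
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=64,
    moe_top_k=2,
    moe_capacity_factor=0,  # dropless, matching the config earlier in the thread
)

layer = dMoE(args).cuda().half()
x = torch.randn(2048, 8, 1024, device="cuda", dtype=torch.float16, requires_grad=True)

out = layer(x)                                   # forward through the sparse path
out = out[0] if isinstance(out, tuple) else out  # some versions also return a bias
out.sum().backward()                             # exercises padded_gather/padded_scatter
torch.cuda.synchronize()                         # surface any asynchronous CUDA errors
```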

tgale96 commented 7 months ago

Happy to help!

whksmo commented 6 months ago

Same issue here. I noticed that after the very first megablocks op (e.g. sort), the output tensor gets really weird, containing large values around 1e8, which then cause the illegal memory access error. The exception is raised during model inference unless the model is deployed on cuda:0. It seems like the CUDA stream/device used by the megablocks ops differs from the device holding the model.
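
(If that diagnosis is right, one quick check is to pin the current CUDA device to the model's device before running inference; a small plain-PyTorch sketch below, with a dummy Linear layer standing in for the real model.)

```python
# Sketch: if, as the comment above suggests, the custom kernels are launched
# against the *current* CUDA device/stream while the model lives on cuda:N,
# setting the current device first is a quick way to test (or rule out) that
# hypothesis. The Linear layer is just a stand-in for the real model.
import torch

device = torch.device("cuda:1")                  # wherever the model is deployed
model = torch.nn.Linear(1024, 1024).to(device)
batch = torch.randn(8, 1024, device=device)

torch.cuda.set_device(device)                    # make cuda:1 the current device
with torch.no_grad():
    out = model(batch)
```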

tgale96 commented 6 months ago

Interesting. Are you using the sparse or grouped code paths?

jramapuram commented 6 months ago

Sparse here

tgale96 commented 6 months ago

Do you run into the same error with the grouped MLP?
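
(For anyone following along: switching between the two paths is normally a one-line config change. The sketch below assumes the relevant Arguments field is named mlp_impl and that the grouped path relies on the grouped_gemm package; both are assumptions that may not hold for every megablocks version.)

```python
# Hedged sketch: select the grouped expert MLP instead of the sparse
# (STK/Triton) one. The field name `mlp_impl` and its accepted values are
# assumptions about the megablocks Arguments dataclass.
from megablocks.layers.arguments import Arguments

args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=64,
    moe_top_k=2,
    mlp_impl="grouped",  # "sparse" is the usual default; "grouped" uses grouped_gemm
)
```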