Closed bozheng-hit closed 9 months ago
Hi, you're running on 64 GPUs? Can you share how you've configured the run?
Part of the arguments are shown below, thanks for the fast reply!
DISTRIBUTED_ARGS=" \
    --nproc_per_node 8 \
    --nnodes 8 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT "

MOE_ARGS="\
    --moe-num-experts=64 \
    --moe-loss-weight=0.1 \
    --moe-top-k=2 \
    --moe-capacity-factor=0 \
    --moe-expert-model-parallelism \
    --no-async-tensor-model-parallel-allreduce "

torchrun $DISTRIBUTED_ARGS ../pretrain_gpt_moe.py \
    $GPT_ARGS \
    $MOE_ARGS \
    $DATA_ARGS \
    --distributed-backend nccl
If the error truly comes from accessing the padded_bins tensor, then that tensor must have zero elements. The size of that tensor is equal to the number of experts owned by the local rank, so that could only happen if the rank owned no experts.
CUDA errors are asynchronous so the real error could be somewhere else. Would you mind trying to reproduce the error with CUDA_LAUNCH_BLOCKING=1 so we can verify the source of the error? I don't currently have access to 64 GPUs and I haven't seen this on a smaller machine :/
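To make the reasoning above concrete, here is a minimal sketch of the expert-count arithmetic for the posted configuration. It assumes that, with --moe-expert-model-parallelism, experts are split evenly across every rank in the job; the helper name experts_per_rank_sketch is hypothetical and is not the MegaBlocks API.

```python
def experts_per_rank_sketch(num_experts: int, nproc_per_node: int, nnodes: int) -> int:
    """Hypothetical helper: number of experts owned by each rank.

    Assumes experts are divided evenly over all ranks in the job
    (an assumption for illustration, not code taken from MegaBlocks).
    """
    world_size = nproc_per_node * nnodes
    if num_experts % world_size != 0:
        # Covers the case discussed above: fewer experts than ranks
        # would leave some ranks with no experts at all.
        raise ValueError(
            f"{num_experts} experts cannot be split evenly over {world_size} ranks"
        )
    return num_experts // world_size


# Values taken from the posted DISTRIBUTED_ARGS and MOE_ARGS.
local_experts = experts_per_rank_sketch(num_experts=64, nproc_per_node=8, nnodes=8)
print(local_experts)  # 1
```

Under that assumption each of the 64 ranks owns exactly one expert, so padded_bins would have one element and indexing its last entry should be valid, which is why the reported line looks like a symptom rather than the cause.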
I haven't been able to reproduce the error recently. I guess it was caused by a broken node...
Great, glad that is sorted out!
I am still seeing this error unfortunately @tgale96 @bozheng-hit. It only happens with dmoe, not moe, on the latest release of megablocks, and it happens on different nodes:
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/moe.py", line 425, in forward
x, tokens_per_expert = self.forward_fn(
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 268, in forward_once
return self.sparse_forward_once(
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 138, in sparse_forward_once
x = ops.padded_gather(
File "/miniconda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/miniconda/lib/python3.10/site-packages/stk/backend/autocast.py", line 28, in decorate_fwd
return fwd(*args, **kwargs)
File "/miniconda/lib/python3.10/site-packages/megablocks/ops/padded_gather.py", line 14, in forward
return kernels.padded_gather(
File "/miniconda/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 118, in padded_gather
output_rows = padded_bins[-1].cpu().item()
RuntimeError: CUDA error: an illegal memory access was encountered
Hi! Can you run with CUDA_LAUNCH_BLOCKING=1 so we can verify the source of the error?
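For anyone following along, one way to apply this setting is to add the variable to the environment of the launched process. The sketch below only builds that environment; the helper name blocking_env is hypothetical, and the launch command in the trailing comment is a placeholder rather than the exact command used in this thread.

```python
import os


def blocking_env(base_env=None):
    """Hypothetical helper: copy an environment and force synchronous CUDA launches.

    CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so an error is
    reported at the launch that caused it instead of at a later, unrelated call.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Small demonstration with an explicit base environment.
demo_base = {"PATH": "/usr/bin"}
demo_env = blocking_env(demo_base)
print(demo_env["CUDA_LAUNCH_BLOCKING"])  # 1

# Usage sketch (not executed here; arguments are placeholders):
#   import subprocess
#   subprocess.run(["torchrun", "--nproc_per_node", "8", "train.py"], env=blocking_env())
```

Runs with this setting are slower, so it is best used only while chasing the failing kernel.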
Attached -- not sure how helpful this is though.
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 327, in forward
return self.experts(x, scores, expert_weights, top_experts)
File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/moe.py", line 425, in forward
x, tokens_per_expert = self.forward_fn(
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 268, in forward_once
return self.sparse_forward_once(
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/dmoe.py", line 151, in sparse_forward_once
x = self.mlp(x, topo)
File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/mlp.py", line 399, in forward
return memory_optimized_mlp(
File "/miniconda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/miniconda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 115, in decorate_fwd
return fwd(*args, **kwargs)
File "/miniconda/lib/python3.10/site-packages/megablocks/layers/mlp.py", line 207, in forward
dsd_out = stk.ops.dsd(activation_fn_out, w2)
File "/miniconda/lib/python3.10/site-packages/stk/ops/linear_ops.py", line 10, in dsd
return sputnik.dsd(
File "/miniconda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/miniconda/lib/python3.10/site-packages/stk/backend/autocast.py", line 28, in decorate_fwd
return fwd(*args, **kwargs)
File "/miniconda/lib/python3.10/site-packages/stk/backend/sputnik.py", line 116, in forward
backend.dsd(shape,
File "/miniconda/lib/python3.10/site-packages/stk/backend/triton_kernels.py", line 235, in dsd
_dsd_kernel[grid](
File "/miniconda/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 156, in run
ret = self.fn.run(
File "/miniconda/lib/python3.10/site-packages/triton/runtime/jit.py", line 550, in run
bin.c_wrapper(
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /tmp/tmp.uxd39ue5d5/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f7f1333cc9c in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f7f132e6a5c in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7f7f133f4c8c in /miniconda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xff2324 (0x7f7f14418324 in /miniconda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x5808c4 (0x7f7f3f2708c4 in /miniconda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x63151 (0x7f7f13320151 in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x223 (0x7f7f13318593 in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0xd (0x7f7f1331872d in /miniconda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x85b448 (0x7f7f3f54b448 in /miniconda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x316 (0x7f7f3f54b7e6 in /miniconda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1243f3 (0x556a4945d3f3 in /miniconda/bin/python)
frame #11: <unknown function> + 0x13d5a7 (0x556a494765a7 in /miniconda/bin/python)
frame #12: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #13: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #14: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #15: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #16: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #17: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #18: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #19: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #20: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #21: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #22: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #23: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #24: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #25: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #26: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #27: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #28: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #29: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #30: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #31: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #32: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #33: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #34: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #35: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #36: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #37: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #38: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #39: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #40: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #41: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #42: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #43: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #44: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #45: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #46: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #47: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #48: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #49: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #50: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #51: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #52: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #53: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #54: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #55: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #56: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #57: <unknown function> + 0x13d68b (0x556a4947668b in /miniconda/bin/python)
frame #58: <unknown function> + 0x14db76 (0x556a49486b76 in /miniconda/bin/python)
frame #59: _PyTrash_thread_destroy_chain + 0x29 (0x556a49552979 in /miniconda/bin/python)
frame #60: <unknown function> + 0x125a6f (0x556a4945ea6f in /miniconda/bin/python)
frame #61: PyDict_SetItemString + 0x51 (0x556a49462261 in /miniconda/bin/python)
frame #62: <unknown function> + 0x208316 (0x556a49541316 in /miniconda/bin/python)
frame #63: Py_FinalizeEx + 0x146 (0x556a49540526 in /miniconda/bin/python)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
It's very helpful! The error is in one of the kernels rather than in the function that originally raised it.
Could you share a script to reproduce the error?
I can try to put together a minimal repro script, but it might take a bit. Appreciate you taking a look here 🙏
Happy to help!
Same issue here. I noticed that after the very first mega_op (e.g. sort), the output tensor looks really weird, containing large values around 1e8, which then cause the illegal memory access error. The exception is raised during model inference unless the model is deployed on cuda:0. It seems the CUDA stream used by the mega_op does not match the device holding the model.
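A cheap way to catch this symptom before it reaches a gather kernel is to bounds-check the indices produced by the first op. The sketch below is plain Python over a list of integers; the function name find_out_of_range is hypothetical, and with a real tensor one would first copy the indices to the host, which is an assumption about how it would be wired in.

```python
def find_out_of_range(indices, num_rows):
    """Return (position, value) pairs for indices outside [0, num_rows).

    Hypothetical debugging helper: indices like these would make a gather
    read outside the source buffer.
    """
    return [(pos, val) for pos, val in enumerate(indices) if not 0 <= val < num_rows]


good = [2, 0, 1, 3]
bad = [2, 0, 100_000_000, 3]  # one corrupted entry, on the order of 1e8

print(find_out_of_range(good, 4))  # []
print(find_out_of_range(bad, 4))   # [(2, 100000000)]
```

If the check fires only when the model is not on cuda:0, that would support the device mismatch theory described above.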
Interesting. Are you using the sparse or grouped code paths?
Sparse here
Do you run into the same error with the grouped MLP?
File "/home/workspace/megablocks/megatron/training.py", line 455, in train_step
losses_reduced = forward_backward_func(
File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 331, in forward_backward_no_pipelining
backward_step(grad_scaler, input_tensor, output_tensor,
File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 257, in backward_step
custom_backward(output_tensor[0], output_tensor_grad[0])
File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 154, in custom_backward
Variable._execution_engine.run_backward(
File "/anaconda3/envs/moe/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/anaconda3/envs/moe/lib/python3.10/site-packages/stk/backend/autocast.py", line 36, in decorate_bwd
return bwd(*args, **kwargs)
File "/anaconda3/envs/moe/lib/python3.10/site-packages/megablocks/ops/padded_scatter.py", line 40, in backward
dgrad = kernels.padded_gather(
File "/anaconda3/envs/moe/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 118, in padded_gather
output_rows = padded_bins[-1].cpu().item()
RuntimeError: CUDA error: an illegal memory access was encountered
I tried to run dMoE on 8x8 A100 GPUs and this error occurred frequently.