Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

"nvFuser" illegal memory access with falcon-7b model #659

Closed · tfogal closed this 2 days ago

tfogal commented 2 days ago

🐛 Bug

+ NCCL_ASYNC_ERROR_HANDLING=1
+ TORCH_NCCL_ASYNC_ERROR_HANDLING=1
+ export NCCL_ASYNC_ERROR_HANDLING=1
+ export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
An error occurred: RuntimeError – _result == CUDA_SUCCESS INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/executor_utils.cpp":888, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS failed with error an illegal memory access was encountered
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x53 (0x7ffd208d4753 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)

To Reproduce

set -e
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=600
python -m mixology_logs.execution.main \
--nsys.enable True \
--nsys.output_path /jet/assets/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-2_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-none_-none_-s_-lit-gpt/nsys_report \
--nsys.new_kwargs '{"--nsys_enabled": "True", "--output_dir": "/tmp"}' \
'{"--micro_batch_size": "exp_range(0, 10)"}' \
"python thunder/benchmarks/benchmark_litgpt.py \
    --max_iters 20 \
    --warmup_iters 5 \
    --output_dir /jet/logs/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-2_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-none_-none_-s_-lit-gpt \
    --model_name falcon-7b \
    --distributed_mode ddp \
    --shard_mode None \
    --compile thunder_cudnn \
    --checkpoint_activations False \
    --low_precision_mode none"
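
For triage it may help to cut the mixology wrapper and nsys out of the loop. A minimal single-GPU sketch of the same workload is below; the litgpt import path, the from_name model lookup, and the batch shape are assumptions patterned after benchmark_litgpt.py, not taken from the failing run:

import torch
import thunder
from litgpt import GPT, Config  # assumed import path for the lit-gpt model zoo

# Build falcon-7b in bfloat16 and compile it with Thunder
# (the default executor list includes nvFuser).
config = Config.from_name("falcon-7b")
with torch.device("cuda"):
    model = GPT(config).to(torch.bfloat16)
jitted = thunder.jit(model)

# One training-style step; the shape is a placeholder, not the benchmark's batch.
x = torch.randint(0, config.vocab_size, (1, 512), device="cuda")
logits = jitted(x)
logits.float().sum().backward()
torch.cuda.synchronize()  # force any deferred CUDA error to surface here

If this reproduces, inspecting the fusions in thunder.last_traces(jitted) would narrow down which kernel is involved.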

The following command produces a very similar error message, but it does not come from nvFuser; it surfaces as a plain RuntimeError.

set -e
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=600
python -m mixology_logs.execution.main \
--nsys.enable True \
--nsys.output_path /jet/assets/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-1_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-fp8-delayed-te-wo-layernorm_-none_-s_-lit-gpt/nsys_report \
--nsys.new_kwargs '{"--nsys_enabled": "True", "--output_dir": "/tmp"}' \
'{"--micro_batch_size": "exp_range(0, 10)"}' \
"python thunder/benchmarks/benchmark_litgpt.py \
    --max_iters 20 \
    --warmup_iters 5 \
    --output_dir /jet/logs/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-1_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-fp8-delayed-te-wo-layernorm_-none_-s_-lit-gpt \
    --model_name falcon-7b \
    --distributed_mode ddp \
    --shard_mode None \
    --compile thunder_cudnn \
    --checkpoint_activations False \
    --low_precision_mode fp8-delayed-te-wo_layernorm"
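
For context on the failing configuration: the fp8-delayed-te-wo_layernorm mode appears to select TransformerEngine's delayed-scaling FP8 recipe while leaving the layernorms out of FP8. In isolation that autocast looks roughly like the sketch below (layer sizes and recipe parameters are placeholders, not the benchmark's values):

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Delayed scaling derives FP8 scale factors from a history of recent amax values.
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
y.float().sum().backward()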

Additional context

This is very unlikely to actually be an nvFuser issue; most likely nvFuser just happens to be the first component to notice the asynchronous CUDA error.
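
One way to test that theory (a generic CUDA debugging sketch, nothing the benchmark does today): rerun with blocking launches so the error is attributed to the kernel that actually faulted, rather than to whichever later launch, here nvFuser's, happens to check the status next.

import os
# Must be set before CUDA initializes; makes every kernel launch synchronous.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def checked_step(model, batch):
    loss = model(batch).float().sum()
    loss.backward()
    # With asynchronous launches, a deferred illegal access would surface here
    # instead of inside an unrelated later launch.
    torch.cuda.synchronize()

Equivalently, prefixing the repro command with CUDA_LAUNCH_BLOCKING=1 (or running it under compute-sanitizer) should point at the real offender.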

t-vi commented 2 days ago

Is this #583 ?

tfogal commented 2 days ago

Is this #583 ?

oops, yes, thank you! sorry about that

tfogal commented 2 days ago

Looking a bit deeper: technically this surfaces with a different error message than #583, but that might just be a timing difference. So there's a slim but non-zero chance we'll need to reopen this; since #583 only just closed, let's see whether it reappears in the next round.