CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS failed when training falcon-7b

Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source to source compiler for PyTorch. It enables using different hardware executors at once; across one or thousands of GPUs.

Apache License 2.0

1.06k stars 59 forks source link

CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS failed when training falcon-7b #583

Open mpatel31415 opened 2 weeks ago

mpatel31415 commented 2 weeks ago

🐛 Bug

There is an error when training falcon-7b model with thunder_cudnn.

To Reproduce

Start a docker container:

mkdir -p output
docker run --pull=always --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864  -v $PWD/output:/output -it INTERNAL_IMAGE:pjnl-20240607

Run inside container:

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name falcon-7b --compile thunder_cudnn --micro_batch_size 1

Expected behavior

We should be able to run the training.

Environment

As in the docker image, tested on H100.

Additional context

The same issue is visible for distributed training with DDP and FSDP (zero2).

wprazuch commented 1 week ago

For falcon-7b, the same applies for following configurations:

1 node, 8 gpus, both ddp and fsdp
1 node, 1 gpu

parthmannan commented 1 week ago

This is quite a fascinating error and this is what I have found so far. There is a unique combination of things needed to reproduce failure -

SDPA needs to be executed by cuDNN The executor options thunder, eager etc. all work just fine.
RoPE needs to be executed by nvFuser executor. The executor thunder_inductor_cat_cudnn does not fail. This option only differs from thunder_cudnn in the execution of the RoPE module and some other decomposed ops.
Only specific configurations fail. If I change the self.config.head_size=128 - thunder_cudnn Fails with the same error. Others can run. If I change the self.config.head_size=128 and self.config.n_embd=4096 - CUBLAS_STATUS_NOT_SUPPORTED
```
An error occurred: RuntimeError – CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
```
Again, thunder and thunder_inductor_cat_cudnn works but thunder_cudnn is the only one throwing this not supported error. If I change the self.config.head_size=128 and self.config.n_embd=4096 and self.config.block_size = 4096 - This now works for thunder_cudnn

So nvFuser executing RoPE + cuDNN executing SDPA on very specific shapes causes this error. There is also a CUBLAS_STATUS_NOT_SUPPORTED error which does not exist with torch.compile executing RoPE. This should be changing none of the GEMM ops where CUBLAS is used but somehow it does. The Thunder trace does not show me the strides of the tensors so I am not sure if somehow that's the difference between nvFuser execution of RoPE vs Torch.Compile execution of RoPE.

For further debugging, I have two questions from experts.

Can we generate a trace for a failing Thunder run? If I cannot finish one iteration, I can't use last_traces
Can we directly execute a Thunder generated trace so that I can step through each op one by one and evaluate the strides of the produced tensors?

@IvanYashchuk @t-vi - Could you help answer these questions? And suggest if there are any other ways to look into this behavior? Thanks!

crcrpar commented 1 week ago

fwiw this is reproduced with 20240615. Generated traces right before execution are https://gist.github.com/crcrpar/e1aef85af9d49bca120e48a17f1f801d, this time I just added a few lines for this purpose:

diff --git a/thunder/__init__.py b/thunder/__init__.py
index f5aae029..cab9617b 100644
--- a/thunder/__init__.py
+++ b/thunder/__init__.py
@@ -609,6 +609,14 @@ def jit(
                     backward_trc = transform(backward_trc, executors_list=cd.executors_list)
                     backward_traces.append(backward_trc)

+            import torch.distributed as dist
+            if (not dist.is_initialized()) or dist.get_rank() == 0:
+                with open("./fwd_trace.py", "w") as f:
+                    f.write(str(computation_trc))
+                if backward_trc is not None:
+                    with open("./bwd_trace.py", "w") as f:
+                        f.write(str(backward_trc))
+
             comp = computation_trc.python_callable()

             if backward_trc is not None:

Can we generate a trace for a failing Thunder run? If I cannot finish one iteration, I can't use last_traces

https://github.com/Lightning-AI/lightning-thunder/blob/21a222b180009616a4cc48176958b4506894a330/thunder/__init__.py#L192 added by https://github.com/Lightning-AI/lightning-thunder/commit/21adb7d882517370021d38f94f1e1481fbd5d18d could be used to see a trace. We could edit the traces dumped into the specified path. Also, (ab)using https://github.com/Lightning-AI/lightning-thunder/blob/21a222b180009616a4cc48176958b4506894a330/thunder/core/transforms.py#L468 or add_transforms to just save the input traces could be a way.

t-vi commented 1 week ago

@parthmannan @crcrpar @wprazuch @mpatel31415 thank you for looking into it.

Can we generate a trace for a failing Thunder run? If I cannot finish one iteration, I can't use last_traces

While we currently don't have a thing for getting last_traces if the error happens during construction of the trace (and transformations), we should be getting them when the computation itself fails (which I would the error in the title to be about):

def fn(a, b):
     return torch.nn.functional.cross_entropy(a, b)
jfn = fn(a, b)
a = torch.randn(3, 4)
b = torch.full((3,), 5)  # out of range
jfn(a, b)  # gives index error

and then

thunder.last_traces(jfn)

works for me. I'd certainly be happy if we fix having the record up to the failing bits when the transform fails if the transforms fail.

Can we directly execute a Thunder generated trace so that I can step through each op one by one and evaluate the strides of the produced tensors?

You should be able to copy-paste the trace and run it with inputs:

thunder.compile_data(jfn).get_computation_and_inputs(...) returns you both the traces (as fields in a class) and the inputs to feed to them (it will run the prologue, but not try to execute the computation that it returns).

mruberry commented 1 week ago

triage review —

it seems like last traces should be accessible even if there is a failure (thunder needed to construct a program to begin execution)
it would be interesting to consider more tools for debugging
we're curious what @parthmannan has to say

parthmannan commented 1 week ago

Yes, this was a slip up on my end. I can actually execute last_traces if I catch the error in a try-except block inside the benchmark so that's possible and has helped me confirm that the failing and passing traces are equivalent apart from the unique things I listed above.

I am trying to use thunder.compile_data(jfn).get_computation_and_inputs(...) mentioned by @t-vi above.

And I have

cache_rec, i_, _ = thunder.compile_data(self.model).get_computation_and_inputs(input_ids)
computation_trace = cache_rec.computation_traces[0]
my_exec_func = computation_trace.python_callable()
my_exec_func(*i_)

*** NotImplementedError: Attempting to execute symbol getitem outside of a tracing context, which is not supported.

UPDATE: Looks like cache_rec.computation_fn is what I needed.

parthmannan commented 1 week ago

Thanks for the guidance @crcrpar @t-vi @mruberry I have narrowed down the issue and it matches the unique characteristics I pointed earlier but I am not really sure how to propose a fix. This is the execution line that causes the result to be a tensor that cannot be accessed correctly in Memory.

(t100, t101, t102, t103) = cudnn_sdpa_fwd(t96, t99, t51, None, 0.0, True, scale=0.08838834764831843)

Once this is executed, the memory gets corrupted and we start seeing illegal address error. Now, this does not happen when we use torch_compile_cat_ex executor for RoPE/Concats even though the cuDNN execution is the same. So what is the difference?

#Failing one
(Pdb) p t99.shape
torch.Size([1, 64, 2048, 128])
(Pdb) p t99.stride()
(262144, 0, 128, 1)

#Working one
(Pdb) p t99.shape
torch.Size([1, 64, 2048, 128])
(Pdb) p t99.stride()
(16777216, 262144, 128, 1)

t99 stride is the only difference. TorchCompile executor seems to have made the tensor contiguous in memory whereas nvFuser does not do that. The difference also lies in the fact that TorchCompile executor takes in the slice operations that generate the tensors prior to the concat operation that creates t99 whereas nvFuser region does not take in the slice operations.

Where does the fix lie? Should nvFuser own the slice operation and generate a contiguous tensor? Should cuDNN be able to handle non-contiguous tensor for the input? I will post a reproducible error script for cuDNN SDPA later today. Here are the failing and passing traces

@kevinstephano @vedaanta @tfogal

tfogal commented 1 week ago

Thank you so much Parth! This is excellent.

Where does the fix lie? Should nvFuser own the slice operation and generate a contiguous tensor? Should cuDNN be able to handle non-contiguous tensor for the input?

Backends are under no obligation to produce any particular striding. On the flip side, this means that backends must accept all possible strides. If non-contiguous is not an option for cuDNN, it can invoke contiguous() and thereby guarantee that the tensor is contiguous (of course at a perf hit).

I think we'll need to revisit this someday (i.e. I believe we will want some way to algorithmically make decisions about how we'll stride), but that's the contract today.

As such, assigning to @vedaanta.

vedaanta commented 1 week ago

Thanks for the investigation folks.

A stride of zero can definitely cause issues somewhere in the cudnn stack. I will debug this today.

Should cuDNN be able to handle non-contiguous tensor for the input? it can invoke contiguous() and thereby guarantee that the tensor is contiguous (of course at a perf hit).

yes, cudnnex today does handle non-contiguous tensors. The only requirement being enforced is that the innermost stride has to be 1. And if that is not the case, the executor calls contiguous() on the tensor. Link to code

vedaanta commented 1 week ago

Okay, after a few fixes to correctly propagate stride of zero, cudnnex runs fine.

Running with compute-sanitizer no longer shows any errors. compute-sanitizer --target-processes=all python /opt/pytorch/lightning-thunder/thunder/benchmarks/benc hmark_litgpt.py --model_name falcon-7b --compile thunder_cudnn --micro_batch_size 1

I am able to see the same convergence with cudnn as with other non-cudnn executors.

iter 44: loss 4.6875, iter time: 336.60ms, t: 2048
Model name: falcon-7b
Seq Length: 2048
Micro BS: 1
Global BS: 1
Number of Layers: 32
Number of parameters: 7.22B
Distributed Mode: none
Compiler: thunder_cudnn
Average iter time: 335.82 ms
Memory used: 58.50 GB
Tokens/s: 6094.44
Tokens/s/GPU: 6094.44
TFLOP/s: 274.87

The fixes needed are:

In thunder's cudnn, there is a bug where NHWC input strides are always passed down to cudnn. This disregards non-contiguous tensors that users might provide. (Will fix it in thunder soon)
In cudnn-frontend, there is a check, deep in the code, which blocks stride of 0. (Can be fixed in the next frontend release. By ~6/28)
In cudnn-backend, due to TMA requirements, stride 0 for embedding dim needs to be blocked. (Can be blocked in the next frontend release, so that thunder doesn't need custom code here.)

tfogal commented 1 week ago

Okay, after a few fixes ... cudnnex runs fine.

You are a hero, @vedaanta. Thanks for jumping on this so quickly!