Open mpatel31415 opened 2 weeks ago
For falcon-7b, the same applies for following configurations:
This is quite a fascinating error, and this is what I have found so far. A unique combination of conditions is needed to reproduce the failure:
- SDPA needs to be executed by cuDNN. The executor options thunder, eager, etc. all work just fine.
- RoPE needs to be executed by the nvFuser executor. The executor thunder_inductor_cat_cudnn does not fail. This option differs from thunder_cudnn only in the execution of the RoPE module and some other decomposed ops.
- Only specific configurations fail.
  - If I change self.config.head_size=128: thunder_cudnn fails with the same error. Others can run.
  - If I change self.config.head_size=128 and self.config.n_embd=4096: CUBLAS_STATUS_NOT_SUPPORTED
    An error occurred: RuntimeError – CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
    Again, thunder and thunder_inductor_cat_cudnn work, but thunder_cudnn is the only one throwing this not-supported error.
  - If I change self.config.head_size=128 and self.config.n_embd=4096 and self.config.block_size = 4096: this now works for thunder_cudnn.
So nvFuser executing RoPE + cuDNN executing SDPA on very specific shapes causes this error. There is also a CUBLAS_STATUS_NOT_SUPPORTED error, which does not occur with torch.compile executing RoPE. Switching the RoPE executor should not change any of the GEMM ops where cuBLAS is used, but somehow it does.
The Thunder trace does not show me the strides of the tensors, so I am not sure whether that is the difference between nvFuser execution of RoPE and torch.compile execution of RoPE.
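For further debugging, I have two questions for the experts:
- Can we generate a trace for a failing Thunder run? If I cannot finish one iteration, I can't use last_traces
- Can we directly execute a Thunder generated trace so that I can step through each op one by one and evaluate the strides of the produced tensors?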
@IvanYashchuk @t-vi - Could you help answer these questions? And suggest if there are any other ways to look into this behavior? Thanks!
fwiw this is reproduced with 20240615. Generated traces right before execution are https://gist.github.com/crcrpar/e1aef85af9d49bca120e48a17f1f801d, this time I just added a few lines for this purpose:
diff --git a/thunder/__init__.py b/thunder/__init__.py
index f5aae029..cab9617b 100644
--- a/thunder/__init__.py
+++ b/thunder/__init__.py
@@ -609,6 +609,14 @@ def jit(
                     backward_trc = transform(backward_trc, executors_list=cd.executors_list)
                     backward_traces.append(backward_trc)
+            import torch.distributed as dist
+            if (not dist.is_initialized()) or dist.get_rank() == 0:
+                with open("./fwd_trace.py", "w") as f:
+                    f.write(str(computation_trc))
+                if backward_trc is not None:
+                    with open("./bwd_trace.py", "w") as f:
+                        f.write(str(backward_trc))
+
             comp = computation_trc.python_callable()
             if backward_trc is not None:
- Can we generate a trace for a failing Thunder run? If I cannot finish one iteration, I can't use last_traces
https://github.com/Lightning-AI/lightning-thunder/blob/21a222b180009616a4cc48176958b4506894a330/thunder/__init__.py#L192, added by https://github.com/Lightning-AI/lightning-thunder/commit/21adb7d882517370021d38f94f1e1481fbd5d18d, could be used to see a trace. We could edit the traces dumped into the specified path.
Also, (ab)using https://github.com/Lightning-AI/lightning-thunder/blob/21a222b180009616a4cc48176958b4506894a330/thunder/core/transforms.py#L468 or add_transforms to just save the input traces could be a way.
@parthmannan @crcrpar @wprazuch @mpatel31415 thank you for looking into it.
- Can we generate a trace for a failing Thunder run? If I cannot finish one iteration, I can't use last_traces
While we currently don't have a way to get last_traces if the error happens during construction of the trace (and transformations), we should be getting them when the computation itself fails (which I would expect the error in the title to be about):
import torch
import thunder

def fn(a, b):
    return torch.nn.functional.cross_entropy(a, b)

jfn = thunder.jit(fn)  # compile fn with Thunder
a = torch.randn(3, 4)
b = torch.full((3,), 5)  # out of range: only 4 classes exist
jfn(a, b)  # gives index error
and then thunder.last_traces(jfn) works for me. I'd certainly be happy if we also kept the record up to the failing bits when a transform fails.
- Can we directly execute a Thunder generated trace so that I can step through each op one by one and evaluate the strides of the produced tensors?
You should be able to copy-paste the trace and run it with inputs: thunder.compile_data(jfn).get_computation_and_inputs(...) returns both the traces (as fields in a class) and the inputs to feed to them (it will run the prologue, but not try to execute the computation that it returns).
triage review —
Yes, this was a slip-up on my end. I can actually call last_traces if I catch the error in a try-except block inside the benchmark, so that's possible, and it has helped me confirm that the failing and passing traces are equivalent apart from the unique things I listed above.
I am trying to use thunder.compile_data(jfn).get_computation_and_inputs(...) mentioned by @t-vi above.
And I have
cache_rec, i_, _ = thunder.compile_data(self.model).get_computation_and_inputs(input_ids)
computation_trace = cache_rec.computation_traces[0]
my_exec_func = computation_trace.python_callable()
my_exec_func(*i_)
*** NotImplementedError: Attempting to execute symbol getitem outside of a tracing context, which is not supported.
UPDATE: Looks like cache_rec.computation_fn is what I needed.
Thanks for the guidance @crcrpar @t-vi @mruberry. I have narrowed down the issue, and it matches the unique characteristics I pointed out earlier, but I am not really sure how to propose a fix. This is the line in the execution trace that causes the result to be a tensor that cannot be accessed correctly in memory.
(t100, t101, t102, t103) = cudnn_sdpa_fwd(t96, t99, t51, None, 0.0, True, scale=0.08838834764831843)
Once this is executed, memory gets corrupted and we start seeing illegal address errors. Now, this does not happen when we use the torch_compile_cat_ex executor for RoPE/concats, even though the cuDNN execution is the same. So what is the difference?
#Failing one
(Pdb) p t99.shape
torch.Size([1, 64, 2048, 128])
(Pdb) p t99.stride()
(262144, 0, 128, 1)
#Working one
(Pdb) p t99.shape
torch.Size([1, 64, 2048, 128])
(Pdb) p t99.stride()
(16777216, 262144, 128, 1)
The t99 stride is the only difference. The TorchCompile executor seems to have made the tensor contiguous in memory, whereas nvFuser does not do that. The difference also lies in the fact that the TorchCompile executor takes in the slice operations that generate the tensors prior to the concat operation that creates t99, whereas the nvFuser region does not take in the slice operations.
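The two layouts can be reproduced outside Thunder. The following is a minimal sketch in plain PyTorch, assuming the zero stride comes from broadcasting a single key/value head across 64 heads with expand(). The thread does not confirm which op nvFuser actually emits, and the variable names are hypothetical.

```python
import torch

# Assumption: one head of shape [1, 1, 2048, 128] is broadcast to 64 heads.
kv_single_head = torch.zeros(1, 1, 2048, 128, dtype=torch.bfloat16)

# expand() creates a view: no data is copied and the broadcast dimension gets stride 0.
expanded = kv_single_head.expand(1, 64, 2048, 128)
print(expanded.shape, expanded.stride(), expanded.is_contiguous())

# contiguous() materializes all 64 copies, giving a dense layout.
dense = expanded.contiguous()
print(dense.shape, dense.stride(), dense.is_contiguous())
```

Both tensors compare equal element-wise; only the memory layout differs, which is the same situation as in the two pdb sessions.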
Where does the fix lie? Should nvFuser own the slice operation and generate a contiguous tensor? Should cuDNN be able to handle non-contiguous tensor for the input? I will post a reproducible error script for cuDNN SDPA later today. Here are the failing and passing traces
@kevinstephano @vedaanta @tfogal
Thank you so much Parth! This is excellent.
Where does the fix lie? Should nvFuser own the slice operation and generate a contiguous tensor? Should cuDNN be able to handle non-contiguous tensor for the input?
Backends are under no obligation to produce any particular striding.
On the flip side, this means that backends must accept all possible strides.
If non-contiguous is not an option for cuDNN, it can invoke contiguous() and thereby guarantee that the tensor is contiguous (of course at a perf hit).
I think we'll need to revisit this someday (i.e. I believe we will want some way to algorithmically make decisions about how we'll stride), but that's the contract today.
As such, assigning to @vedaanta.
Thanks for the investigation folks.
A stride of zero can definitely cause issues somewhere in the cudnn stack. I will debug this today.
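To illustrate why a zero stride is easy to mishandle, the sketch below computes how many storage elements a strided view actually spans. It is plain Python with a hypothetical helper name, and it shows only one plausible failure mode (code that assumes the buffer holds as many elements as the shape implies). The thread does not state what the actual bug in the cuDNN stack was.

```python
from math import prod

def storage_extent(shape, strides):
    """Storage elements spanned by a non-empty view with non-negative strides."""
    return 1 + sum((size - 1) * stride for size, stride in zip(shape, strides))

shape = (1, 64, 2048, 128)
failing = storage_extent(shape, (262144, 0, 128, 1))
working = storage_extent(shape, (16777216, 262144, 128, 1))

print(prod(shape))  # logical element count implied by the shape
print(failing)      # elements actually backing the zero-stride view
print(working)      # elements backing the contiguous tensor
```

The zero-stride view is backed by 64 times fewer elements than its shape suggests, so any code that sizes or indexes the buffer from the shape alone would reach past the allocation.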
Should cuDNN be able to handle non-contiguous tensor for the input? it can invoke contiguous() and thereby guarantee that the tensor is contiguous (of course at a perf hit).
Yes, cudnnex today does handle non-contiguous tensors.
The only requirement being enforced is that the innermost stride has to be 1. If that is not the case, the executor calls contiguous() on the tensor. Link to code
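The rule described here can be sketched as follows. This is a minimal illustration in plain PyTorch, not the actual cudnnex code (the linked source is not reproduced here), and the helper name ensure_unit_inner_stride is hypothetical.

```python
import torch

def ensure_unit_inner_stride(t: torch.Tensor) -> torch.Tensor:
    """Return t unchanged if its innermost stride is 1, else a contiguous copy."""
    if t.stride(-1) == 1:
        return t
    return t.contiguous()

# A broadcast view (stride 0 on the head dimension) already has inner stride 1.
expanded = torch.zeros(1, 1, 16, 8).expand(1, 4, 16, 8)
same = ensure_unit_inner_stride(expanded)

# A transposed view has an inner stride other than 1 and gets copied.
transposed = torch.zeros(1, 4, 8, 16).transpose(-1, -2)
fixed = ensure_unit_inner_stride(transposed)
```

Under this rule a zero stride on an outer dimension passes through untouched, so the backend still has to handle it correctly.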
Okay, after a few fixes to correctly propagate stride of zero, cudnnex runs fine.
Running with compute-sanitizer no longer shows any errors.
compute-sanitizer --target-processes=all python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name falcon-7b --compile thunder_cudnn --micro_batch_size 1
I am able to see the same convergence with cudnn as with other non-cudnn executors.
iter 44: loss 4.6875, iter time: 336.60ms, t: 2048
Model name: falcon-7b
Seq Length: 2048
Micro BS: 1
Global BS: 1
Number of Layers: 32
Number of parameters: 7.22B
Distributed Mode: none
Compiler: thunder_cudnn
Average iter time: 335.82 ms
Memory used: 58.50 GB
Tokens/s: 6094.44
Tokens/s/GPU: 6094.44
TFLOP/s: 274.87
The fixes needed are:
Okay, after a few fixes ... cudnnex runs fine.
You are a hero, @vedaanta. Thanks for jumping on this so quickly!
🐛 Bug
There is an error when training the falcon-7b model with thunder_cudnn.
To Reproduce
Start a docker container:
Run inside container:
Expected behavior
We should be able to run the training.
Environment
As in the docker image, tested on H100.
Additional context
The same issue is visible for distributed training with DDP and FSDP (zero2).