wprazuch opened this issue 2 weeks ago
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 599, in wrapper
[rank0]: outputs = fn(ctx, *args)
[rank0]: File "/opt/pytorch/lightning-thunder/thunder/executors/torch_autograd.py", line 96, in backward
[rank0]: grads = ctx.compiled_backward([saved_tensors_list, ctx.saved_other], args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
Looks like we're eventually asking Dynamo to do something that it cannot do, because of our autograd.
triage: is there something we can do to avoid tickling Dynamo, or do we need to report this upstream?
We're asking Dynamo to do something that it cannot do because of our generated backward trace:
File "thunder.backward_fn_333", line 462, in backward_fn
and the use of fullgraph=True (added in https://github.com/Lightning-AI/lightning-thunder/commit/e0ab64867a5be914d0548c195a3f850a76c8c397): https://github.com/Lightning-AI/lightning-thunder/blob/72e033a0e0dfe44d4770dec2399a9058971003ec/thunder/executors/torch_compile.py#L86
Setting fullgraph=False might fix this problem.
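Concretely, the one-off change would look like this (a sketch of the linked line in torch_compile.py; trace_callable is the callable Thunder builds from the trace):

```python
# thunder/executors/torch_compile.py (sketch): allow graph breaks, so Dynamo
# falls back to eager for the unsupported construct instead of raising.
compiled_func = torch.compile(trace_callable, fullgraph=False)
```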
@wprazuch, can I ask you to do a one-off test of this with fullgraph=False, as Ivan points out above?
(I don't know that this is the long-term solution, but it will allow us to have a more reasoned discussion about one.)
We can confirm that after the modification in torch_compile.py (compiled_func = torch.compile(trace_callable, fullgraph=False)) there is no error :)
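If we do end up reporting this upstream, torch._dynamo.explain can pinpoint where the breaks happen and why (a sketch; backward_fn here is a hypothetical stand-in for the generated backward callable):

```python
import torch

def backward_fn(x):  # hypothetical stand-in for thunder.backward_fn_333
    x = x * 2
    torch._dynamo.graph_break()  # stand-in for the unsupported construct
    return x + 1

# "explain" traces the function with Dynamo and reports every graph
# break and its reason, rather than raising or compiling silently.
explanation = torch._dynamo.explain(backward_fn)(torch.randn(4))
print(explanation.graph_break_count)
for reason in explanation.break_reasons:
    print(reason)
```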
Thanks Martyna, Wojciech!
triage review:
🐛 Bug
There is an unsupported-construct error when running models with the Thunder torch.compile (Inductor) executor for FSDP zero2/zero3:
To Reproduce
Steps to reproduce the behavior:
Run in the container:
Expected behavior
The model should run, or we should get an OOM error.
Environment
As in the Docker image
Additional context
We reproduced this for FSDP (1 and 2 nodes, 8 GPUs), zero2/zero3. The traceback is below: