cc'ing @tfogal @nvMelissa for visibility.
Relevant snippet from the full log:
[rank0]: File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 412, in interpret
[rank0]: return self._opcode_interpreter(inst, **interpreter_state)
[rank0]: File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 1227, in default_opcode_interpreter
[rank0]: return handler(inst, **interpreter_state)
[rank0]: File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 3704, in _call_function_ex_handler
[rank0]: return check_and_append(stack, _interpret_call(func, *args, **kwargs))
[rank0]: File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6357, in _interpret_call
[rank0]: rval = _call_dispatch(compilectx, runtimectx, fn, *args, **kwargs) # type: ignore
[rank0]: File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6518, in _call_dispatch
[rank0]: res = lookaside_fn(*args, **kwargs)
[rank0]: File "/opt/pytorch/lightning-thunder/thunder/core/jit_ext.py", line 640, in _general_jit_torch_autograd_function_apply_lookaside
[rank0]: _call_ctx=custom_fwd_bsyms[0]._call_ctx,
[rank0]: IndexError: list index out of range
Triage: go into _general_jit_torch_autograd_function_apply_lookaside and print out what we are processing in that lookaside so we can narrow this down.
On me to do that.
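For the record, one way to get that print without editing the installed package is to wrap the lookaside at runtime. This is only a sketch: the wrapper takes *args because the exact signature is not assumed here, and it only takes effect if the interpreter resolves the lookaside through the module attribute at call time; if the lookaside is registered by reference instead, add the print directly inside the function in thunder/core/jit_ext.py.

import functools
import thunder.core.jit_ext as jit_ext

_original_lookaside = jit_ext._general_jit_torch_autograd_function_apply_lookaside

@functools.wraps(_original_lookaside)
def _logging_lookaside(*args, **kwargs):
    # Log the positional arguments so the failing autograd.Function shows up
    # in the rank logs right before the IndexError is raised.
    print("[triage] autograd.Function apply lookaside args:", args)
    return _original_lookaside(*args, **kwargs)

jit_ext._general_jit_torch_autograd_function_apply_lookaside = _logging_lookaside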
Found after the triage meeting:
File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/autograd_function.py", line 749, in __call__
return ApplyTemplate.apply(*new_fwd_args)
not sure what this means for repro, though.
With the help of @tfogal (thank you), I installed NeVa in a Lightning Studio. Running his thunder.jit repro, I'm getting the same error in a Megatron tensor-parallel reduce function.
Note that I'm running this on a single GPU, so I suspect this might be related to the functions effectively being no-ops.
And here is a repro:
import torch, thunder


class Fn(torch.autograd.Function):
    # Identity forward/backward: the custom Function does no real work.
    @staticmethod
    def forward(self, x):
        return x

    @staticmethod
    def backward(self, grad_x):
        return grad_x


def fn(x):
    return Fn.apply(x)


a = torch.randn(2)
jfn = thunder.jit(fn)
ref = fn(a)
out = jfn(a)  # bug
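To probe the no-op hypothesis, one could compare against a variant whose forward performs an actual computation. This is only a sketch for experimentation (Scale and scale_fn are names introduced here), not a confirmed distinction:

import torch, thunder


class Scale(torch.autograd.Function):
    # Same shape as Fn above, but the forward does real work instead of
    # returning its input unchanged.
    @staticmethod
    def forward(ctx, x):
        return 2 * x

    @staticmethod
    def backward(ctx, grad_x):
        return 2 * grad_x


def scale_fn(x):
    return Scale.apply(x)


jscale = thunder.jit(scale_fn)
print(jscale(torch.randn(2)))  # does this variant trace without the IndexError?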
So I guess the @tfogal assignment was for the repro; taking him off there. I understand this is NeMo and NeVa, so I added tags and triage review in case we don't organically find someone to look into it (but @crcrpar, @kshitij12345, if you know anyone who wants to dive in, that is welcome, of course).
This has multiple layers:
- ... env during the backward generation. In fact, I think we might refactor the autograd lookaside a bit.
- I think we might just get rid of all the _...ctx assignments. (@crcrpar wdyt?)

What's _...ctx?
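For context on the question: _...ctx refers to fields like the _call_ctx that the lookaside copies from custom_fwd_bsyms[0] (see the traceback above). When the traced custom forward produces no bound symbols, as with the identity repro, that index is out of range. Purely as a sketch of a stopgap, not the actual fix, the lookup could be guarded inside the lookaside:

# Hypothetical guard inside _general_jit_torch_autograd_function_apply_lookaside:
# only read _call_ctx when the traced forward produced bound symbols.
call_ctx = custom_fwd_bsyms[0]._call_ctx if custom_fwd_bsyms else None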
While preparing the benchmark for eager and dynamo using the code from the fork https://github.com/tfogal/NeMo, I get errors in the dynamo case.
🐛 Bug
Seems like dynamo stopped working for the NeMo NeVa model. When compiled, it throws:
[rank0]: thunder.core.interpreter.InterpreterError: Encountered exception IndexError: list index out of range while tracing GraphModule(
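The exact compile invocation is not shown here; for reference, a typical way to send a module through dynamo with the Thunder backend looks roughly like this (a sketch assuming the ThunderCompiler entry point in thunder.dynamo, not necessarily the invocation used for NeVa):

import torch
from thunder.dynamo import ThunderCompiler

# Stand-in module in place of the NeVa model.
model = torch.nn.Linear(4, 4)
compiled_model = torch.compile(model, backend=ThunderCompiler())
out = compiled_model(torch.randn(2, 4))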
To Reproduce
Steps to reproduce the behavior:
https://github.com/tfogal/NeMo
Expected behavior
The pretraining should run smoothly.
Environment
As in the container
Additional context
Attaching the full log of the error: nemo_neva_error_dynamo_23_09_24.txt
Also, with the previous thunder version on Friday, I received a different error.
I'm providing the full error log for that as well, in case it helps: nemo_neva_error_dynamo_20_09_24.txt
cc @apaz-cli @tfogal