Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

[NeVa] thunder.core.interpreter.InterpreterError: Encountered exception IndexError: list index out of range while tracing GraphModule #1187

Closed wprazuch closed 1 month ago

wprazuch commented 1 month ago

While preparing the benchmark for eager and dynamo using the code from the fork https://github.com/tfogal/NeMo, I get errors in the dynamo case.

🐛 Bug

It seems dynamo stopped working for the NeMo NeVa model when it is compiled with:

model.model = torch.compile(backend=thunder_backend, dynamic=False)(model.model)

it throws: [rank0]: thunder.core.interpreter.InterpreterError: Encountered exception IndexError: list index out of range while tracing GraphModule(
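
For context, a torch.compile backend is a callable that receives each captured FX GraphModule (plus example inputs) and returns a callable. Below is a minimal sketch of what such a Thunder backend can look like; the fork's actual thunder_backend may differ, e.g. it may use thunder.dynamo.ThunderCompiler instead.

import torch
import thunder

def thunder_backend(gm: torch.fx.GraphModule, example_inputs):
    # Dynamo hands each captured FX GraphModule to the backend; returning
    # the thunder-jitted graph lets Thunder execute it.
    return thunder.jit(gm)

# Usage mirroring the line above, with a stand-in module:
module = torch.nn.Linear(4, 4)
compiled = torch.compile(backend=thunder_backend, dynamic=False)(module)
compiled(torch.randn(2, 4))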

To Reproduce

Steps to reproduce the behavior:

  1. Clone: https://github.com/tfogal/NeMo
  2. Use the latest lightning-thunder container
  3. Additionally, install:
    python3 -m pip install --no-deps huggingface-hub==0.23.2
    python3 -m pip install --no-deps transformers==4.40.2
    python3 -m pip install -e .
    python3 -m pip install git+https://github.com/NVIDIA/Megatron-LM.git@6dd3a1afa4e26d4d27e58d1e83aaa6ee6e36b477
  4. Execute:
    rm -f /tmp/graph*.log.txt
    export HYDRA_FULL_ERROR=1
    export THUNDER_ANNOTATE_TRACES=1
    export NEMO_THUNDER_NEVA=dynamo
    python3 \
    ./examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
      trainer.precision=bf16-mixed \
      model.megatron_amp_O2=True \
      model.mcore_gpt=False \
      trainer.num_nodes=1 \
      trainer.devices=1 \
      trainer.val_check_interval=10 \
      trainer.limit_val_batches=5 \
      trainer.log_every_n_steps=1 \
      ++exp_manager.max_time_per_run=00:00:03:00 \
      trainer.max_steps=20 \
      model.micro_batch_size=2 \
      model.global_batch_size=4 \
      model.tensor_model_parallel_size=1 \
      model.pipeline_model_parallel_size=1 \
      exp_manager.create_checkpoint_callback=False \
      model.data.data_path=./data/multimodal/tiny-neva/dummy.json \
      model.data.image_folder=./data/multimodal/tiny-neva/images \
      model.tokenizer.library=sentencepiece \
      model.tokenizer.model=./data/multimodal/tiny-neva/tokenizer_add_special.model \
      model.num_layers=2 \
      model.hidden_size=5120 \
      model.ffn_hidden_size=13824 \
      model.num_attention_heads=40 \
      model.normalization=rmsnorm \
      model.data.num_workers=0 \
      model.data.conv_template=llama_2 \
      model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14 \
      model.mm_cfg.llm.from_pretrained=null \
      model.use_flash_attention=false \
      exp_manager.exp_dir=./nemo_neva

Expected behavior

The pretraining should run smoothly.

Environment

As in the container

Additional context

Attaching the full log of the error: nemo_neva_error_dynamo_23_09_24.txt

Also, with the previous thunder version on Friday, I received a different error:

[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 5920, in impl
[rank0]:     return tos1.__setitem__(tos, tos2)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6407, in _impl
[rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 1277, in wrapping_wrapper
[rank0]:     res = ufn(*uargs, **ukwargs)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/jit_ext.py", line 387, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/proxies.py", line 1543, in __getattr__
[rank0]:     baseutils.check(method_or_value is not None, lambda: f"Unknown attribute {attr}", exception_type=AttributeError)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/baseutils.py", line 103, in check
[rank0]:     raise exception_type(s())
[rank0]: AttributeError: Unknown attribute __setitem__. Did you mean: '_return_value'?
Epoch 0: :   0%|          | 0/50 [00:45<?]  

Providing the full error log for that as well - maybe it will help. nemo_neva_error_dynamo_20_09_24.txt
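
For context, that AttributeError comes from the interpreter's subscript-store handler calling __setitem__ on a Thunder proxy that doesn't expose it. A minimal sketch of the kind of code that exercises this path, assuming the failure is triggered by an in-place item assignment on a proxied value (an assumption, not a confirmed reduction of the NeVa failure):

import torch
import thunder

def fn(x):
    # In-place item assignment is dispatched through the object's __setitem__
    # inside the interpreter, which is where the traceback above points.
    x[0] = 0.0
    return x

jfn = thunder.jit(fn)
jfn(torch.randn(2))  # may raise a similar "Unknown attribute __setitem__" error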

cc @apaz-cli @tfogal

wprazuch commented 1 month ago

cc'ing @tfogal @nvMelissa for visibility

tfogal commented 1 month ago

Relevant snippet from the full log:

[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 412, in interpret
[rank0]:     return self._opcode_interpreter(inst, **interpreter_state)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 1227, in default_opcode_interpreter
[rank0]:     return handler(inst, **interpreter_state)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 3704, in _call_function_ex_handler
[rank0]:     return check_and_append(stack, _interpret_call(func, *args, **kwargs))
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6357, in _interpret_call
[rank0]:     rval = _call_dispatch(compilectx, runtimectx, fn, *args, **kwargs)  # type: ignore
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6518, in _call_dispatch
[rank0]:     res = lookaside_fn(*args, **kwargs)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/jit_ext.py", line 640, in _general_jit_torch_autograd_function_apply_lookaside
[rank0]:     _call_ctx=custom_fwd_bsyms[0]._call_ctx,
[rank0]: IndexError: list index out of range

tfogal commented 1 month ago

triage: go into _general_jit_torch_autograd_function_apply_lookaside and print out what lookaside we're processing so we can narrow this down. on me to do that

t-vi commented 1 month ago

Found after the triage meeting:

  File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/autograd_function.py", line 749, in __call__
     return ApplyTemplate.apply(*new_fwd_args)

not sure what this means for the repro, though.

t-vi commented 1 month ago

With the help of @tfogal (thank you), I installed NeVa in a Lightning Studio. Running his thunder.jit repro, I get the same error in a Megatron tensor-parallel reduce function:

https://github.com/NVIDIA/Megatron-LM/blob/45bf4c1821c2a87ca02e8a0377d61097bac92d07/megatron/core/tensor_parallel/mappings.py#L257-L273

Note that I'm running this on a single GPU, so I suspect this might be related to the functions actually being no-ops.
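
For reference, the functions in the linked mappings.py follow roughly this pattern (an illustrative sketch with made-up names, not the exact Megatron code): when there is nothing to reduce, the custom autograd.Function's forward returns its input unchanged.

import torch
import torch.distributed as dist

class ReduceFromParallelRegion(torch.autograd.Function):
    # Illustrative stand-in for the Megatron tensor-parallel reduce mapping.

    @staticmethod
    def forward(ctx, input_):
        # On a single GPU (world size 1) there is nothing to all-reduce,
        # so the forward degenerates to an identity.
        if not dist.is_initialized() or dist.get_world_size() == 1:
            return input_
        dist.all_reduce(input_)
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient mapping of this all-reduce is the identity.
        return grad_output

On one GPU the forward records no collective ops, which is exactly what the minimal identity-Function repro in the next comment distills.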

t-vi commented 1 month ago

And here is a repro:

import torch, thunder

class Fn(torch.autograd.Function):
    @staticmethod
    def forward(self, x):
        # Identity forward: records no tensor operations.
        return x

    @staticmethod
    def backward(self, grad_x):
        return grad_x

def fn(x):
    return Fn.apply(x)

a = torch.randn(2)
jfn = thunder.jit(fn)

ref = fn(a)
out = jfn(a)  # bug: IndexError: list index out of range

t-vi commented 1 month ago

So I guess the @tfogal assignment was for the repro; taking him off there. I understand this is NeMo and NeVa, so I added those tags, plus triage review in case we don't organically find someone to look into it (but @crcrpar, @kshitij12345, if you know anyone who wants to dive in, they are welcome, of course).

t-vi commented 1 month ago

This has multiple layers:

t-vi commented 1 month ago

In fact, I think we might refactor the autograd lookaside a bit.

crcrpar commented 1 month ago
  • I think we might just get rid of all the _...ctx assignments. (@crcrpar wdyt?)

What's _...ctx?