Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

[NeVa] thunder.core.interpreter.InterpreterError: Encountered exception IndexError: list index out of range while tracing GraphModule #1187

Closed wprazuch closed 1 month ago

wprazuch commented 1 month ago

While preparing the benchmark for eager and dynamo using the code from the fork https://github.com/tfogal/NeMo, I get errors in the dynamo case.

🐛 Bug

It seems dynamo stopped working for the NeMo NeVa model when it is compiled with:

model.model = torch.compile(backend=thunder_backend, dynamic=False)(model.model)

it throws: [rank0]: thunder.core.interpreter.InterpreterError: Encountered exception IndexError: list index out of range while tracing GraphModule(
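
For context, a torch.compile backend is a callable that receives each captured FX GraphModule (plus example inputs) and returns a callable. Below is a minimal sketch of what such a Thunder backend can look like; the fork's actual thunder_backend may differ, e.g. it may use thunder.dynamo.ThunderCompiler instead.

import torch
import thunder

def thunder_backend(gm: torch.fx.GraphModule, example_inputs):
    # Dynamo hands each captured FX GraphModule to the backend; returning
    # the thunder-jitted graph lets Thunder execute it.
    return thunder.jit(gm)

# Usage mirroring the line above, with a stand-in module:
module = torch.nn.Linear(4, 4)
compiled = torch.compile(backend=thunder_backend, dynamic=False)(module)
compiled(torch.randn(2, 4))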

To Reproduce

Steps to reproduce the behavior:

  1. Clone: https://github.com/tfogal/NeMo
  2. Use the latest lightning-thunder container
  3. Additionally, install:
    python3 -m pip install --no-deps huggingface-hub==0.23.2
    python3 -m pip install --no-deps transformers==4.40.2
    python3 -m pip install -e .
    python3 -m pip install git+https://github.com/NVIDIA/Megatron-LM.git@6dd3a1afa4e26d4d27e58d1e83aaa6ee6e36b477
  4. Execute:
    rm -f /tmp/graph*.log.txt
    export HYDRA_FULL_ERROR=1
    export THUNDER_ANNOTATE_TRACES=1
    export NEMO_THUNDER_NEVA=dynamo
    python3 \
    ./examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
      trainer.precision=bf16-mixed \
      model.megatron_amp_O2=True \
      model.mcore_gpt=False \
      trainer.num_nodes=1 \
      trainer.devices=1 \
      trainer.val_check_interval=10 \
      trainer.limit_val_batches=5 \
      trainer.log_every_n_steps=1 \
      ++exp_manager.max_time_per_run=00:00:03:00 \
      trainer.max_steps=20 \
      model.micro_batch_size=2 \
      model.global_batch_size=4 \
      model.tensor_model_parallel_size=1 \
      model.pipeline_model_parallel_size=1 \
      exp_manager.create_checkpoint_callback=False \
      model.data.data_path=./data/multimodal/tiny-neva/dummy.json \
      model.data.image_folder=./data/multimodal/tiny-neva/images \
      model.tokenizer.library=sentencepiece \
      model.tokenizer.model=./data/multimodal/tiny-neva/tokenizer_add_special.model \
      model.num_layers=2 \
      model.hidden_size=5120 \
      model.ffn_hidden_size=13824 \
      model.num_attention_heads=40 \
      model.normalization=rmsnorm \
      model.data.num_workers=0 \
      model.data.conv_template=llama_2 \
      model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14 \
      model.mm_cfg.llm.from_pretrained=null \
      model.use_flash_attention=false \
      exp_manager.exp_dir=./nemo_neva

Expected behavior

The pretraining should run smoothly.

Environment

As in the container

Additional context

Attaching the full log of the error: nemo_neva_error_dynamo_23_09_24.txt

Also, with the previous thunder version on Friday, I received a different error:

[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 5920, in impl
[rank0]:     return tos1.__setitem__(tos, tos2)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6407, in _impl
[rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 1277, in wrapping_wrapper
[rank0]:     res = ufn(*uargs, **ukwargs)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/jit_ext.py", line 387, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/proxies.py", line 1543, in __getattr__
[rank0]:     baseutils.check(method_or_value is not None, lambda: f"Unknown attribute {attr}", exception_type=AttributeError)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/baseutils.py", line 103, in check
[rank0]:     raise exception_type(s())
[rank0]: AttributeError: Unknown attribute __setitem__. Did you mean: '_return_value'?
Epoch 0: :   0%|          | 0/50 [00:45<?]  

Providing the full error log for that as well - maybe it will help. nemo_neva_error_dynamo_20_09_24.txt
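
For context, that AttributeError comes from the interpreter's subscript-store handler calling __setitem__ on a Thunder proxy that doesn't expose it. A minimal sketch of the kind of code that exercises this path, assuming the failure is triggered by an in-place item assignment on a proxied value (an assumption, not a confirmed reduction of the NeVa failure):

import torch
import thunder

def fn(x):
    # In-place item assignment is dispatched through the object's __setitem__
    # inside the interpreter, which is where the traceback above points.
    x[0] = 0.0
    return x

jfn = thunder.jit(fn)
jfn(torch.randn(2))  # may raise a similar "Unknown attribute __setitem__" error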

cc @apaz-cli @tfogal

wprazuch commented 1 month ago

cc'ing @tfogal @nvMelissa for visibility

tfogal commented 1 month ago

Relevant snippet from the full log:

[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 412, in interpret
[rank0]:     return self._opcode_interpreter(inst, **interpreter_state)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 1227, in default_opcode_interpreter
[rank0]:     return handler(inst, **interpreter_state)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 3704, in _call_function_ex_handler
[rank0]:     return check_and_append(stack, _interpret_call(func, *args, **kwargs))
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6357, in _interpret_call
[rank0]:     rval = _call_dispatch(compilectx, runtimectx, fn, *args, **kwargs)  # type: ignore
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6518, in _call_dispatch
[rank0]:     res = lookaside_fn(*args, **kwargs)
[rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/jit_ext.py", line 640, in _general_jit_torch_autograd_function_apply_lookaside
[rank0]:     _call_ctx=custom_fwd_bsyms[0]._call_ctx,
[rank0]: IndexError: list index out of range

tfogal commented 1 month ago

triage: go into _general_jit_torch_autograd_function_apply_lookaside and print out what lookaside we're processing so we can narrow this down. on me to do that

t-vi commented 1 month ago

Found after the triage meeting:

  File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/autograd_function.py", line 749, in __call__
     return ApplyTemplate.apply(*new_fwd_args)

not sure what this means for the repro, though.

t-vi commented 1 month ago

With the help of @tfogal (thank you), I installed NeVa in a Lightning Studio. Running his thunder.jit repro, I get the same error in a Megatron tensor-parallel reduce function:

https://github.com/NVIDIA/Megatron-LM/blob/45bf4c1821c2a87ca02e8a0377d61097bac92d07/megatron/core/tensor_parallel/mappings.py#L257-L273

Note that I'm running this on a single GPU, so I suspect this might be related to the functions actually being no-ops.
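
For reference, the functions in the linked mappings.py follow roughly this pattern (an illustrative sketch with made-up names, not the exact Megatron code): when there is nothing to reduce, the custom autograd.Function's forward returns its input unchanged.

import torch
import torch.distributed as dist

class ReduceFromParallelRegion(torch.autograd.Function):
    # Illustrative stand-in for the Megatron tensor-parallel reduce mapping.

    @staticmethod
    def forward(ctx, input_):
        # On a single GPU (world size 1) there is nothing to all-reduce,
        # so the forward degenerates to an identity.
        if not dist.is_initialized() or dist.get_world_size() == 1:
            return input_
        dist.all_reduce(input_)
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient mapping of this all-reduce is the identity.
        return grad_output

On one GPU the forward records no collective ops, which is exactly what the minimal identity-Function repro in the next comment distills.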

t-vi commented 1 month ago

And here is a repro:

import torch, thunder

class Fn(torch.autograd.Function):
    @staticmethod
    def forward(self, x):
        # Identity forward: records no tensor operations.
        return x

    @staticmethod
    def backward(self, grad_x):
        return grad_x

def fn(x):
    return Fn.apply(x)

a = torch.randn(2)
jfn = thunder.jit(fn)

ref = fn(a)
out = jfn(a)  # bug: IndexError: list index out of range

t-vi commented 1 month ago

So I guess the @tfogal assignment was for the repro; taking him off there. I understand this is NeMo and NeVa, so I added those tags, plus triage review in case we don't organically find someone to look into it (but @crcrpar, @kshitij12345, if you know anyone who wants to dive in, they are welcome, of course).

t-vi commented 1 month ago

This has multiple layers:

t-vi commented 1 month ago

In fact, I think we might refactor the autograd lookaside a bit.

crcrpar commented 1 month ago
  • I think we might just get rid of all the _...ctx assignments. (@crcrpar wdyt?)

What's _...ctx?