Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

TypeError: Missing a required argument with thunder.jit in NeMo SD ResBlock #548

Open athitten opened 1 month ago

athitten commented 1 month ago

🐛 Bug

Adding thunder.jit to ResBlock in the UNet stage of NeMo SD raises an error. From looking at the ResBlock call in the NeMo code, the class is called correctly with the right arguments. In spite of that, it is unclear why Thunder raises this error.

Encountered exception TypeError: missing a required argument: 'emb' while tracing ResBlock

Stack trace of the error can be found here: resblock_error.log

To Reproduce

Steps to reproduce the behavior:

  1. Pull the Docker image nvidia-internal-gitlab-host:port/athittenaman/container-images:pjnl-nemo (NeMo is installed in /opt/NeMo)

  2. Apply the git patch: resblock.patch

  3. Run Stable Diffusion with the command:

    python examples/multimodal/text_to_image/stable_diffusion/sd_train.py trainer.precision=16 trainer.num_nodes=1 trainer.devices=1 ++exp_manager.max_time_per_run=00:00:03:00 trainer.max_steps=20 model.micro_batch_size=1 model.global_batch_size=1 model.data.synthetic_data=True exp_manager.exp_dir=/workspace/TestData/multimodal/stable_diffusion_train model.inductor=False model.cond_stage_config._target_=nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder ++model.cond_stage_config.version=openai/clip-vit-large-patch14 ++model.cond_stage_config.max_length=77 ~model.cond_stage_config.restore_from_path ~model.cond_stage_config.freeze ~model.cond_stage_config.layer model.unet_config.from_pretrained=null model.first_stage_config.from_pretrained=null model.unet_config.use_flash_attention=False model.unet_config.attention_resolutions=\[1\] model.unet_config.channel_mult=\[1\]


cc @tfogal

athitten commented 1 month ago

The same error occurs when adding thunder.jit to the subsequent ResBlock here.

tfogal commented 1 month ago

Yikes, this is deep in the interpreter:

  File "/workspace/software/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_train.py", line 80, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
...
  File "/workspace/software/lightning-thunder/thunder/__init__.py", line 473, in get_computation_and_inputs
    jit_results: TraceResults = interpreter(
  File "/workspace/software/lightning-thunder/thunder/__init__.py", line 190, in _general_frontend
    return thunder_general_jit(fn, args, kwargs, sharp_edges=sharp_edges, record_history=record_history)
  File "/workspace/software/lightning-thunder/thunder/core/jit_ext.py", line 1529, in thunder_general_jit
    result = jfn(*args, **kwargs)
  File "/workspace/software/lightning-thunder/thunder/core/interpreter.py", line 6692, in fn_
    raise InterpreterError(msg) from e
thunder.core.interpreter.InterpreterError: Encountered exception TypeError: missing a required argument: 'emb' while tracing [snip]

Since it complains about emb, my first thought is that it's related to the embedding layers ("emb_layers"):

... while tracing ResBlock(
  (in_layers): Sequential(
    (0): GroupNorm(32, 320, eps=1e-05, affine=True)
    (1): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (h_upd): Identity()
  (x_upd): Identity()
  (emb_layers): Sequential(
    (0): SiLU()
    (1): Linear(in_features=1280, out_features=320, bias=True)
  )
  (out_layers): Sequential(
    (0): GroupNorm(32, 320, eps=1e-05, affine=True)
    (1): Dropout(p=0, inplace=False)
    (2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (skip_connection): Identity()
):

but neither of those ops takes an emb parameter.
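For context, emb is a positional argument of ResBlock.forward itself rather than of any of those submodules; the LDM-style block that NeMo adapts looks roughly like this (a simplified sketch matching the printed module structure, not the exact NeMo source, and omitting the upsample/downsample paths and the optional checkpointing wrapper the real code has):

```python
import torch.nn as nn

class ResBlockSketch(nn.Module):
    # Simplified sketch only, mirroring the module repr printed above.
    def __init__(self, channels=320, emb_channels=1280):
        super().__init__()
        self.in_layers = nn.Sequential(
            nn.GroupNorm(32, channels),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.emb_layers = nn.Sequential(nn.SiLU(), nn.Linear(emb_channels, channels))
        self.out_layers = nn.Sequential(
            nn.GroupNorm(32, channels),
            nn.Dropout(p=0.0),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.skip_connection = nn.Identity()

    def forward(self, x, emb):
        h = self.in_layers(x)
        h = h + self.emb_layers(emb)[..., None, None]  # emb is consumed here
        h = self.out_layers(h)
        return self.skip_connection(x) + h
```

So the missing argument appears to be the emb passed to forward, not something any submodule expects.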

Anyway, this is deep in code that @t-vi worked on, so I'm going to need to tag him for help. Tom, can you help us identify what went awry here? We're happy to change the input to work around this, but it's not clear what change would help Thunder here.

mruberry commented 1 month ago

triage review: @athitten, can we provide a minimal example for this issue that @t-vi, who works at Lightning AI, can use to reproduce this failure?
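
For reference, a standalone script in roughly this shape might be a starting point for such a minimal example (a sketch only: the module below just isolates the two-positional-argument forward pattern, with made-up shapes rather than the actual NeMo ResBlock):

```python
import torch
import torch.nn as nn
import thunder

class TwoArgBlock(nn.Module):
    """Toy stand-in for ResBlock: forward takes two positional tensors (x, emb)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(320, 320, 3, padding=1)
        self.emb_layers = nn.Sequential(nn.SiLU(), nn.Linear(1280, 320))

    def forward(self, x, emb):
        return self.conv(x) + self.emb_layers(emb)[..., None, None]

jblock = thunder.jit(TwoArgBlock())
x = torch.randn(1, 320, 8, 8)
emb = torch.randn(1, 1280)
out = jblock(x, emb)
print(out.shape)
# If this traces cleanly, the NeMo failure probably depends on how the UNet
# invokes ResBlock (e.g. argument forwarding through wrapper modules), and the
# repro would need to mimic that call path more closely.
```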