athitten opened 1 month ago
The same error comes from adding thunder.jit to the subsequent ResBlock here
Yikes, this is deep in the interpreter:
  File "/workspace/software/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_train.py", line 80, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  ...
  File "/workspace/software/lightning-thunder/thunder/__init__.py", line 473, in get_computation_and_inputs
    jit_results: TraceResults = interpreter(
  File "/workspace/software/lightning-thunder/thunder/__init__.py", line 190, in _general_frontend
    return thunder_general_jit(fn, args, kwargs, sharp_edges=sharp_edges, record_history=record_history)
  File "/workspace/software/lightning-thunder/thunder/core/jit_ext.py", line 1529, in thunder_general_jit
    result = jfn(*args, **kwargs)
  File "/workspace/software/lightning-thunder/thunder/core/interpreter.py", line 6692, in fn_
    raise InterpreterError(msg) from e
thunder.core.interpreter.InterpreterError: Encountered exception TypeError: missing a required argument: 'emb' while tracing [snip]
Since it complains about emb, my first thought is that it's related to the embedding layers ("emb_layers"):
... while tracing ResBlock(
  (in_layers): Sequential(
    (0): GroupNorm(32, 320, eps=1e-05, affine=True)
    (1): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (h_upd): Identity()
  (x_upd): Identity()
  (emb_layers): Sequential(
    (0): SiLU()
    (1): Linear(in_features=1280, out_features=320, bias=True)
  )
  (out_layers): Sequential(
    (0): GroupNorm(32, 320, eps=1e-05, affine=True)
    (1): Dropout(p=0, inplace=False)
    (2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (skip_connection): Identity()
):
but neither of those ops takes an emb parameter.
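The wording of this message matches what Python's inspect.Signature.bind raises when a required parameter is missing. That suggests (an assumption, not checked against the thunder source) that the failure happens when the call's arguments are bound to the traced module's own signature, not inside emb_layers. A minimal sketch, assuming ResBlock.forward takes (x, emb) as in the latent-diffusion UNet code NeMo builds on; ResBlockStandIn is a hypothetical name:

```python
import inspect


class ResBlockStandIn:
    """Hypothetical stand-in for NeMo's ResBlock; only the forward signature matters here."""

    def forward(self, x, emb):
        # Assumed signature: forward(self, x, emb). The body is irrelevant to the error.
        return x


# Signature of the bound method, so `self` is already excluded: (x, emb)
sig = inspect.signature(ResBlockStandIn().forward)

# Supplying both arguments binds cleanly.
bound = sig.bind("x_tensor", "emb_tensor")
print(dict(bound.arguments))

# Supplying only x raises a TypeError naming the missing parameter.
try:
    sig.bind("x_tensor")
except TypeError as err:
    message = str(err)
    print(message)  # on Python 3.10: missing a required argument: 'emb'
```

Under that assumption, the question becomes which caller invokes the jitted ResBlock without emb, rather than anything inside emb_layers.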
Anyway, this is deep in code that @t-vi worked on, so I'm going to need to tag him for help. Tom, can you help us identify what went awry here? We're happy to change the input to work around it, but it's not clear what change would help thunder here.
triage review: @athitten, can we provide a minimal example of this issue that @t-vi, who works at Lightning AI, can use to reproduce the failure?
🐛 Bug
Adding thunder.jit to ResBlock in the UNet stage of NeMo SD raises an error. Judging from the ResBlock call site in the NeMo code, the class is called correctly with the right arguments, so it is unclear why thunder raises this error:
Encountered exception TypeError: missing a required argument: 'emb' while tracing ResBlock
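One hypothesis worth ruling out, offered only as an assumption since resblock.patch is not reproduced here: in the latent-diffusion UNet code that NeMo's Stable Diffusion UNet builds on, a TimestepEmbedSequential container passes emb only to children for which isinstance(layer, TimestepBlock) holds. If thunder.jit returns a wrapper module rather than a ResBlock instance, that check would fail and the container would call the wrapper with x alone, even though the original ResBlock call site looks correct. The plain-Python sketch below reproduces that pattern; every class and function in it is a hypothetical stand-in, and no thunder API is used.

```python
import inspect


class TimestepBlock:
    """Hypothetical marker base class for children that expect an emb argument."""


class ResBlockStandIn(TimestepBlock):
    """Hypothetical stand-in for ResBlock; forward requires both x and emb."""

    def forward(self, x, emb):
        return x + emb

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class JitWrapperStandIn:
    """Hypothetical stand-in for a jit wrapper: delegates calls but is NOT a TimestepBlock."""

    def __init__(self, inner):
        self.inner = inner

    def __call__(self, *args, **kwargs):
        # Assumption: mimic a tracer that binds the call arguments to the
        # wrapped module's forward signature before tracing it.
        inspect.signature(self.inner.forward).bind(*args, **kwargs)
        return self.inner(*args, **kwargs)


def timestep_embed_sequential(layers, x, emb):
    """Mimics the isinstance-based dispatch assumed in TimestepEmbedSequential.forward."""
    for layer in layers:
        if isinstance(layer, TimestepBlock):
            x = layer(x, emb)
        else:
            x = layer(x)  # emb is silently dropped for non-TimestepBlock children
    return x


# Unwrapped: the ResBlock stand-in is recognized and receives emb.
print(timestep_embed_sequential([ResBlockStandIn()], 1, 2))  # 3

# Wrapped: the isinstance check fails, so the wrapper is called with x only.
try:
    timestep_embed_sequential([JitWrapperStandIn(ResBlockStandIn())], 1, 2)
except TypeError as err:
    print(err)
```

If the patch wraps ResBlock instances that sit inside such a container, checking which type the container sees at dispatch time (or applying thunder.jit to the container instead of the child) would confirm or rule this out; neither has been tried here.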
Stack trace of the error can be found here: resblock_error.log
To Reproduce
Steps to reproduce the behavior:
Pull the Docker image:
nvidia-internal-gitlab-host:port/athittenaman/container-images:pjnl-nemo
NeMo is installed in /opt/NeMo
Apply the git patch: resblock.patch
Run Stable Diffusion with the command:
cc: @tfogal