Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Support `ThunderModule` models #19746

Open tfogal opened 6 months ago

tfogal commented 6 months ago

Description & Motivation

I'm trying to get the NeMo multimodal Imagen example to use Thunder, but Lightning itself does not support Thunder:

Error executing job with overrides: ['trainer.precision=16', 'trainer.num_nodes=1', 'trainer.devices=1', '++exp_manager.max_time_per_run=00:00:03:00', 'trainer.max_steps=20', 'model.conditioning.embed_dim=64', 'model.micro_batch_size=1', 'model.global_batch_size=1', 'model.data.synthetic_data=True', 'exp_manager.exp_dir=./foo-imagen-train', 'model.inductor=False', 'model.unet.flash_attention=False']
Traceback (most recent call last):
  File "/home/tfogal/dev/nemo/examples/multimodal/text_to_image/imagen/imagen_training.py", line 61, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
    model = _maybe_unwrap_optimized(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/compile.py", line 132, in _maybe_unwrap_optimized
    raise TypeError(
TypeError: `model` must be a `LightningModule` or `torch._dynamo.OptimizedModule`, got `ThunderModule`

The patch to NeMo that hits this is:

$ git diff examples/
diff --git a/examples/multimodal/text_to_image/imagen/imagen_training.py b/examples/multimodal/text_to_image/imagen/imagen_training.py
index 23c1c9c1a..c30df18fc 100644
--- a/examples/multimodal/text_to_image/imagen/imagen_training.py
+++ b/examples/multimodal/text_to_image/imagen/imagen_training.py
@@ -23,6 +23,7 @@ from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronTrainerB
 from nemo.core.config import hydra_runner
 from nemo.utils import logging
 from nemo.utils.exp_manager import exp_manager
+import thunder

 @hydra_runner(config_path='conf', config_name='base64-500m')
@@ -38,6 +39,7 @@ def main(cfg) -> None:
         cfg.model.precision = cfg.trainer.precision

     model = MegatronImagen(cfg.model, trainer)
+    model = thunder.jit(model)

     if cfg.model.get("inductor", False):
         # Temporary hack to get rid of TorchDynamo issue with DDP

Pitch

It would be great if Lightning could be updated to interoperate with Thunder, so that we could pass the entire model to `thunder.jit` and Lightning could accept and work with the returned `ThunderModule`.

Alternatives

Only use Thunder on smaller pieces of the model.

Additional context

https://github.com/NVIDIA/NeMo/blob/23baa48e441ecb6cc6b49c23bf8cfc076db38bdc/examples/multimodal/text_to_image/imagen/imagen_training.py#L26 is the source for the model.

cc @borda

carmocca commented 6 months ago

We need to add support for this. But in the meantime you should be able to get past the above error by jitting the nn.Module only, which I believe for your example becomes:

model = MegatronImagen(cfg.model, trainer)
model.model = thunder.jit(model.model)

(given https://github.com/NVIDIA/NeMo/blob/c5738263d8b4bedb0957374116d3e90746a51c37/nemo/collections/multimodal/models/text_to_image/imagen/imagen.py#L192).
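To illustrate why jitting only the inner module sidesteps the error, here is a minimal pure-Python sketch. Every class and function in it (`Module`, `LightningModuleStandIn`, `JitWrapper`, `fake_jit`, `check_trainer_input`) is a hypothetical stand-in written for this illustration, not the real Lightning or Thunder code; it only assumes that the trainer performs a type check like the one in the traceback and that the JIT returns a wrapper of a different type.

```python
# Stand-ins only: none of these are the real torch, Lightning, or Thunder types.

class Module:
    """Stand-in for torch.nn.Module."""

    def __call__(self, x):
        return self.forward(x)


class LightningModuleStandIn(Module):
    """Stand-in for a LightningModule that keeps its network in `self.model`."""

    def __init__(self, inner):
        self.model = inner

    def forward(self, x):
        return self.model(x)


class JitWrapper(Module):
    """Stand-in for the wrapper type a JIT returns (ThunderModule in the traceback)."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def forward(self, x):
        return self._wrapped(x)


def fake_jit(module):
    """Hypothetical JIT: returns a wrapper whose type differs from the input's."""
    return JitWrapper(module)


def check_trainer_input(model):
    """Mimics the kind of isinstance check that raised the TypeError above."""
    if not isinstance(model, LightningModuleStandIn):
        raise TypeError(
            f"`model` must be a `LightningModule`, got `{type(model).__name__}`"
        )
    return model


class Doubler(Module):
    """Tiny inner network used for the demonstration."""

    def forward(self, x):
        return 2 * x


# Wrapping the whole module changes its type, so the check rejects it.
outer = fake_jit(LightningModuleStandIn(Doubler()))
try:
    check_trainer_input(outer)
    whole_module_accepted = True
except TypeError:
    whole_module_accepted = False

# Wrapping only the inner network leaves the outer type intact.
lit = LightningModuleStandIn(Doubler())
lit.model = fake_jit(lit.model)
inner_only_accepted = check_trainer_input(lit) is lit
result = lit(3)
```

In this sketch the outer object is still an instance of the expected class after the inner attribute is replaced, so the type check passes while the forward pass still runs through the wrapped network.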

tfogal commented 6 months ago

> But in the meantime you should be able to get past the above error by jitting the nn.Module only

Just coming back to say: thanks! This worked like a charm.

> We need to add support for this.

Agreed! Leaving this issue open to track it, but I have a workaround for now 😄.