Closed: wprazuch closed this issue 3 hours ago
~Mixture of Experts models, and Mixtral in particular, are not currently supported; there's a tracking issue at https://github.com/Lightning-AI/lightning-thunder/issues/194.~ This is actually a problem with the benchmark script itself; see the next comment.
The call `measure_flops(meta_model, model_fwd, model_loss)` inside `benchmarks/benchmark_litgpt.py` (added in https://github.com/Lightning-AI/lightning-thunder/commit/348597fd045903aa232b6811bf6bffa392edbd65) uses meta tensors, and PyTorch rightfully errors out in this case.
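A minimal sketch of why this fails: meta tensors carry only shape and dtype metadata, no storage, so shape propagation works but any operation that needs actual values raises. The tensor below is illustrative and not taken from the benchmark script.

```python
import torch

# Meta tensors have shape/dtype but no data.
m = torch.empty(4, 4, device="meta")

# Shape propagation works fine on the meta device.
out = m @ m
assert out.device.type == "meta"
assert out.shape == (4, 4)

# But anything needing real values fails, e.g. copying off the
# meta device. This is the class of error measure_flops hits when
# the traced forward pass touches data-dependent code paths.
try:
    m.cpu()
except (RuntimeError, NotImplementedError) as e:
    print(f"as expected: {e}")
```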
How useful is this approach to computing model FLOPs? https://github.com/Lightning-AI/lightning-thunder/blob/72e033a0e0dfe44d4770dec2399a9058971003ec/thunder/benchmarks/benchmark_litgpt.py#L387
Can we remove it from the benchmark script, or skip it for unsupported models? @parthmannan, do you have opinions here?
Triage review:
Yes, we can make it optional and enable it only via a script argument. I'll make those changes and submit a PR. cc @carmocca. Carlos, do you think we should file a bug in Lightning, since the error comes from the throughput measurement code in Lightning Fabric?
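A rough sketch of the gating proposed above, assuming an argparse-style CLI; the flag name `--measure-flops` and the surrounding structure are hypothetical, and the real `benchmark_litgpt.py` arguments may differ.

```python
import argparse

# Hypothetical flag: FLOP measurement is off by default and only
# runs when explicitly requested, so unsupported models (e.g. MoE)
# no longer crash the benchmark by default.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--measure-flops",
    action="store_true",
    help="Estimate model TFLOPs via a meta-device forward pass "
         "(may fail for models the tracer does not support)",
)
args = parser.parse_args(["--measure-flops"])

if args.measure_flops:
    # Placeholder for the real call:
    # flops = measure_flops(meta_model, model_fwd, model_loss)
    print("FLOP measurement enabled")
else:
    print("FLOP measurement skipped")
```

Wrapping the call in a try/except and logging a warning for unsupported models would be a complementary safety net even when the flag is set.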
🐛 Bug
An unsupported-operation error is raised when running the models with the compile options `thunder_inductor_cat_cudnn` and `thunder_cudnn`, both with FSDP ZeRO-3, running on 8 nodes with 8 GPUs each.
To Reproduce
Steps to reproduce the behavior:
Run in the container:
Expected behavior
The model should run, or we should get an OOM error.
Environment
As provided in the Docker image.
Additional context
Traceback:
cc @crcrpar