mpatel31415 opened this issue 1 week ago
Hey @eqy this seems to be an eager mode bug, not related to thunder at all. Could you / group take a look at this?
Actually it's related to the benchmark_litgpt.py script. I know one possible fix for it, so I can prepare a PR around Wednesday, but it won't solve the missing results from the `calculate_model_flops` function.
🐛 Bug
When running the benchmarks for Mixtral-8x7B-v0.1 in eager mode we get an error:
I see in the log that there was a message:
It might be caused by the fact that in this code in benchmark_litgpt.py we have both `self.calculate_model_flops()` and the throughput computation inside the same try/except block. I'd put only `calculate_model_flops()` there, but maybe there were some problems in getting the throughput and I'm just not aware of them. Another possible fix is to check whether `tokens_per_sec` is present in the dictionary before accessing it.

To Reproduce
Please use:
8 node(s), each with 8 GPUs. Image "INTERNAL_IMAGE:pjnl-20241001"
Training script:

```shell
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name Mixtral-8x7B-v0.1 \
    --distributed_mode fsdp \
    --shard_mode zero3 \
    --compile eager \
    --checkpoint_activations True \
    --low_precision_mode none \
    --micro_batch_size 1
```
Expected behavior
We should be able to run the benchmarking script even if we are not able to print a few metrics.
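To illustrate the second fix suggested above, here is a minimal sketch of guarded metric access. The function and dictionary names (`report_metrics`, `perf_metrics`) are hypothetical and not taken from the actual benchmark_litgpt.py code; the point is only that a missing `tokens_per_sec` key should degrade to "n/a" rather than crash the run.

```python
# Hypothetical sketch: guard dictionary access so that a metric that failed
# to compute (e.g. because an earlier step raised inside the try block)
# does not cause a KeyError and abort the whole benchmark report.

def report_metrics(perf_metrics: dict) -> str:
    """Build a summary line, skipping metrics that are absent."""
    parts = []
    # Membership test before access: if tokens_per_sec was never filled in,
    # report "n/a" instead of raising KeyError.
    if "tokens_per_sec" in perf_metrics:
        parts.append(f"tokens/s: {perf_metrics['tokens_per_sec']:.1f}")
    else:
        parts.append("tokens/s: n/a")
    # dict.get with a default is an equivalent, more compact alternative.
    model_flops = perf_metrics.get("model_flops")
    if model_flops is not None:
        parts.append(f"model TFLOPs: {model_flops / 1e12:.2f}")
    return " | ".join(parts)
```

Either style (explicit `in` check or `dict.get` with a default) would let the script finish and print the metrics that did succeed.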
Environment
system.device_product_name: DGXH100
system.gpu_driver_version: 535.129.03
libraries.cuda: 12.6.2.004
libraries.pip.lightning: 2.4.0.dev20240728
libraries.pip.lightning-thunder: 0.2.0.dev0
libraries.pip.lightning-utilities: 0.11.7
libraries.pip.litgpt: 0.4.11
libraries.pip.nvfuser: 0.2.13+git4cbd7a4
libraries.pip.pytorch-lightning: 2.4.0
libraries.pip.torch: 2.6.0a0+gitd6d9183
libraries.pip.torchmetrics: 1.4.2
libraries.pip.torchvision: 0.19.0a0+d23a6e1