Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.

TypeError for Mixtral-8x7B-v0.1: unsupported format string passed to NoneType.__format__ #1267

Open · mpatel31415 opened this issue 1 week ago

mpatel31415 commented 1 week ago

🐛 Bug

When running the benchmarks for Mixtral-8x7B-v0.1 in eager mode, we get this error:

    0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 887, in benchmark_main
    0: [rank0]:     print(f"Tokens/s: {benchmark.perf_metrics['tokens_per_sec']:.02f}")
    0: [rank0]: TypeError: unsupported format string passed to NoneType.__format__
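For context, this is the standard error CPython raises when a non-empty format spec is applied to None, so perf_metrics['tokens_per_sec'] must have been None at that point. A minimal repro:

    # Applying a float format spec to None reproduces the exact TypeError:
    f"Tokens/s: {None:.02f}"
    # TypeError: unsupported format string passed to NoneType.__format__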

I see in the log that there was a message:

Model Flops/Throughput calculation failed for model Mixtral-8x7B-v0.1. Skipping throughput metric collection.

It might be caused by this code in benchmark_litgpt.py:

    try:
        # Calculate the model FLOPs
        self.calculate_model_flops()
        # Setup throughput Collection
        self.throughput = Throughput(window_size=self.max_iters - self.warmup_iters, world_size=world_size)
    except:
        self.throughput = None
        print(
            f"Model Flops/Throughput calculation failed for model {self.model_name}. Skipping throughput metric collection."
        )

we have both self.calculate_model_flops() and the Throughput setup in the same try/except block. I'd keep only calculate_model_flops() inside the try (see the sketch below), but maybe there were problems with constructing Throughput that I'm just not aware of.
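A minimal sketch of that restructuring (a hypothetical illustration, not a tested patch), assuming Throughput construction should be allowed to fail loudly:

    try:
        # Only the model-FLOPs calculation may fail silently.
        self.calculate_model_flops()
    except Exception:
        print(f"Model Flops calculation failed for model {self.model_name}.")
    # Set up throughput collection outside the try block, so that a real
    # error in Throughput() surfaces instead of being swallowed.
    self.throughput = Throughput(window_size=self.max_iters - self.warmup_iters, world_size=world_size)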

Another possible fix is to check whether tokens_per_sec is actually set (not None) in perf_metrics before formatting it.
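A sketch of that guard (assuming perf_metrics keeps None for metrics that were never computed, as the traceback suggests):

    # Hypothetical guard around the failing print in benchmark_main:
    tokens_per_sec = benchmark.perf_metrics.get("tokens_per_sec")
    if tokens_per_sec is not None:
        print(f"Tokens/s: {tokens_per_sec:.02f}")
    else:
        print("Tokens/s: n/a (throughput metric collection was skipped)")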

To Reproduce

Please use:

8 nodes, each with 8 GPUs. Image: "INTERNAL_IMAGE:pjnl-20241001"

Training script:

    python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
        --model_name Mixtral-8x7B-v0.1 \
        --distributed_mode fsdp \
        --shard_mode zero3 \
        --compile eager \
        --checkpoint_activations True \
        --low_precision_mode none \
        --micro_batch_size 1

Expected behavior

We should be able to run the benchmarking script even if we are not able to print a few metrics.

Environment

    system.device_product_name          DGXH100
    system.gpu_driver_version           535.129.03
    libraries.cuda                      12.6.2.004
    libraries.pip.lightning             2.4.0.dev20240728
    libraries.pip.lightning-thunder     0.2.0.dev0
    libraries.pip.lightning-utilities   0.11.7
    libraries.pip.litgpt                0.4.11
    libraries.pip.nvfuser               0.2.13+git4cbd7a4
    libraries.pip.pytorch-lightning     2.4.0
    libraries.pip.torch                 2.6.0a0+gitd6d9183
    libraries.pip.torchmetrics          1.4.2
    libraries.pip.torchvision           0.19.0a0+d23a6e1

tfogal commented 4 days ago

Hey @eqy, this seems to be an eager-mode bug, not related to Thunder at all. Could you or your group take a look at this?

mpatel31415 commented 1 day ago

Actually, it's related to the benchmark_litgpt.py script. I know one possible fix for it, so I can prepare a PR around Wednesday, but it won't solve the missing results from the calculate_model_flops function.