wprazuch opened 1 month ago
🐛 Bug

When running tiny-llama-1.1b in Thunder we get an error:
Full traceback:
To Reproduce
Please use 1 node with 8 GPUs. Image: "INTERNAL_IMAGE:pjnl-20241011"

Training script:

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name tiny-llama-1.1b \
    --distributed_mode fsdp \
    --shard_mode zero3 \
    --compile thunder \
    --checkpoint_activations False \
    --low_precision_mode none \
    --micro_batch_size 18
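For readers unfamiliar with the flags: in plain PyTorch terms, --distributed_mode fsdp with --shard_mode zero3 corresponds to FullyShardedDataParallel with FULL_SHARD sharding. The sketch below is our illustration of that mapping, not how benchmark_litgpt.py is actually implemented; wrap_zero3 is a hypothetical helper name.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_zero3(model: torch.nn.Module) -> FSDP:
    # FULL_SHARD shards parameters, gradients, and optimizer state
    # across ranks (ZeRO-3 style). Assumes
    # torch.distributed.init_process_group("nccl") has already run.
    return FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

With FULL_SHARD, the 1.1B parameters are split across the 8 ranks, so per-GPU parameter memory is small; the micro-batch size mainly drives activation memory.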
Expected behavior
We should not see this error.
Environment
system.device_product_name         DGXH100
system.gpu_driver_version          535.129.03
libraries.cuda                     12.6.2.004
libraries.pip.lightning            2.4.0.dev20240728
libraries.pip.lightning-thunder    0.2.0.dev0
libraries.pip.lightning-utilities  0.11.7
libraries.pip.litgpt               0.4.11
libraries.pip.nvfuser              0.2.15+gitf3a2087
libraries.pip.pytorch-lightning    2.4.0
libraries.pip.torch                2.6.0a0+git4e89977
libraries.pip.torchmetrics         1.4.3
libraries.pip.torchvision          0.19.0a0+d23a6e1
Comments

Failed to reproduce the error on a machine with 8 H100 cards in an interactive session. This could be a one-off problem with either the software or the hardware. Wojciech, please let us know if the problem persists in the next runs!

It could be due to the high micro-batch size (--micro_batch_size 18). Sometimes PyTorch and CUDA are not good at communicating that there is not enough memory (https://stackoverflow.com/a/64040256).
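If memory pressure is the cause, one way to confirm it is to wrap a training step and report free GPU memory at the moment of failure. This is a minimal sketch under our assumptions, not part of the benchmark script; run_step_with_oom_check and step_fn are hypothetical names.

import torch

def run_step_with_oom_check(step_fn):
    # step_fn is a hypothetical zero-argument callable that runs one
    # training step. A CUDA OOM sometimes surfaces as a generic
    # RuntimeError rather than torch.cuda.OutOfMemoryError, so we
    # report free memory at failure time to tell a real OOM apart
    # from an unrelated crash.
    try:
        return step_fn()
    except (torch.cuda.OutOfMemoryError, RuntimeError) as err:
        free_b, total_b = torch.cuda.mem_get_info()  # bytes on the current device
        print(f"step failed: {err}")
        print(f"free GPU memory at failure: {free_b / 2**30:.1f} of {total_b / 2**30:.1f} GiB")
        print(torch.cuda.memory_summary(abbreviated=True))
        raise

If free memory is near zero when the step fails, rerunning with a smaller --micro_batch_size or with --checkpoint_activations True would be the natural next check.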