Open ichernev opened 3 weeks ago
This is the nvidia-smi output after the crash:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   36C    P0             115W / 700W |  26911MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:3B:00.0 Off |                    0 |
| N/A   29C    P0             110W / 700W |  27457MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:4C:00.0 Off |                    0 |
| N/A   30C    P0             114W / 700W |  48119MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:5D:00.0 Off |                    0 |
| N/A   34C    P0             115W / 700W |  55873MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
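Before the crash, memory usage is roughly the same across all GPUs, but something blows up at the end: after the crash, GPUs 2 and 3 sit at 48119MiB and 55873MiB with 100% utilization, while GPUs 0 and 1 are at about 27000MiB and 0%.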
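To pin down when the imbalance starts, per-GPU memory could be sampled while the job runs. Below is a minimal sketch, not something from the repo: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV mode. The helper names (`parse_gpu_csv`, `memory_outliers`, `watch`) and the 1.5x threshold are hypothetical choices for illustration.

```python
import subprocess
import time

# Real nvidia-smi flags: one CSV row per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, memory.used, utilization.gpu' rows into dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, mem_mib, util = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "mem_mib": int(mem_mib), "util": int(util)})
    return rows


def memory_outliers(rows, ratio=1.5):
    """Indices of GPUs using more than `ratio` times the least-loaded GPU's memory.

    The 1.5x threshold is an arbitrary illustration, not a tuned value.
    """
    baseline = min(r["mem_mib"] for r in rows)
    return [r["index"] for r in rows if r["mem_mib"] > ratio * baseline]


def watch(interval_s=5.0, samples=10):
    """Print a timestamped sample every `interval_s` seconds, `samples` times."""
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        rows = parse_gpu_csv(out)
        print(time.strftime("%H:%M:%S"), rows, "outliers:", memory_outliers(rows))
        time.sleep(interval_s)
```

Running `watch()` in a second terminal alongside the training job would show the step at which GPUs 2 and 3 start diverging. Fed the post-crash numbers from the table above, `memory_outliers` flags exactly those two GPUs.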
Running on 4xH100, as specified in the README:
I ran it with NCCL_DEBUG=INFO and didn't see anything interesting, other than maybe the end of the output (let me know if you need the full log):