CuDNN version: 9.1.0 (checked with python -c 'import cudnn; print(cudnn.backend_version_string())')
nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000002:00:01.0 Off | 0 |
| N/A 31C P0 116W / 700W | 885MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000002:00:02.0 Off | 0 |
| N/A 30C P0 72W / 700W | 3MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000002:00:03.0 Off | 0 |
| N/A 29C P0 70W / 700W | 3MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000002:00:04.0 Off | 0 |
| N/A 27C P0 68W / 700W | 3MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000003:00:01.0 Off | 0 |
| N/A 28C P0 68W / 700W | 3MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000003:00:02.0 Off | 0 |
| N/A 29C P0 73W / 700W | 3MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000003:00:03.0 Off | 0 |
| N/A 30C P0 70W / 700W | 3MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000003:00:04.0 Off | 0 |
| N/A 27C P0 69W / 700W | 3MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 239227 C /tmp/env/bin/python 876MiB |
+---------------------------------------------------------------------------------------+
I've attached both the benchmarking code and the CuDNN wrapper code to this post. I suspect the benchmarking code is off, so I'll switch to something simpler (like the PyTorch profiler) and see what the results look like.
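For what it's worth, this is the kind of simpler timing loop I have in mind: CUDA events around the forward and backward passes, timed separately. The shapes, iteration counts, and the time_fwd_bwd / attn_fn names are placeholders I made up for this sketch, not the attached benchmark code:

```python
import torch
import torch.nn.functional as F

def time_fwd_bwd(attn_fn, B=8, H=16, S=4096, D=128, iters=20, warmup=5):
    # Random half-precision inputs in the (B, H, S, D) layout SDPA expects.
    q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16,
                           requires_grad=True) for _ in range(3))
    fwd_ms, bwd_ms = 0.0, 0.0
    for i in range(warmup + iters):
        start, mid, end = (torch.cuda.Event(enable_timing=True) for _ in range(3))
        start.record()
        out = attn_fn(q, k, v)
        mid.record()
        out.backward(torch.ones_like(out))
        end.record()
        torch.cuda.synchronize()
        if i >= warmup:
            fwd_ms += start.elapsed_time(mid)   # forward time in ms
            bwd_ms += mid.elapsed_time(end)     # backward time in ms
        q.grad = k.grad = v.grad = None
    return fwd_ms / iters, bwd_ms / iters

# Example: baseline with PyTorch's built-in SDPA.
print(time_fwd_bwd(F.scaled_dot_product_attention))
```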
I wrote a helper that lets someone use CuDNN attention within PyTorch seamlessly.
However, while it gets better forward-pass performance, it gets far worse backward-pass performance. Any thoughts on why this might be the case? I'm hoping there's some obvious deficiency in my code.
(Unit is ms).
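One cross-check I might also try: recent PyTorch builds (2.5+, if I understand correctly) expose cuDNN attention as a backend of scaled_dot_product_attention directly. Pinning SDPA to that backend and timing it with the same loop would tell me whether the slow backward comes from my wrapper / graph setup or from the cuDNN kernel itself. A minimal sketch, assuming SDPBackend.CUDNN_ATTENTION exists in the installed PyTorch:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Placeholder shapes; in practice I'd reuse the same inputs as the benchmark above.
q, k, v = (torch.randn(8, 16, 4096, 128, device="cuda", dtype=torch.float16,
                       requires_grad=True) for _ in range(3))

# Restrict SDPA to the cuDNN backend only, so it errors out if cuDNN attention
# is unavailable for these shapes/dtypes instead of silently falling back
# to the flash or math backends.
with sdpa_kernel([SDPBackend.CUDNN_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v)
    out.backward(torch.ones_like(out))
```

If the backward pass is also slow through this path, the problem is more likely the kernel or the problem shape than my wrapper.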