doronkg opened 10 months ago
Status update: we've upgraded the NVIDIA GPU Operator to v22.9.2 and the NVIDIA GPU Driver to v525.85.12.
v22.9.2 installs Driver v525.60.13 by default; to install v525.85.12 instead, we added the following configuration to the clusterpolicy.nvidia.com CRD instance:
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
...
spec:
  driver:
    image: >-
      nvcr.io/nvidia/driver:525.85.12-rhcos4.10
...
```
After the installation, we restarted the nodes and waited for all the nvidia-gpu-operator pods to run successfully.
We used the following NVIDIA performance testing tool to run a benchmark on the H100 GPU cards.
We used this Deployment to execute the benchmark in parallel on all GPUs in the node:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-benchmark
  namespace: gpu-tests
spec:
  replicas: 8
  selector:
    matchLabels:
      app: gpu-benchmark
  template:
    metadata:
      labels:
        app: gpu-benchmark
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
      containers:
        - name: gpu-benchmark
          image: nvcr.io/nvidia/pytorch:23.10-py3
          command:
            - bash
            - '-c'
            - >
              python
              ./DeepLearningExamples/PyTorch/Classification/ConvNets/multiproc.py
              --nproc_per_node 1
              ./DeepLearningExamples/PyTorch/Classification/ConvNets/launch.py
              --model resnet50 --precision AMP --mode benchmark_training
              --platform DGXA100 --data-backend synthetic --raport-file
              benchmark.json --epochs 1 --prof 100 ./ && sleep infinity
          resources:
            limits:
              cpu: 500m
              memory: 2G
              nvidia.com/gpu: '1'
            requests:
              cpu: 500m
              memory: 2G
```
The benchmark resulted in a significant performance improvement! We compared the train.total_ips metric (images processed per second) between the two executions:
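For anyone repeating the comparison, here is a minimal sketch of how the train.total_ips values can be pulled out of the --raport-file output. The exact JSON layout of the DeepLearningExamples report is an assumption on my part, so the helper searches the document recursively for the key instead of relying on a fixed schema:

```python
import json
from statistics import mean

def find_metric(node, key):
    """Recursively collect every value stored under `key` in a nested JSON document."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            else:
                found.extend(find_metric(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_metric(item, key))
    return found

def average_ips(report_path, key="train.total_ips"):
    """Average all occurrences of the throughput metric in one benchmark report."""
    with open(report_path) as f:
        report = json.load(f)
    values = [v for v in find_metric(report, key) if isinstance(v, (int, float))]
    return mean(values) if values else None
```

Running this over the benchmark.json produced by each of the 8 pods gives a per-GPU average that can be compared across driver versions.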
It's safe to say that the Driver upgrade was essential to achieving better and more stable performance. The Driver v525.85.12 docs include several references to H100 bug fixes and performance improvements.
We're looking forward to upgrading the NVIDIA GPU Operator to later versions and progressing towards the R535 Driver family.
UPDATE:
We've upgraded to NVIDIA GPU Operator v23.3.2 with GPU Driver v535.104.12 (the recommended version, not the default).
The benchmark resulted in a train.total_ips average of ~2600 ips in each iteration.
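As a rough sanity check on node-level throughput, the per-GPU average can be scaled by the number of parallel benchmark pods. The 8-replica count and the ~2600 ips figure come from above; the assumption that the pods run independently (one GPU each, synthetic data, no interference) is mine:

```python
def aggregate_throughput(per_gpu_ips, num_gpus):
    """Estimate whole-node images/sec, assuming independent, non-interfering per-GPU runs."""
    return per_gpu_ips * num_gpus

# ~2600 ips per H100 (R535 run above) across the 8 GPUs of an HGX H100 node
node_ips = aggregate_throughput(2600, 8)  # 20800 images/sec for the node
```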
Hi, we're maintaining an OpenShift v4.10 cluster and recently provisioned Dell PowerEdge XE9680 servers as GPU nodes. We're working with NVIDIA GPU Operator v22.9.1 for now (aware of the EOL), and while the GPUs seem to be exposed and usable, we aren't seeing the GPU performance we were expecting.
These servers are based on NVIDIA HGX H100 architecture, and according to the NVIDIA GPU Operator v22.9.2 release notes:
Does that mean upgrading the operator and the driver to this version could resolve the reduced performance? Could you please elaborate on the improvements in this driver version?
In addition, which benchmarking tools would you recommend to test these GPUs?