NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

PowerEdge XE9680 H100 Support #611

Open doronkg opened 10 months ago

doronkg commented 10 months ago

Hi, we're maintaining an OpenShift v4.10 cluster and recently provisioned Dell PowerEdge XE9680 servers as GPU nodes. We are working with NVIDIA GPU Operator v22.9.1 for now (aware of the EOL), and while the GPUs appear to be exposed and usable, we aren't seeing the GPU performance we were expecting.

These servers are based on NVIDIA HGX H100 architecture, and according to the NVIDIA GPU Operator v22.9.2 release notes:

  • Added support for the NVIDIA HGX H100 System in the Supported NVIDIA GPUs and Systems table on the Platform Support page.
  • Added 525.85.12 as the recommended driver version and 3.1.6 as the recommended DCGM version in the GPU Operator Component Matrix. These updates enable support for the NVIDIA HGX H100 System.

Does that mean that upgrading the operator and the driver to this version could resolve the reduced performance? Could you please elaborate on the improvements in this driver version?

In addition, which benchmarking tools would you recommend to test these GPUs?

doronkg commented 10 months ago

Updating status: we've upgraded the NVIDIA GPU Operator to v22.9.2 and the NVIDIA GPU Driver to v525.85.12. v22.9.2 installs Driver v525.60.13 by default; in order to install v525.85.12, we added the following config to the clusterpolicy.nvidia.com CRD instance:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
...
spec:
  driver:
    image: >-
      nvcr.io/nvidia/driver:525.85.12-rhcos4.10
...
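
For reference, the same override could also be applied non-interactively. A minimal sketch, assuming the ClusterPolicy instance created by the operator is named gpu-cluster-policy (the usual name on OpenShift; adjust to your instance):

# assumption: the ClusterPolicy instance is named gpu-cluster-policy
oc patch clusterpolicy gpu-cluster-policy --type merge \
  -p '{"spec": {"driver": {"image": "nvcr.io/nvidia/driver:525.85.12-rhcos4.10"}}}'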

After the installation, we restarted the nodes and waited for all the nvidia-gpu-operator pods to run successfully. We then used the NVIDIA DeepLearningExamples ConvNets training benchmark to test performance on the H100 GPU cards.
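
Before running the benchmark, it may be worth confirming that the new driver actually took effect on the nodes. A minimal sketch, assuming the operator runs in the nvidia-gpu-operator namespace and the driver pods carry the app=nvidia-driver-daemonset label (both are assumptions, adjust to your install):

# check that all GPU Operator pods came back up after the node restart
oc get pods -n nvidia-gpu-operator

# print the driver version reported inside a driver daemonset pod
DRIVER_POD=$(oc get pods -n nvidia-gpu-operator \
  -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
oc exec -n nvidia-gpu-operator "$DRIVER_POD" -- \
  nvidia-smi --query-gpu=name,driver_version --format=csv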

We used the following Deployment to execute the benchmark in parallel on all GPUs in the node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-benchmark
  namespace: gpu-tests
spec:
  replicas: 8
  selector:
    matchLabels:
      app: gpu-benchmark
  template:
    metadata:
      labels:
        app: gpu-benchmark
    spec: 
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
      containers:
        - name: gpu-benchmark
          image: nvcr.io/nvidia/pytorch:23.10-py3
          command:
            - bash
            - '-c'
            - >
              python
              ./DeepLearningExamples/PyTorch/Classification/ConvNets/multiproc.py
              --nproc_per_node 1
              ./DeepLearningExamples/PyTorch/Classification/ConvNets/launch.py
              --model resnet50 --precision AMP --mode benchmark_training
              --platform DGXA100 --data-backend synthetic --raport-file
              benchmark.json --epochs 1 --prof 100 ./ && sleep infinity
          resources:
            limits:
              cpu: 500m
              memory: 2G
              nvidia.com/gpu: '1'
            requests:
              cpu: 500m
              memory: 2G
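
Rolling this out is straightforward; a minimal sketch, assuming the manifest above is saved locally as gpu-benchmark.yaml (the filename is arbitrary):

# create the Deployment and wait for all 8 replicas to schedule, one per GPU
oc apply -f gpu-benchmark.yaml
oc get pods -n gpu-tests -w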

The benchmark resulted in a significant performance improvement! We compared the train.total_ips metric (images processed per second) between the two executions, before and after the driver upgrade.
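
If it's useful to anyone reproducing this, a sketch of how the per-pod throughput can be pulled out of the logs; the exact log line format comes from the dllogger output of the ConvNets scripts, which is an assumption here:

# grep the reported train.total_ips values from each benchmark pod
for POD in $(oc get pods -n gpu-tests -l app=gpu-benchmark -o name); do
  echo "== $POD =="
  oc logs -n gpu-tests "$POD" | grep train.total_ips | tail -n 3
done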

It's safe to say that the driver upgrade was essential to achieving better and more stable performance. The Driver v525.85.12 documentation includes several references to H100 bug fixes and performance improvements.

We're looking forward to upgrading the NVIDIA GPU Operator to later versions and progressing towards the R535 Driver family.

doronkg commented 8 months ago

UPDATE: We've upgraded to NVIDIA GPU Operator v23.3.2 with GPU Driver v535.104.12 (the recommended version, not the default). The benchmark resulted in an average train.total_ips of ~2600 images/s in each iteration.
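
For anyone following the same path, the non-default driver can be pinned in the same way as before. A sketch, again assuming the ClusterPolicy instance is named gpu-cluster-policy; the rhcos suffix of the tag must match your cluster version and is left as a placeholder here:

# <ocp-minor> is a placeholder for the OpenShift/RHCOS minor version of the nodes
oc patch clusterpolicy gpu-cluster-policy --type merge \
  -p '{"spec": {"driver": {"image": "nvcr.io/nvidia/driver:535.104.12-rhcos4.<ocp-minor>"}}}'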