aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0
830 stars, 312 forks

NVidia driver installation #6416

Closed jagga13 closed 1 month ago

jagga13 commented 2 months ago

Hello,

I am wondering what the correct process is for installing the NVIDIA driver and CUDA toolkit in a pcluster environment. We have a GitLab CI pipeline that builds our base AMI, which we then run through the pcluster build-image process. The resulting pcluster AMI is further processed by downstream pipelines to create AMIs for the HeadNode, ComputeNode and LoginNode. During the ComputeNode child pipeline, we use Ansible roles to install a specific NVIDIA driver and CUDA toolkit. I am able to use these AMIs to spin up a new PCluster environment and have been using non-GPU nodes successfully. However, for our GPU-based partitions, I am seeing the following in the gres configuration:

# cat slurm_parallelcluster_p3-4v100-32c-244gb_partition.conf
# This file is automatically generated by pcluster

NodeName=p3-4v100-32c-244gb-dy-comp-[1-5] CPUs=32 RealMemory=237363 State=CLOUD Feature=dynamic,p3.8xlarge,comp,gpu Weight=1000

NodeSet=p3-4v100-32c-244gb_nodes Nodes=p3-4v100-32c-244gb-dy-comp-[1-5]
PartitionName=p3-4v100-32c-244gb Nodes=p3-4v100-32c-244gb_nodes MaxTime=INFINITE State=UP

# cat slurm_parallelcluster_p3-4v100-32c-244gb_gres.conf
# This file is automatically generated by pcluster
# Skipping GPUs configuration because Nvidia driver is not installed

Shouldn't the above gres config list the appropriate GPUs? Do we have to install the NVIDIA driver via the pcluster build-image process? I am testing the following in my pcluster build-image config, but I am not sure whether this is the preferred method:

DevSettings:
  Cookbook:
    ExtraChefAttributes: |
      {"cluster": {"nvidia": {"enabled": true, "driver_version": "550.54.14"}}}

It would be good to know the supported way to install the NVIDIA driver and the CUDA toolkit on a pcluster-based compute AMI.

Thanks!

jagga13 commented 1 month ago

It turns out this was due to nvidia-smi missing from the HeadNode AMI. Even though the head node has no need for a GPU-based instance, there is a dependency on nvidia-smi being available on the head node. We use different AMIs built specifically for the head node and the compute nodes, and did not feel the need to include the NVIDIA driver in the head node AMI. Hopefully this helps anyone else who runs into a similar issue:

#
# Check if Nvidia driver is installed
# TODO: verify if it can be moved to platform cookbook later.
#
def nvidia_installed?
  nvidia_installed = ::File.exist?('/usr/bin/nvidia-smi')
  Chef::Log.warn("Nvidia driver is not installed") unless nvidia_installed
  nvidia_installed
end
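
For anyone who wants to test an AMI before launching a cluster from it, the check quoted above can be reproduced outside of Chef. The following is only a minimal sketch: `nvidia_smi_present?` is a hypothetical helper name, not part of ParallelCluster, and the only detail taken from the snippet above is the `/usr/bin/nvidia-smi` path. The path parameter is an addition so the check can be pointed at another location.

```ruby
# Hypothetical standalone helper mirroring the check quoted above.
# The default path is the one from the quoted snippet; the parameter
# exists only so the check can be run against other locations.
def nvidia_smi_present?(path = '/usr/bin/nvidia-smi')
  File.exist?(path)
end

# When run directly, report what the check would see on this machine.
if __FILE__ == $PROGRAM_NAME
  if nvidia_smi_present?
    puts 'nvidia-smi found at /usr/bin/nvidia-smi'
  else
    puts 'nvidia-smi missing: expect "Skipping GPUs configuration" in the generated gres.conf'
  end
end
```

Running this on an instance launched from the head node AMI should show whether that image would hit the same warning.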

Do we know whether the NVIDIA driver version on the head node needs to match that of the compute nodes as well? I assume it does not, but I want to be sure.

Thanks.

dreambeyondorange commented 1 month ago

Assuming you are using different AMIs for the head and compute nodes, the driver versions do not need to match.