Turns out this was due to nvidia-smi missing from the HeadNode AMI. Even though the head node has no need for a GPU-based instance, there is a dependency on having nvidia-smi available on the head node. We use different AMIs that are built specifically for the head node vs the compute node and did not feel the need to inject the NVidia driver into the head node AMI. Hopefully this can help anyone else who runs into a similar issue:
#
# Check if Nvidia driver is installed
# TODO: verify if it can be moved to platform cookbook later.
#
def nvidia_installed?
  nvidia_installed = ::File.exist?('/usr/bin/nvidia-smi')
  Chef::Log.warn("Nvidia driver is not installed") unless nvidia_installed
  nvidia_installed
end
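For context, a minimal sketch of how a check like this is typically consumed in a recipe (purely illustrative, not the actual cookbook code):
# Purely illustrative sketch, not taken from the ParallelCluster cookbooks:
# GPU-specific configuration gets skipped when nvidia-smi is absent, which is
# why a head node AMI without the driver can behave unexpectedly.
if nvidia_installed?
  Chef::Log.info('Nvidia driver detected; applying GPU-specific configuration')
  # ... GPU-related resources (e.g. gres setup) would run here ...
else
  Chef::Log.warn('Nvidia driver not installed; skipping GPU-specific configuration')
end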
Do we know if the NVidia driver version on the headnode needs to match that of the compute node as well? I assume that it does not, but want to be sure.
Thanks.
Assuming you are using different AMIs for the head and compute nodes, they don't need to match in terms of driver version.
Hello,
I am wondering what the correct process is to install the NVidia driver and CUDA toolkit in a pcluster environment. We have a GitLab CI pipeline that builds our base AMI, which we then run through the pcluster build-image process. The resulting pcluster AMI is then further processed via downstream pipelines to create AMIs for the HeadNode, ComputeNode, and LoginNode. During the ComputeNode child pipeline, we use Ansible roles to install a specific NVidia driver and CUDA toolkit. I am able to use these AMIs to spin up a new PCluster environment and have been using non-GPU nodes successfully. However, for our GPU-based partitions, I am seeing the following in the gres configuration:

Shouldn't the above gres config have the appropriate GPUs listed? Do we have to install the NVidia driver somehow via the pcluster build-image process? I am in the process of testing the following in my pcluster build-image config, but am not sure if this is the preferred method:
It would be good to know the supported way to install the NVidia driver and the CUDA toolkit on a pcluster-based compute AMI.
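For what it's worth, what I would expect the gres configuration to contain for a GPU partition is something along these lines (the queue/node and device names here are just placeholders):
# Placeholder example of a populated Slurm gres.conf entry for a GPU queue
NodeName=gpu-queue-st-gpu-cr-[1-4] Name=gpu Count=1 File=/dev/nvidia0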
Thanks!