The issue
On clusters created with the official Ubuntu 18.04 and Ubuntu 20.04 AMIs, nvidia-fabricmanager is automatically updated to an incompatible version and stops working when nodes are launched.
The impact is limited to EC2 instances and applications that use NVIDIA Fabric Manager. At the time of writing, only p4d instances are affected.
Affected ParallelCluster versions: >= 2.10.0
The root cause
The issue started on Jul 21, 2021, when Ubuntu published the nvidia-fabricmanager package to its official repo: http://archive.ubuntu.com/ubuntu/pool/multiverse/f/fabric-manager-460/. Since then, unattended-upgrades, which is enabled by default on ParallelCluster Ubuntu AMIs, has been upgrading Fabric Manager to a version that is incompatible with the installed NVIDIA drivers.
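To make the incompatibility concrete, here is a minimal Python sketch of a version check. It assumes that Fabric Manager is compatible only when its upstream version equals the installed driver version, and that the package version follows the usual Debian `[epoch:]upstream[-revision]` layout. The function names and the version strings in the usage lines are hypothetical and for illustration only; they are not taken from ParallelCluster or from the affected AMIs.

```python
def upstream_version(package_version: str) -> str:
    """Return the upstream part of a Debian-style package version.

    Drops an optional epoch prefix ("1:") and an optional revision
    suffix ("-1", "-0ubuntu0.20.04.1"), leaving e.g. "460.73.01".
    """
    without_epoch = package_version.strip().split(":", 1)[-1]
    # The Debian revision is everything after the last hyphen, if any.
    return without_epoch.rsplit("-", 1)[0]


def fabric_manager_matches_driver(driver_version: str,
                                  fabric_manager_package_version: str) -> bool:
    """Assumption: compatibility means an exact upstream version match."""
    return upstream_version(fabric_manager_package_version) == driver_version.strip()


# Illustrative values only (hypothetical versions, not read from a real node).
print(fabric_manager_matches_driver("460.73.01", "460.73.01-1"))
print(fabric_manager_matches_driver("460.73.01", "460.91.03-0ubuntu0.20.04.1"))
```

On a real node, the two inputs would have to come from the driver and from the package manager; how to obtain them is left out here because the exact package name on the AMI is not stated above.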
The workaround
See https://github.com/aws/aws-parallelcluster/wiki/NVIDIA-Fabric-Manager-stops-running-on-Ubuntu-18.04-and-Ubuntu-20.04