aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

NVIDIA Fabric Manager stops running on Ubuntu 18.04 and Ubuntu 20.04 #3034

Closed (demartinofra closed 3 years ago)

demartinofra commented 3 years ago

The issue

On clusters created with the official Ubuntu 18.04 and Ubuntu 20.04 AMIs, nvidia-fabricmanager is automatically updated to an incompatible version when nodes are launched, and then stops working.

The impact is limited to EC2 instances and applications that make use of NVIDIA Fabric Manager. At the time of writing, only p4d instances are affected.

Affected ParallelCluster versions: >= 2.10.0

The root cause

The issue started on Jul 21, 2021, when Ubuntu published the nvidia-fabricmanager package to its official repo: http://archive.ubuntu.com/ubuntu/pool/multiverse/f/fabric-manager-460/. Since then, unattended-upgrades, which is enabled by default on ParallelCluster Ubuntu AMIs, has been upgrading Fabric Manager to a version that is incompatible with the installed NVIDIA drivers.
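To check whether a node has been hit by this, one option is to compare the installed driver version with the installed Fabric Manager package version. The following is a minimal sketch, not the official workaround. It assumes that Fabric Manager has to carry the same upstream version as the NVIDIA driver, and the helper names (`upstream_version`, `versions_match`, `installed_driver_version`, `installed_package_version`) are hypothetical, introduced here for illustration only.

```python
import subprocess


def upstream_version(debian_version: str) -> str:
    """Strip the optional epoch ("N:") and the Debian revision ("-rev")."""
    without_epoch = debian_version.split(":", 1)[-1]
    # The Debian revision is whatever follows the last hyphen, if there is one.
    upstream, sep, _revision = without_epoch.rpartition("-")
    return upstream if sep else without_epoch


def versions_match(driver_version: str, fm_package_version: str) -> bool:
    """Assumption: Fabric Manager must match the driver's upstream version exactly."""
    return driver_version.strip() == upstream_version(fm_package_version.strip())


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version (first GPU is enough)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()[0].strip()


def installed_package_version(package: str) -> str:
    """Ask dpkg for the installed version of the given package."""
    return subprocess.run(
        ["dpkg-query", "--show", "--showformat=${Version}", package],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
```

On a GPU node, `versions_match(installed_driver_version(), installed_package_version(pkg))` would return `False` after an incompatible upgrade. The value of `pkg` depends on how Fabric Manager was installed, so check the actual package name with `dpkg -l` before relying on it.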

The workaround

See https://github.com/aws/aws-parallelcluster/wiki/NVIDIA-Fabric-Manager-stops-running-on-Ubuntu-18.04-and-Ubuntu-20.04

lukeseawalker commented 3 years ago

Fixed in version 2.11.2