GoogleCloudPlatform / slurm-gcp

Apache License 2.0

Resolve observed failure of NCCL plugin installation #219

Closed by tpdownes 4 days ago

tpdownes commented 4 days ago

NCCL plugin installation fails when the default Docker network profile is used, because under that profile the container does not bind to a real interface that can route to the instance metadata server. As a result, machine-type verification fails in some cases.

An additional commit applies shfmt formatting rules throughout the scripts.
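For readers unfamiliar with the failure mode, here is a minimal sketch of a machine-type lookup against the instance metadata server. It is not the code changed in this PR. The function names `machine_type_from_path` and `query_machine_type` are hypothetical, and the 5-second timeout is an arbitrary choice. The metadata URL and the `Metadata-Flavor: Google` header are the standard Compute Engine ones.

```shell
#!/bin/sh
# Illustrative sketch only; function names are hypothetical and this is
# not the script modified by the PR.

# The metadata server returns the machine type as a path of the form
#   projects/<project-number>/machineTypes/<type>
# Strip everything up to the last "/" to get the short type name.
machine_type_from_path() {
    printf '%s\n' "${1##*/}"
}

# Ask the metadata server for this VM's machine type.
# Returns non-zero when the server cannot be reached, which is what
# happens from a network namespace that has no route to it.
query_machine_type() {
    mt_path=$(curl -sf --max-time 5 -H 'Metadata-Flavor: Google' \
        'http://metadata.google.internal/computeMetadata/v1/instance/machine-type') || return 1
    machine_type_from_path "$mt_path"
}
```

Calling `query_machine_type` on the host and again inside the container would show whether the metadata server is reachable from each. Running the container on the host's network (for example with Docker's `--network=host` option) is one common way to make it reachable; whether this PR takes that approach is an assumption, not something stated in the thread.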

tpdownes commented 4 days ago

This was manually tested on the default a3-highgpu-8g blueprint in https://github.com/GoogleCloudPlatform/cluster-toolkit/ and observed to work.

Before the change:

$ srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so
0: d20b62ba38cd140c54a16d46982a43ef  /var/lib/tcpx/lib64/libnccl-net.so
1: d20b62ba38cd140c54a16d46982a43ef  /var/lib/tcpx/lib64/libnccl-net.so

After the change (the checksum differs from the one above and matches on both nodes):

$ srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so
1: 293526e53c204f583903a51fde9aed58  /var/lib/tcpx/lib64/libnccl-net.so
0: 293526e53c204f583903a51fde9aed58  /var/lib/tcpx/lib64/libnccl-net.so
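The per-node comparison above can be automated. The sketch below is a hypothetical helper (the name `hashes_agree` is not part of slurm-gcp) that reads `srun --label md5sum` output on standard input and succeeds only when every node reports the same checksum. It checks agreement within a single run; it does not tell you whether the checksum is the expected one.

```shell
#!/bin/sh
# Hypothetical helper, not part of slurm-gcp.
# Input lines look like: "<task>: <md5>  <path>", so the hash is field 2.
# Exits 0 when exactly one distinct hash is seen, 1 otherwise
# (including when the input is empty).
hashes_agree() {
    awk '{ seen[$2] = 1 }
         END { n = 0; for (h in seen) n++; exit (n == 1 ? 0 : 1) }'
}
```

Example usage, reusing the command from the test above:

```shell
srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so | hashes_agree && echo "plugin consistent across nodes"
```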