Closed by tpdownes 4 days ago
This was manually tested on the default a3-highgpu-8g blueprint in https://github.com/GoogleCloudPlatform/cluster-toolkit/ and observed to work.
Before the change:
$ srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so
0: d20b62ba38cd140c54a16d46982a43ef /var/lib/tcpx/lib64/libnccl-net.so
1: d20b62ba38cd140c54a16d46982a43ef /var/lib/tcpx/lib64/libnccl-net.so
After the change:
$ srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so
1: 293526e53c204f583903a51fde9aed58 /var/lib/tcpx/lib64/libnccl-net.so
0: 293526e53c204f583903a51fde9aed58 /var/lib/tcpx/lib64/libnccl-net.so
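The manual comparison above could be scripted. The following is a minimal sketch, not part of this change: `all_nodes_match` is a hypothetical helper that reads `srun --label md5sum ...` output on stdin and succeeds only if every node reports the expected checksum.

```shell
# Hypothetical helper (not part of this PR): succeeds only if every line of
# `srun --label md5sum ...` output read from stdin carries the expected checksum.
# Each input line is expected to look like: "<task>: <md5> <path>".
all_nodes_match() {
  expected="$1"
  awk -v want="$expected" '
    { seen++; if ($2 != want) bad++ }
    # Exit 0 only when at least one line was read and none mismatched.
    END { exit !(seen > 0 && bad == 0) }
  '
}
```

Example usage, with the post-change checksum taken from the output above:

$ srun -N2 --label md5sum /var/lib/tcpx/lib64/libnccl-net.so | all_nodes_match 293526e53c204f583903a51fde9aed58 && echo "plugin updated on all nodes"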
We observe failures of NCCL plugin installation when the default Docker network profile is used, because that profile fails to bind to a real interface that can route to the instance metadata server. This causes machine-type verification to fail in some cases.
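To reproduce this kind of failure, one could query the metadata server from inside a container under different network modes. This is a rough diagnostic sketch under stated assumptions, not the installer's actual logic: `metadata_machine_type` is a hypothetical helper, the `curlimages/curl` image is assumed to be pullable and to accept curl flags as its arguments, and comparing against `host` networking is an illustration only, since the description above does not state which network mode the fix uses.

```shell
# Hypothetical diagnostic (not part of this PR): ask the GCE metadata server for
# the machine type from inside a container using the given Docker network mode.
# Assumption: the curlimages/curl image is available and runs curl with these flags.
metadata_machine_type() {
  network="${1:-bridge}"   # "bridge" is Docker's default network
  docker run --rm --network="$network" curlimages/curl \
    --silent --fail --max-time 5 \
    --header "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/machine-type"
}
```

Running `metadata_machine_type bridge` and `metadata_machine_type host` on an affected node and comparing the exit codes would show whether the default network can reach the metadata server.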
An additional commit applies shfmt rules throughout the scripts.