GoogleCloudPlatform / ai-infra-cluster-provisioning

Apache License 2.0

[P2] Unsupported environment variables are set for SLURM cluster #343

Closed. parambole closed this issue 11 months ago

parambole commented 11 months ago

When USE_TCPX is set to yes, the unsupported environment variables NCCL_GPUDIRECTTCPX_SOCKET_IFNAME and NCCL_SOCKET_IFNAME are set for the SLURM cluster. As a result, the training process fails to start.
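
One possible shape of a fix is sketched below in Python, only to illustrate the idea of dropping the two variables named above before a SLURM job is launched. The function name build_training_env, the cluster_type argument, and the "eth0" values are hypothetical and do not come from this repository; the actual fix may look quite different.

```python
# Hypothetical sketch: only USE_TCPX and the two NCCL variable names come
# from the issue; every other name here is an assumption for illustration.
UNSUPPORTED_ON_SLURM = (
    "NCCL_GPUDIRECTTCPX_SOCKET_IFNAME",
    "NCCL_SOCKET_IFNAME",
)


def build_training_env(base_env, cluster_type):
    """Return a copy of base_env that is safe to pass to the training job.

    On a SLURM cluster the variables listed in UNSUPPORTED_ON_SLURM are
    removed; for any other cluster type the environment is left unchanged.
    """
    env = dict(base_env)  # copy so the caller's mapping is not mutated
    if cluster_type.lower() == "slurm":
        for name in UNSUPPORTED_ON_SLURM:
            env.pop(name, None)  # ignore variables that were never set
    return env


if __name__ == "__main__":
    requested = {
        "USE_TCPX": "yes",
        "NCCL_GPUDIRECTTCPX_SOCKET_IFNAME": "eth0",  # placeholder value
        "NCCL_SOCKET_IFNAME": "eth0",  # placeholder value
    }
    print(sorted(build_training_env(requested, "slurm")))
```

Filtering on a copy keeps the caller's environment intact, so the same mapping could still be reused for cluster types where these variables are supported.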