Closed parambole closed 11 months ago
When `USE_TCPX` is set to `yes`, the unsupported flags `NCCL_GPUDIRECTTCPX_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME` are set for the SLURM cluster, which causes the training process to fail to start.
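As a rough illustration of the problem described above, the following is a minimal sketch of a workaround that clears the two flags before the training process is launched. It is not the fix applied in this repository: the function name `clear_unsupported_tcpx_flags` is hypothetical, and the sketch assumes the flags arrive as exported environment variables in the job's shell.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): drop the two flags that the
# report above identifies as unsupported when TCPX is enabled.
clear_unsupported_tcpx_flags() {
  # $1 is the value of USE_TCPX; only "yes" triggers the cleanup.
  if [ "${1:-}" = "yes" ]; then
    unset NCCL_GPUDIRECTTCPX_SOCKET_IFNAME
    unset NCCL_SOCKET_IFNAME
  fi
}

# Assumption: USE_TCPX is visible in the job environment; default to "no".
clear_unsupported_tcpx_flags "${USE_TCPX:-no}"
```

Calling the helper in the job script, just before the launch command, would leave every other NCCL setting untouched and remove only the two variables named in this report.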