NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Cannot use P2P in Azure GPU cluster #1214

Closed daoterog closed 8 months ago

daoterog commented 8 months ago

I am new to NCCL and multi-GPU training. My code ran perfectly on my laptop's GPU (a single RTX 3060), but it runs out of memory when using four GPUs. I think this may be due to a misconfiguration of my GPUs or a misuse of the DDP strategy in Lightning, and I hope someone can help me debug the log messages NCCL is leaving. Since they are very long, I'll paste here just the logs that come from the main rank of the process. I have experienced several different errors that I think are related to memory. These are the ones I can trace back:

OSError: [Errno 28] No space left on device

RuntimeError: cuDNN error: CUDNN_STATUS_ALLOC_FAILED

torch.cuda.OutOfMemoryError: CUDA out of memory.

RuntimeError: DataLoader worker (pid 4748) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
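
Since the last error points at shared memory, here is a minimal check (a sketch, nothing Azure-specific) that I can run inside the job to see how much /dev/shm the container actually has:

import shutil

# Report the size of the container's shared-memory mount. Docker's default
# /dev/shm is only 64 MB, which is easily exhausted by multi-worker
# DataLoaders and NCCL's SHM transport.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")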

The only time it gave a different error is when I manually set NCCL_IB_DISABLE=0. It gave me:

File "/mnt/azureml/cr/j/e01ca930a056451cad891d256ce58f06/exe/wd/models/ssl/monitor_metrics.py", line 44, in rankme
    S = torch.linalg.svdvals(Z)  # pylint: disable=invalid-name, not-callable
RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling `cusolverDnCreate(handle)`. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to try linear algebra operators with other supported backends.
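
The error message itself suggests trying a different linear-algebra backend. If that path turns out to be relevant, the workaround it proposes would look roughly like this (a sketch I have not verified on the cluster):

import torch

# The cusolver error message suggests switching PyTorch's preferred linalg
# backend; "magma" is one of the accepted alternatives to "cusolver".
torch.backends.cuda.preferred_linalg_library("magma")
S = torch.linalg.svdvals(torch.randn(128, 64, device="cuda"))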

As some additional info:

I am running the job on a cluster with four Tesla T4 GPUs; specifically, the cluster is a Standard_NC64as_T4_v3.

I have been using Azure Containers for PyTorch and installing additional dependencies as they recommend. Below is the Dockerfile I have been using to build my environments; I commented out the second base image to avoid posting two different Dockerfiles with the same content.

FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7
# FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.1-cuda12.1

RUN pip install timm
RUN pip install scikit-learn
RUN pip install mlflow    

I checked, and the environment using CUDA 12.1 ships NCCL 2.18.3, while the one using CUDA 11.7 ships NCCL 2.17.1.
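
For reference, this is roughly how I checked the versions from inside each environment (assuming the NCCL bundled with PyTorch is the one actually used):

import torch

# Versions reported by the PyTorch build inside the container.
print("CUDA:", torch.version.cuda)         # e.g. "12.1" or "11.7"
print("NCCL:", torch.cuda.nccl.version())  # e.g. (2, 18, 3)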

Also, I am specifying a distribution when launching the job with the command function; I understand this tells the system to use all four GPUs. Nonetheless, I experienced the same issue even when I didn't specify the distribution in the command.

# Create or update the component
print("Creating job...")
print(job_command)
command_job = command(
    experiment_name="testing-ssl-byol",
    description=description,
    code=str(code_dir),
    environment=enviornment,
    inputs=inputs,
    outputs=outputs,
    command=job_command,
    compute="Testing-GPU-Cluster",
    distribution=MpiDistribution(process_count_per_instance=4),
    environment_variables={"NCCL_DEBUG": "DEBUG", "NCCL_IB_DISABLE": "0"},
    tags={"project": "ssl-research", "job-purpose": "testing"},
)
job = ml_client.jobs.create_or_update(command_job)
print(f"Job created with ID: {job.id}")

Here are the log messages:

[2024-03-05 15:37:45,277] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training
  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.17.0
  warnings.warn("onnxruntime training package info: __version__: %s" % version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 12.2
  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 12020
  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info
  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [12010]
  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)
Global seed set to 42
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[2024-03-05 15:37:54,375] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-05 15:37:54,398] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-05 15:37:54,398] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training
  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.17.0
  warnings.warn("onnxruntime training package info: __version__: %s" % version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 12.2
  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 12020
  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info
  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [12010]
  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)
[rank: 3] Global seed set to 42
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training
  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.17.0
  warnings.warn("onnxruntime training package info: __version__: %s" % version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 12.2
  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 12020
  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info
  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [12010]
  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)
[rank: 1] Global seed set to 42
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training
  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.17.0
  warnings.warn("onnxruntime training package info: __version__: %s" % version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 12.2
  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 12020
  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info
  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [12010]
  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)
[rank: 2] Global seed set to 42
[rank: 2] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
[rank: 3] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[rank: 1] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.18.3+cuda12.1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Plugin Path : /opt/nccl-rdma-sharp-plugins/lib/libnccl-net.so
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P plugin IBext
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NET/IB : No device found.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Using network Socket
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO comm 0x9ffe0160 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 100000 commId 0x1e29799c32293b9b - Init START
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 0(=100000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 0(=100000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 0(=100000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 2(=300000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 2(=300000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 2(=300000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 3(=400000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 3(=400000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:902:2618 [1] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 0(=100000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 0(=100000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 0(=100000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 2(=300000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 2(=300000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 2(=300000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 3(=400000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 3(=400000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NVLS multicast support is not available on dev 0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Channel 00/02 :    0   1   2   3
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Channel 01/02 :    0   1   2   3
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P Chunksize set to 131072
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 3
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 3
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:903:3893 [2] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 1(=200000)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

   | Name               | Type                  | Params
--------------------------------------------------------------
0  | criterion          | BCEWithLogitsLoss     | 0     
1  | backbone           | ResNet                | 11.2 M
2  | classifier         | Linear                | 513   
3  | train_metrics      | ModuleDict            | 0     
4  | val_metrics        | ModuleDict            | 0     
5  | test_metrics       | ModuleDict            | 0     
6  | knn_acc_metric     | WeightedKNNClassifier | 0     
7  | momentum_backbone  | ResNet                | 11.2 M
8  | projector          | Sequential            | 1.6 M 
9  | momentum_projector | Sequential            | 1.6 M 
10 | predictor          | Sequential            | 1.1 M 
--------------------------------------------------------------
13.8 M    Trainable params
12.8 M    Non-trainable params
26.6 M    Total params
53.134    Total estimated model params size (MB)
Number of CPU cores: 32

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Connected all rings
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Connected all trees
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO MSCCL: No external scheduler found, using internal implementation
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO MSCCL: Internal Scheduler will use /usr/lib/x86_64-linux-gnu/msccl-algorithms as algorithm directory and /usr/lib/x86_64-linux-gnu/../share/nccl/msccl-algorithms as share algorithm directory and /usr/share/nccl/msccl-algorithms as package installed share algorithm directory 
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Using MSCCL Algo files from /usr/share/nccl/msccl-algorithms
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO MSCCL: Initialization finished
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO comm 0x9ffe0160 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 100000 commId 0x1e29799c32293b9b - Init COMPLETE

Sanity Checking: 0it [00:00, ?it/s]
Sanity Checking:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:02<00:02,  2.25s/it]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:02<00:00,  1.15s/it]
/opt/conda/envs/ptca/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: The ``compute`` method of metric BinaryRecall was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
  warnings.warn(*args, **kwargs)  # noqa: B028
/opt/conda/envs/ptca/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:539: PossibleUserWarning: It is recommended to use `self.log('val_classif_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/opt/conda/envs/ptca/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1609: PossibleUserWarning: The number of training batches (2) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/4 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/4 [00:00<?, ?it/s] 
  File "/mnt/azureml/cr/j/e01ca930a056451cad891d256ce58f06/exe/wd/main.py", line 336, in <module>
    raise e  # Re-raise the exception to handle it normally or to stop the program.

#####################################################################################################

Different errors are happening here in the middle

#####################################################################################################

e2bd2729de1e4961bccb1c0d6311f3a4000001:904:3903 [3] NCCL INFO [Service thread] Connection closed by localRank 3
e2bd2729de1e4961bccb1c0d6311f3a4000001:902:3902 [1] NCCL INFO [Service thread] Connection closed by localRank 1

Epoch 0:   0%|          | 0/4 [00:06<?, ?it/s]e2bd2729de1e4961bccb1c0d6311f3a4000001:349:3901 [0] NCCL INFO [Service thread] Connection closed by localRank 0
e2bd2729de1e4961bccb1c0d6311f3a4000001:903:3900 [2] NCCL INFO [Service thread] Connection closed by localRank 2
e2bd2729de1e4961bccb1c0d6311f3a4000001:904:904 [3] NCCL INFO MSCCL: Teardown finished
e2bd2729de1e4961bccb1c0d6311f3a4000001:904:904 [3] NCCL INFO comm 0xa015e5a0 rank 3 nranks 4 cudaDev 3 busId 400000 - Abort COMPLETE
e2bd2729de1e4961bccb1c0d6311f3a4000001:902:902 [1] NCCL INFO MSCCL: Teardown finished
e2bd2729de1e4961bccb1c0d6311f3a4000001:902:902 [1] NCCL INFO comm 0xa00b5190 rank 1 nranks 4 cudaDev 1 busId 200000 - Abort COMPLETE
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO MSCCL: Teardown finished
e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO comm 0x9ffe0160 rank 0 nranks 4 cudaDev 0 busId 100000 - Abort COMPLETE
e2bd2729de1e4961bccb1c0d6311f3a4000001:903:903 [2] NCCL INFO MSCCL: Teardown finished
e2bd2729de1e4961bccb1c0d6311f3a4000001:903:903 [2] NCCL INFO comm 0xa00cfe30 rank 2 nranks 4 cudaDev 2 busId 300000 - Abort COMPLETE
sjeaugey commented 8 months ago

Could it be that you're running out of shared memory space in /dev/shm (or didn't provide enough shared memory to the container)?

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#shared-memory

daoterog commented 8 months ago

Thanks for the quick reply @sjeaugey!

I am currently looking into how to change the shared memory of the container (I am using Azure, and it initializes the containers itself). I also noticed that even when I set NCCL_DEBUG=WARN, the NCCL WARN Error: failed to extend /dev/shm/nccl... warning/error never showed up.

Could this still be a shared memory issue even though that warning doesn't appear as described in the documentation?

AddyLaddy commented 8 months ago

We improved the detection of SHM exhaustion and the corresponding WARN message in NCCL 2.19.x.

PhamVietXuan commented 8 months ago

Hello, I am having the same problem when training my model on AzureML. Have you fixed it yet?

daoterog commented 8 months ago

Yes! @PhamVietXuan

It was indeed a shared memory issue, combined with the distribution I was using. I discovered an shm_size parameter of the command function that sets the shared memory size when the Docker container is built. I tried passing --ulimit memlock=-1, as suggested on NVIDIA's troubleshooting page, through the docker_args argument of the command function, but it seems to be blocked. Nonetheless, setting shm_size="64g" (which matches my total GPU memory) and switching from the MPI distribution to the PyTorch distribution did the trick.

I am still not entirely sure what changed when I switched the distribution, but things are running smoothly!

command_job = command(
    experiment_name="testing-ssl-byol",
    description=description,
    code=str(code_dir),
    environment=enviornment,
    inputs=inputs,
    outputs=outputs,
    command=job_command,
    compute="Testing-GPU-Cluster",
    # ---------------- changed lines ----------------
    distribution=PyTorchDistribution(process_count_per_instance=4),
    shm_size="64g",
    # ------------------------------------------------
    tags={"project": "ssl-research", "job-purpose": "testing"},
)
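
For completeness, the snippet above assumes the usual Azure ML SDK v2 imports (shown here as a sketch; objects like ml_client, job_command, and enviornment come from the rest of my script):

# Assumed imports for the job-submission snippets (Azure ML SDK v2).
from azure.ai.ml import MLClient, command
from azure.ai.ml import MpiDistribution, PyTorchDistribution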