Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Deadlock on `log_dict` with different keys across ranks #19106

Open meakbiyik opened 10 months ago

meakbiyik commented 10 months ago

Bug description

Logging dictionaries with different keys across ranks leads to NCCL silently dying. The expected behavior is that each key is averaged only across the ranks on which it exists.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

Use multiple GPUs!

import pytorch_lightning as pl
import torch
import random

class SimpleNetwork(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(torch.randn(100, 32), torch.randn(100, 2)),
            batch_size=32,
        )

    # use log_dict to log metrics
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        dict_to_log = {'loss': loss}
        randomly_add_another_metric = random.random()
        if randomly_add_another_metric > 0.5:
            dict_to_log['another_metric'] = torch.tensor(random.random())
        self.log_dict(
            dict_to_log,
            on_step=False,
            on_epoch=True,
            sync_dist=True,
            batch_size=32,
        )
        return loss

def train_network():
    # Initialize the Lightning Trainer
    trainer = pl.Trainer(
        max_epochs=10,
    )

    # Create an instance of your SimpleNetwork
    model = SimpleNetwork()

    # Start the training
    trainer.fit(model)

train_network()
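
For reference, an explicit Trainer configuration matching the 4-GPU run in the logs below (an assumed variant; the script above relies on Lightning auto-selecting all visible GPUs):

trainer = pl.Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=4,  # assumed: one process per GPU, matching the 4-rank logs below
    strategy="ddp",
)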

Error messages and logs

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
/myenv/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:191: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python examples/nccl_bug.py ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[I socket.cpp:452] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:502] [c10d - debug] The server socket is attempting to listen on [::]:32903.
[I socket.cpp:576] [c10d] The server socket has started to listen on [::]:32903.
[I TCPStore.cpp:252] [c10d - debug] The server has started on port = 32903.
[I socket.cpp:686] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 32903).
[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:32903.
[I socket.cpp:849] [c10d] The client socket has connected to [localhost]:32903 on [localhost]:49152.
[I TCPStore.cpp:261] [c10d - debug] TCP client connected to host 127.0.0.1:32903
[I socket.cpp:297] [c10d - debug] The server socket on [::]:32903 has accepted a connection from [localhost]:49152.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[I socket.cpp:686] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 32903).
[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:32903.
[I socket.cpp:297] [c10d - debug] The server socket on [::]:32903 has accepted a connection from [localhost]:49156.
[I socket.cpp:849] [c10d] The client socket has connected to [localhost]:32903 on [localhost]:49156.
[I TCPStore.cpp:261] [c10d - debug] TCP client connected to host 127.0.0.1:32903
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[I socket.cpp:686] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 32903).
[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:32903.
[I socket.cpp:849] [c10d] The client socket has connected to [localhost]:32903 on [localhost]:49168.
[I TCPStore.cpp:261] [c10d - debug] TCP client connected to host 127.0.0.1:32903
[I socket.cpp:297] [c10d - debug] The server socket on [::]:32903 has accepted a connection from [localhost]:49168.
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
[I socket.cpp:686] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 32903).
[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:32903.
[I socket.cpp:849] [c10d] The client socket has connected to [localhost]:32903 on [localhost]:49180.
[I TCPStore.cpp:261] [c10d - debug] TCP client connected to host 127.0.0.1:32903
[I socket.cpp:297] [c10d - debug] The server socket on [::]:32903 has accepted a connection from [localhost]:49180.
[I ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 1, NCCL_ENABLE_TIMING: 1, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, NCCL_DEBUG: INFO, ID=94642481533456
[I ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 1, NCCL_ENABLE_TIMING: 1, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, NCCL_DEBUG: INFO, ID=94638613432384
[I ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 1, NCCL_ENABLE_TIMING: 1, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, NCCL_DEBUG: INFO, ID=94495567931760
[I ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 1, NCCL_ENABLE_TIMING: 1, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, NCCL_DEBUG: INFO, ID=94904165283424
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: INFO
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=BROADCAST, TensorShape=[61], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=BROADCAST, TensorShape=[61], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=BROADCAST, TensorShape=[61], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=BROADCAST, TensorShape=[61], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=2OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=2OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=2OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=2OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=3, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=3, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=3, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=3, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=4, OpType=BROADCAST, TensorShape=[73], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=4, OpType=BROADCAST, TensorShape=[73], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=4, OpType=BROADCAST, TensorShape=[73], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=4, OpType=BROADCAST, TensorShape=[73], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=5OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=5OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=5OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=5OpType=BARRIER)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=6, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=6, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=6, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=6, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=7, OpType=BROADCAST, TensorShape=[6], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=7, OpType=BROADCAST, TensorShape=[6], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=7, OpType=BROADCAST, TensorShape=[6], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=7, OpType=BROADCAST, TensorShape=[6], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=8, OpType=BROADCAST, TensorShape=[66], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=8, OpType=BROADCAST, TensorShape=[66], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=8, OpType=BROADCAST, TensorShape=[66], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=8, OpType=BROADCAST, TensorShape=[66], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I reducer.cpp:127] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I reducer.cpp:127] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I reducer.cpp:127] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I reducer.cpp:127] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I logger.cpp:215] [Rank 3]: DDP Initialized with: 
broadcast_buffers: 1
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 0
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 2
output_device: 3
rank: 3
total_parameter_size_bytes: 264
world_size: 4
backend_name: nccl
bucket_sizes: 264
cuda_visible_devices: 0,1,2,3
device_ids: 3
dtypes: float
master_addr: 127.0.0.1
master_port: 32903
module_name: SimpleNetwork
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: INFO
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

[I logger.cpp:215] [Rank 0]: DDP Initialized with: 
broadcast_buffers: 1
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 0
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 2
output_device: 0
rank: 0
total_parameter_size_bytes: 264
world_size: 4
backend_name: nccl
bucket_sizes: 264
cuda_visible_devices: 0,1,2,3
device_ids: 0
dtypes: float
master_addr: 127.0.0.1
master_port: 32903
module_name: SimpleNetwork
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: INFO
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

[I logger.cpp:215] [Rank 1]: DDP Initialized with: 
broadcast_buffers: 1
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 0
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 2
output_device: 1
rank: 1
total_parameter_size_bytes: 264
world_size: 4
backend_name: nccl
bucket_sizes: 264
cuda_visible_devices: 0,1,2,3
device_ids: 1
dtypes: float
master_addr: 127.0.0.1
master_port: 32903
module_name: SimpleNetwork
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: INFO
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

[I logger.cpp:215] [Rank 2]: DDP Initialized with: 
broadcast_buffers: 1
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 0
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 2
output_device: 2
rank: 2
total_parameter_size_bytes: 264
world_size: 4
backend_name: nccl
bucket_sizes: 264
cuda_visible_devices: 0,1,2,3
device_ids: 2
dtypes: float
master_addr: 127.0.0.1
master_port: 32903
module_name: SimpleNetwork
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: INFO
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=9OpType=BARRIER)

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 66    
---------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=9OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=9OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=9OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=10OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=10OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=10OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=10OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=11OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=11OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=11OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=11OpType=BARRIER)
/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=4` in the `DataLoader` to improve performance.
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=12OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=12OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=12OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=12OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=14OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=14OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=14OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=14OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=15, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=15, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=15, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=15, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=16, OpType=ALLREDUCE, TensorShape=[66], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=16, OpType=ALLREDUCE, TensorShape=[66], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=16, OpType=ALLREDUCE, TensorShape=[66], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=16, OpType=ALLREDUCE, TensorShape=[66], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=18, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=18, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=18, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=18, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=19OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=19OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=19OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=19OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=20, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=20, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=20, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=20, OpType=ALLREDUCE, TensorShape=[], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=21OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=21OpType=BARRIER)
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=21, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=21, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
Traceback (most recent call last):
  File "/mydir/examples/nccl_bug.py", line 59, in <module>
    train_network()
  File "/mydir/examples/nccl_bug.py", line 57, in train_network
    trainer.fit(model)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
Traceback (most recent call last):
  File "/mydir/examples/nccl_bug.py", line 59, in <module>
    train_network()
  File "/mydir/examples/nccl_bug.py", line 57, in train_network
    trainer.fit(model)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
Traceback (most recent call last):
  File "/mydir/examples/nccl_bug.py", line 59, in <module>
    train_network()
  File "/mydir/examples/nccl_bug.py", line 57, in train_network
    trainer.fit(model)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
Traceback (most recent call last):
  File "/mydir/examples/nccl_bug.py", line 59, in <module>
    train_network()
  File "/mydir/examples/nccl_bug.py", line 57, in train_network
    trainer.fit(model)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    call._call_and_handle_interrupt(
    call._call_and_handle_interrupt(
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    call._call_and_handle_interrupt(
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
  File "/myenv/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
  File "/myenv/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
    return function(*args, **kwargs)
    return function(*args, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    return function(*args, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
    self._run(model, ckpt_path=ckpt_path)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    self._run(model, ckpt_path=ckpt_path)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    self._run(model, ckpt_path=ckpt_path)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    results = self._run_stage()
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    results = self._run_stage()
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    results = self._run_stage()
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    self.fit_loop.run()
    self.fit_loop.run()
  File "/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 203, in run
  File "/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 203, in run
    self.fit_loop.run()
  File "/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 203, in run
    self.fit_loop.run()
  File "/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 203, in run
    self.on_advance_end()
  File "/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 374, in on_advance_end
    self.on_advance_end()
    self.on_advance_end()
  File "/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 374, in on_advance_end
  File "/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 374, in on_advance_end
    self.on_advance_end()
  File "/myenv/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 374, in on_advance_end
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 309, in on_train_epoch_end
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 311, in on_train_epoch_end
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 311, in on_train_epoch_end
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 309, in on_train_epoch_end
    monitor_candidates = self._monitor_candidates(trainer)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 646, in _monitor_candidates
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 370, in _save_topk_checkpoint
    monitor_candidates = self._monitor_candidates(trainer)
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 646, in _monitor_candidates
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 370, in _save_topk_checkpoint
    monitor_candidates = deepcopy(trainer.callback_metrics)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1634, in callback_metrics
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 688, in _save_none_monitor_checkpoint
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 688, in _save_none_monitor_checkpoint
    monitor_candidates = deepcopy(trainer.callback_metrics)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1634, in callback_metrics
    filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 639, in _get_metric_interpolated_filepath_name
    return self._logger_connector.callback_metrics
    while self.file_exists(filepath, trainer) and filepath != del_filepath:
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 231, in callback_metrics
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 752, in file_exists
    filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer)
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 639, in _get_metric_interpolated_filepath_name
    return trainer.strategy.broadcast(exists)
  File "/myenv/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 298, in broadcast
    return self._logger_connector.callback_metrics
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 231, in callback_metrics
    while self.file_exists(filepath, trainer) and filepath != del_filepath:
  File "/myenv/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 752, in file_exists
    metrics = self.metrics["callback"]
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    return trainer.strategy.broadcast(exists)
  File "/myenv/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 298, in broadcast
    metrics = self.metrics["callback"]
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/myenv/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return self.trainer._results.metrics(on_step)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 471, in metrics
    return self.trainer._results.metrics(on_step)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 471, in metrics
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/myenv/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/myenv/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2603, in broadcast_object_list
    return func(*args, **kwargs)
  File "/myenv/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2603, in broadcast_object_list
    value = self._get_cache(result_metric, on_step)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
    value = self._get_cache(result_metric, on_step)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
    result_metric.compute()
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
    result_metric.compute()
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 243, in compute
    self._computed = compute(*args, **kwargs)
  File "/myenv/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 243, in compute
    value = self.meta.sync(self.value.clone())  # `clone` because `sync` is in-place
  File "/myenv/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 330, in reduce
    value = self.meta.sync(self.value.clone())  # `clone` because `sync` is in-place
  File "/myenv/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 330, in reduce
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
  File "/myenv/python3.10/site-packages/lightning_fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
  File "/myenv/python3.10/site-packages/lightning_fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/myenv/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/myenv/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/myenv/python3.10/site-packages/lightning_fabric/utilities/distributed.py", line 220, in _sync_ddp
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/myenv/python3.10/site-packages/lightning_fabric/utilities/distributed.py", line 220, in _sync_ddp
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/myenv/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/myenv/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    torch.distributed.barrier(group=group)
    torch.distributed.barrier(group=group)
  File "/myenv/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
  File "/myenv/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    work = default_pg.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=21, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 2 is running collective: CollectiveFingerPrint(SequenceNumber=0OpType=REDUCE).Collectives differ in the following aspects:    Sequence number: 21vs 0  Op type: BROADCASTvs REDUCE  Tensor Tensor shapes: 1vs   Tensor Tensor dtypes: Longvs   Tensor Tensor devices: TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))vs 
    return func(*args, **kwargs)
    return func(*args, **kwargs)
  File "/myenv/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3698, in barrier
  File "/myenv/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3698, in barrier
    work = default_pg.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=21, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 2 is running collective: CollectiveFingerPrint(SequenceNumber=0OpType=REDUCE).Collectives differ in the following aspects:    Sequence number: 21vs 0  Op type: BROADCASTvs REDUCE  Tensor Tensor shapes: 1vs   Tensor Tensor dtypes: Longvs   Tensor Tensor devices: TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))vs 
    work = group.barrier(opts=opts)
    work = group.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running collective: CollectiveFingerPrint(SequenceNumber=21OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=0OpType=GATHER).Collectives differ in the following aspects:      Sequence number: 21vs 0  Op type: BARRIERvs GATHER
RuntimeError: Detected mismatch between collectives on ranks. Rank 2 is running collective: CollectiveFingerPrint(SequenceNumber=21OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=0OpType=GATHER).Collectives differ in the following aspects:      Sequence number: 21vs 0  Op type: BARRIERvs GATHER
[I ProcessGroupNCCL.cpp:874] [Rank 3] Destroyed 1communicators on CUDA device 3
[I ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
[I ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0
[I ProcessGroupNCCL.cpp:874] [Rank 2] Destroyed 1communicators on CUDA device 2

Environment

Current environment * CUDA: - GPU: None - available: False - version: 11.8 * Lightning: - lightning: 2.0.7 - lightning-cloud: 0.5.46 - lightning-utilities: 0.9.0 - pytorch-lightning: 2.1.0 - torch: 2.1.0+cu118 - torchcache: 0.3.2 - torchmetrics: 1.2.0 - torchvision: 0.16.0+cu118 * Packages: - affine: 2.4.0 - aiohttp: 3.8.6 - aiosignal: 1.3.1 - annotated-types: 0.6.0 - anyio: 3.7.1 - appdirs: 1.4.4 - argon2-cffi: 23.1.0 - argon2-cffi-bindings: 21.2.0 - arrow: 1.3.0 - asttokens: 2.4.1 - async-timeout: 4.0.3 - attrs: 23.1.0 - av: 10.0.0 - backoff: 2.2.1 - beautifulsoup4: 4.12.2 - black: 23.10.1 - bleach: 6.1.0 - blessed: 1.20.0 - boto3: 1.28.75 - botocore: 1.31.75 - brotli: 1.1.0 - certifi: 2023.7.22 - cffi: 1.16.0 - cfgv: 3.4.0 - charset-normalizer: 3.3.1 - click: 8.1.7 - click-plugins: 1.1.1 - cligj: 0.7.2 - comm: 0.1.4 - contextily: 1.4.0 - contourpy: 1.1.1 - coverage: 7.3.2 - croniter: 1.4.1 - csaps: 1.1.0 - cycler: 0.12.1 - dateutils: 0.6.12 - debugpy: 1.8.0 - decorator: 5.1.1 - deepdiff: 6.6.1 - defusedxml: 0.7.1 - distlib: 0.3.7 - docker-pycreds: 0.4.0 - einops: 0.6.1 - entrypoints: 0.4 - exceptiongroup: 1.1.3 - executing: 2.0.1 - fastapi: 0.104.1 - fastjsonschema: 2.18.1 - filelock: 3.13.1 - fiona: 1.9.5 - flake8: 6.1.0 - fonttools: 4.43.1 - fqdn: 1.5.1 - frechetdist: 0.6 - frozenlist: 1.4.0 - fsspec: 2023.10.0 - geographiclib: 2.0 - geopandas: 0.14.0 - geopy: 2.4.0 - gitdb: 4.0.11 - gitpython: 3.1.40 - gopro2gpx: 0.1 - gvtnet: 0.1.0 - h11: 0.14.0 - huggingface-hub: 0.18.0 - identify: 2.5.31 - idna: 3.4 - iniconfig: 2.0.0 - inquirer: 3.1.3 - ipykernel: 6.26.0 - ipython: 8.17.2 - ipython-genutils: 0.2.0 - isoduration: 20.11.0 - isort: 5.12.0 - itsdangerous: 2.1.2 - jedi: 0.19.1 - jinja2: 3.1.2 - jmespath: 1.0.1 - joblib: 1.3.2 - jsonpointer: 2.4 - jsonschema: 4.19.2 - jsonschema-specifications: 2023.7.1 - jupyter-client: 8.5.0 - jupyter-core: 5.5.0 - jupyter-events: 0.8.0 - jupyter-server: 2.9.1 - jupyter-server-terminals: 0.4.4 - jupyterlab-pygments: 0.2.2 - kiwisolver: 1.4.5 - kornia: 0.6.12 - lightning: 2.0.7 - lightning-cloud: 0.5.46 - lightning-utilities: 0.9.0 - markdown-it-py: 3.0.0 - markupsafe: 2.1.3 - matplotlib: 3.8.0 - matplotlib-inline: 0.1.6 - mccabe: 0.7.0 - mdurl: 0.1.2 - memray: 1.10.0 - mercantile: 1.2.1 - mistune: 3.0.2 - mpmath: 1.3.0 - msgpack: 1.0.7 - multidict: 6.0.4 - mypy-extensions: 1.0.0 - natsort: 8.4.0 - nbclassic: 1.0.0 - nbclient: 0.8.0 - nbconvert: 7.10.0 - nbformat: 5.9.2 - nest-asyncio: 1.5.8 - networkx: 3.2.1 - nodeenv: 1.8.0 - notebook: 6.5.4 - notebook-shim: 0.2.3 - numpy: 1.26.1 - opencv-python-headless: 4.8.1.78 - ordered-set: 4.1.0 - osmnx: 1.7.1 - overrides: 7.4.0 - packaging: 23.2 - pandas: 1.5.3 - pandocfilters: 1.5.0 - parso: 0.8.3 - pathspec: 0.11.2 - pathtools: 0.1.2 - patsy: 0.5.3 - pexpect: 4.8.0 - pillow: 10.1.0 - pip: 23.1.1 - platformdirs: 3.11.0 - pluggy: 1.3.0 - pre-commit: 3.5.0 - prometheus-client: 0.18.0 - prompt-toolkit: 3.0.39 - protobuf: 4.24.4 - psutil: 5.9.6 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - py-spy: 0.3.14 - pycodestyle: 2.11.1 - pycparser: 2.21 - pydantic: 2.1.1 - pydantic-core: 2.4.0 - pyflakes: 3.1.0 - pygments: 2.16.1 - pyjwt: 2.8.0 - pyparsing: 3.1.1 - pyproj: 3.6.1 - pytest: 7.4.3 - pytest-cov: 4.1.0 - pytest-datadir: 1.5.0 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-json-logger: 2.0.7 - python-multipart: 0.0.6 - pytorch-lightning: 2.1.0 - pytz: 2023.3.post1 - pyyaml: 6.0.1 - pyzmq: 25.1.1 - rasterio: 1.3.9 - readchar: 4.0.5 - referencing: 0.30.2 - requests: 2.31.0 - rfc3339-validator: 0.1.4 - 
rfc3986-validator: 0.1.1 - rich: 13.6.0 - rpds-py: 0.10.6 - s3transfer: 0.7.0 - safetensors: 0.4.0 - scipy: 1.11.3 - seaborn: 0.12.2 - segment-anything: 1.0 - send2trash: 1.8.2 - sentry-sdk: 1.33.1 - setproctitle: 1.3.3 - setuptools: 68.2.2 - setuptools-scm: 8.0.4 - shapely: 2.0.2 - six: 1.16.0 - smmap: 5.0.1 - sniffio: 1.3.0 - snuggs: 1.4.7 - soupsieve: 2.5 - stack-data: 0.6.3 - starlette: 0.27.0 - starsessions: 1.3.0 - statsmodels: 0.14.0 - sympy: 1.12 - terminado: 0.17.1 - timm: 0.9.8 - tinycss2: 1.2.1 - tomli: 2.0.1 - torch: 2.1.0+cu118 - torchcache: 0.3.2 - torchmetrics: 1.2.0 - torchvision: 0.16.0+cu118 - tornado: 6.3.3 - tqdm: 4.66.1 - traitlets: 5.13.0 - triton: 2.1.0 - types-python-dateutil: 2.8.19.14 - typing-extensions: 4.8.0 - uri-template: 1.3.0 - urllib3: 2.0.7 - uvicorn: 0.23.2 - virtualenv: 20.24.6 - wandb: 0.15.12 - wcwidth: 0.2.9 - webcolors: 1.13 - webencodings: 0.5.1 - websocket-client: 1.6.4 - websockets: 12.0 - wheel: 0.40.0 - xyzservices: 2023.10.1 - yarl: 1.9.2 - zstd: 1.5.5.1 * System: - OS: Linux - architecture: - 64bit - ELF - processor: - python: 3.10.3 - release: 4.19.0-25-amd64 - version: #1 SMP Debian 4.19.289-2 (2023-08-08)

More info

No response

awaelchli commented 10 months ago

@meakbiyik The unbalanced reduction of metrics is generally not supported anywhere. Implementing this would be quite challenging. What is the real use case for this?

We definitely expect the user to supply a value on every rank; that's the current contract.

meakbiyik commented 10 months ago

Hi @awaelchli. My use case was to report a metric that is only valid for a subset of samples, e.g., according to some binning strategy. I would consider this an important use case, since a researcher might want to track how a model behaves on "hard" or "easy" examples according to some criterion, which can leave some buckets empty in different batches on different devices.
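
As a hypothetical illustration (my wording, not actual project code), the training_step of the SimpleNetwork above could look like this for such a use case:

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        dict_to_log = {"loss": loss}
        # assumed binning criterion: samples with a large error count as "hard"
        hard_mask = (y - y_hat).abs().mean(dim=1) > 1.0
        if hard_mask.any():
            dict_to_log["hard_mse"] = torch.nn.functional.mse_loss(y_hat[hard_mask], y[hard_mask])
        # On ranks/batches where the bucket is empty, the "hard_mse" key is missing,
        # which is exactly the situation that hangs the synchronization above.
        self.log_dict(dict_to_log, on_step=False, on_epoch=True, sync_dist=True, batch_size=32)
        return loss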

However, looking at the code, I note that the dict is not sent between devices as a whole; rather, self.log is called for each entry in the dict: https://github.com/Lightning-AI/pytorch-lightning/blob/7d04de697e6e2fa3705c45b15c1efb6ed9745475/src/lightning/pytorch/core/module.py#L585-L600
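
In other words, a rough paraphrase (not the verbatim implementation) of what the linked lines do:

# Paraphrase of the linked code: every key/value pair is logged independently,
# so each key later gets its own epoch-end reduction and cross-rank sync.
def log_dict(self, dictionary, **kwargs):
    for name, value in dictionary.items():
        self.log(name=name, value=value, **kwargs)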

I also found a workaround (though it is not straightforward in any way): one can define a new torchmetrics metric inside the LightningModule:

    # these methods go inside the LightningModule (e.g. the SimpleNetwork above);
    # requires `import torchmetrics`
    def __init__(self):
        ...
        self.nan_metric = torchmetrics.MeanMetric()

    def on_train_epoch_end(self):
        # reset the aggregated metric at the end of every epoch
        self.nan_metric.reset()

And log a NaN value when an empty tensor is encountered:

        # inside training_step, replacing the conditional from the reproduction above
        ...
        if randomly_add_another_metric > 0.5:
            batch_value = self.nan_metric(random.random())
        else:
            # no real value on this rank for this batch: feed a NaN placeholder
            # so the 'another_metric' key still exists on every rank and the
            # collectives stay balanced
            batch_value = self.nan_metric(float('nan'))
        dict_to_log['another_metric'] = batch_value
        self.log_dict(
            dict_to_log,
            on_step=False,
            on_epoch=True,
            sync_dist=True,
            batch_size=32,
        )

There are two issues here:

  1. Lightning does not raise a proper rank-zero error when there is an unbalanced reduction of metrics: I only got the error log above after enabling finer error reporting from NCCL; otherwise the training just hangs.
  2. Lightning should either natively support NaN-aware reductions, e.g. reduce_fx="nanmean", or document this alternative very clearly in the docs for self.log.

If this seems valid, I can give it a shot as well.

awaelchli commented 10 months ago

However, looking at the code, I note that the dict is not sent between devices as a whole; rather, self.log is called for each entry in the dict

The choices here are deliberate; it is by design.

Lightning does not raise a proper rank-zero error when there is an unbalanced reduction of metrics

It is intentionally like this. Making an explicit check here would require a costly synchronization and eliminate all benefits of the logging system.

The nan-reduction types can be implemented, but we should be careful not to overload the reduce_fx functionality here unless it is absolutely needed. The intention here is that for non-trivial metrics and reductions, the user would reach for torchmetrics, which is the recommended way to handle metrics in a distributed fashion.
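
For reference, a minimal sketch of that torchmetrics-based pattern (assuming a MeanMetric over a hypothetical "hard" bucket, as in the illustration earlier in the thread; this is not code from Lightning itself): the Metric object is logged on every rank and every step, but only updated when the rank actually has samples, so the epoch-end sync stays balanced.

import pytorch_lightning as pl
import torch
import torchmetrics

class BucketedNetwork(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.hard_mse = torchmetrics.MeanMetric()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = torch.nn.functional.mse_loss(y_hat, y)
        # Update the metric only when this rank has samples in the bucket ...
        hard_mask = (y - y_hat).abs().mean(dim=1) > 1.0  # assumed binning criterion
        if hard_mask.any():
            self.hard_mse.update(torch.nn.functional.mse_loss(y_hat[hard_mask], y[hard_mask]))
        # ... but log the Metric object unconditionally on every rank, so the
        # epoch-end compute() and its distributed state sync are issued uniformly.
        self.log("hard_mse", self.hard_mse, on_step=False, on_epoch=True)
        self.log("loss", loss, on_step=False, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)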

meakbiyik commented 10 months ago

The dict logging design is completely valid, of course, since aggregation is handled later per key, which implies that an underlying per-key map already exists for this. However, regarding that second point:

It is intentionally like this. Making an explicit check here would require a costly synchronization and eliminate all benefits of the logging system.

I am not sure about this one. Such an error only happens when sync_dist is used, so it would occur during a synchronization anyway; or do I misunderstand it?

A relatively simple solution for this issue would be to detect whether one of the ranks does not log a particular metric during the synchronization (triggered by sync_dist), and raise a rank-zero error asking the user to use torchmetrics and log a NaN instead.
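
A minimal sketch of what such a check could look like (a hypothetical helper, not existing Lightning code; it adds one extra collective per sync, which is the cost mentioned above):

import torch.distributed as dist

def check_logged_keys_match(keys, group=None):
    """Hypothetical helper: verify every rank is about to sync the same set of
    logged metric keys, and fail loudly (instead of hanging) if they differ."""
    gathered = [None] * dist.get_world_size(group)
    dist.all_gather_object(gathered, sorted(keys), group=group)  # extra collective
    if any(other != gathered[0] for other in gathered[1:]):
        raise RuntimeError(
            f"Logged metric keys differ across ranks: {gathered}. "
            "Log every key on every rank, or use a torchmetrics metric "
            "and log a NaN placeholder instead."
        )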