Lightning-AI / pytorch-lightning

SLURM training: training freezes when using `ddp` and torchdata #17066

Open knoriy opened 1 year ago

knoriy commented 1 year ago

Bug description

Training freezes when using `ddp` on a SLURM cluster (`dp` runs as expected). The dataset is loaded via torchdata from an S3 bucket. Similar behaviour also arises when using webdataset.

Possibly a linked issue: https://github.com/Lightning-AI/lightning/issues/16893#issue-1602261381

Error:

No error is thrown.

UPDATE:

Removing `validation_step` and `test_step` from the `pl.LightningModule` gives us the following:

Epoch 0: : 27it [00:10,  2.57it/s, losTraceback (most recent call last):
  File "/fsx/knoriy/code/deep-learning-project-template/project/lit_td.py", line 152, in <module>
    cli_main()
  File "/fsx/knoriy/code/deep-learning-project-template/project/lit_td.py", line 148, in cli_main
    trainer.fit(model, datamodule=data)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 295, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 304, in on_train_epoch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 363, in _save_topk_checkpoint
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 666, in _save_none_monitor_checkpoint
    filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 621, in _get_metric_interpolated_filepath_name
    while self.file_exists(filepath, trainer) and filepath != del_filepath:
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 729, in file_exists
    return trainer.strategy.broadcast(exists)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 314, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/fsx/home-knoriy/miniconda3/envs/clasp/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2090, in broadcast_object_list
    object_tensor = torch.empty(  # type: ignore[call-overload]
TypeError: empty(): argument 'size' must be tuple of SymInts, but found element of type int at pos 1
srun: error: ip-26-0-130-13: task 1: Exited with exit code 1

How to reproduce the bug

```python
import io
import json
from argparse import ArgumentParser

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

import pytorch_lightning as pl

import torchdata
import soundfile
import librosa
import numpy as np

from typing import Optional

class MyModule(nn.Module):
    '''
    Simple model
    '''
    def __init__(self, hidden_dim) -> None:
        super().__init__()
        self.l1 = torch.nn.Conv1d(80, hidden_dim, 3)

    def forward(self, x):
        return self.l1(x)

class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()

        self.model = MyModule(self.hparams.hidden_dim)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        out = self(batch)
        return F.mse_loss(out, out*2)

    def validation_step(self, batch, batch_idx):
        self(batch)

    def test_step(self, batch, batch_idx):
        self(batch)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    @staticmethod
    def add_model_specific_args(parent_parser):
        parser = ArgumentParser(parents=[parent_parser], add_help=False)
        parser.add_argument('--hidden_dim', type=int, default=128)
        parser.add_argument('--learning_rate', type=float, default=0.0001)
        return parser

class DataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 32, num_workers=0):
        super().__init__()
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self, stage: Optional[str] = None):
        urls = ['s3://<bucket>/1.tar', 's3://<bucket>/2.tar', 's3://<bucket>/3.tar','s3://<bucket>/<n>.tar']
        self.train = self.get_datapipe(urls)
        self.val = self.get_datapipe(urls)
        self.test = self.get_datapipe(urls)

    def to_sampels(self, data):
        a, t = data
        return soundfile.read(io.BytesIO(a[1].read())), json.loads(t[1].read().decode('utf-8'))

    def get_datapipe(self, data_dir):
        datapipe = torchdata.datapipes.iter.IterableWrapper(data_dir)\
            .shuffle()\
            .sharding_filter()\
            .open_files_by_fsspec(mode='rb')\
            .load_from_tar() \
            .batch(2) \
            .map(self.to_sampels)
        return datapipe

    def collate_fn(self, data):
        mels = []
        for (a, _) in data:
            mel = librosa.feature.melspectrogram(y=a[0], sr=a[1], fmin=0, fmax=8000, n_mels=80, n_fft=1024, win_length=1024, hop_length=512)
            mel = librosa.power_to_db(mel, ref=np.max)
            mels.append(torch.tensor(mel, dtype=torch.float32).T)

        mels = pad_sequence(mels).permute(1,2,0)
        return mels

    def train_dataloader(self):
        return DataLoader(self.train, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)

    def val_dataloader(self):
        return DataLoader(self.val, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)

    def test_dataloader(self):
        return DataLoader(self.test, batch_size=self.batch_size, num_workers=self.num_workers, collate_fn=self.collate_fn)

def cli_main():
    pl.seed_everything(1234)

    # ------------
    # args
    # ------------
    parser = ArgumentParser()
    parser.add_argument('--batch_size', default=16, type=int)
    parser.add_argument('--num_workers', default=6, type=int)

    parser = pl.Trainer.add_argparse_args(parser)
    parser = LitClassifier.add_model_specific_args(parser)
    args = parser.parse_args()

    # ------------
    # data
    # ------------
    data = DataModule(num_workers=args.num_workers)

    # ------------
    # model
    # ------------
    model = LitClassifier(args.hidden_dim, args.learning_rate)

    # ------------
    # training
    # ------------
    trainer = pl.Trainer.from_argparse_args(args)
    trainer.fit(model, datamodule=data)

if __name__ == '__main__':
    cli_main()
```

## Sbatch submit.sh

```shell
#!/bin/bash
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-gpu=12
#SBATCH --output=%j.out
#SBATCH --signal=SIGUSR1@90

# debugging flags
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

srun /<user-Home>/miniconda3/envs/<ENV Name>/bin/python project/train.py \
    --max_epochs 3 \
    --accelerator gpu \
    --strategy ddp \
    --num_nodes 1 \
    --devices 2
```

Error messages and logs

[rank: 1] Global seed set to 1234
[rank: 0] Global seed set to 1234
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 1234
[rank: 1] Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
ip-26-0-128-136:3266267:3266267 [0] NCCL INFO Bootstrap : Using ens32:26.0.128.136<0>
ip-26-0-128-136:3266267:3266267 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-26-0-128-136:3266267:3266267 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-26-0-128-136:3266267:3266267 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.14.3+cuda11.7
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO NET/OFI Configuring AWS-specific options
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Using network AWS Libfabric
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Setting affinity for GPU 2 to 3f000000,00003f00
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 00/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 01/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 02/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 03/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 04/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 05/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 06/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 07/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 08/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 09/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 10/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 11/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 12/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 13/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 14/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 15/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 16/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 17/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 18/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 19/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 20/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 21/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 22/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 23/24 :    0   1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 00/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 01/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 02/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 03/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 04/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 05/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 06/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 07/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 08/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 09/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 10/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266268 [1] NCCL INFO cudaDriverVersion 12000
ip-26-0-128-136:3266268:3266268 [1] NCCL INFO Bootstrap : Using ens32:26.0.128.136<0>
ip-26-0-128-136:3266268:3266268 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-26-0-128-136:3266268:3266268 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO NET/OFI Configuring AWS-specific options
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Using network AWS Libfabric
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Setting affinity for GPU 3 to 3f000000,00003f00
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] -1/-1/-1->1->0 [19] -1/-1/-1->1->0 [20] -1/-1/-1->1->0 [21] -1/-1/-1->1->0 [22] -1/-1/-1->1->0 [23] -1/-1/-1->1->0
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 00/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 01/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 02/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 03/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 04/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 05/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 06/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 07/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 08/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 09/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 10/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 11/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 12/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 13/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 14/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 15/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 16/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 17/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 18/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 19/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 20/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 21/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 22/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Channel 23/0 : 1[201d0] -> 0[201c0] via P2P/IPC/read
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Connected all rings
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [2,3]

  | Name  | Type     | Params
-----------------------------------
0 | model | MyModule | 30.8 K
-----------------------------------
30.8 K    Trainable params
0         Non-trainable params
30.8 K    Total params
0.123     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.
SLURM auto-requeueing enabled. Setting signal handlers.
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 11/0 : 0[201c0] -
Sanity Checking: 0it [00:00, ?it/s]
Sanity Checking:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:03<00:03,  3.51s/it]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:03<00:00,  1.76s/it]

Training: 0it [00:00, ?it/s]
Training: 0it [00:00, ?it/s]
Epoch 0: : 0it [00:00, ?it/s]> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 12/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 13/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 14/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 15/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 16/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 17/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 18/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 19/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 20/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 21/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 22/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Channel 23/0 : 0[201c0] -> 1[201d0] via P2P/IPC/read
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Connected all rings
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO Connected all trees
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
ip-26-0-128-136:3266267:3266536 [0] NCCL INFO comm 0x55c843fb21f0 rank 0 nranks 2 cudaDev 0 busId 201c0 - Init COMPLETE
[W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO Connected all trees
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
ip-26-0-128-136:3266268:3266537 [1] NCCL INFO comm 0x562c2f4a3c50 rank 1 nranks 2 cudaDev 1 busId 201d0 - Init COMPLETE
[W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

Epoch 0: : 1it [00:02,  2.99s/it]
Epoch 0: : 1it [00:03,  3.22s/it, loss=511, v_num=49625]
Epoch 0: : 2it [00:03,  1.62s/it, loss=511, v_num=49625]
Epoch 0: : 2it [00:03,  1.82s/it, loss=496, v_num=49625]
Epoch 0: : 3it [00:03,  1.22s/it, loss=496, v_num=49625]
Epoch 0: : 3it [00:04,  1.40s/it, loss=537, v_num=49625]
Epoch 0: : 4it [00:04,  1.06s/it, loss=537, v_num=49625]
Epoch 0: : 4it [00:04,  1.20s/it, loss=548, v_num=49625]
Epoch 0: : 5it [00:04,  1.04it/s, loss=548, v_num=49625]
Epoch 0: : 5it [00:05,  1.10s/it, loss=557, v_num=49625]
Epoch 0: : 6it [00:05,  1.09it/s, loss=557, v_num=49625]
Epoch 0: : 6it [00:06,  1.02s/it, loss=557, v_num=49625]
Epoch 0: : 7it [00:06,  1.15it/s, loss=557, v_num=49625]
Epoch 0: : 7it [00:06,  1.04it/s, loss=553, v_num=49625]
Epoch 0: : 8it [00:06,  1.19it/s, loss=553, v_num=49625]
Epoch 0: : 8it [00:07,  1.04it/s, loss=546, v_num=49625]
Epoch 0: : 9it [00:07,  1.17it/s, loss=546, v_num=49625]
Epoch 0: : 9it [00:08,  1.04it/s, loss=530, v_num=49625]
Epoch 0: : 10it [00:

Environment

Current environment

```
* CUDA:
    - GPU:
        - NVIDIA A100-SXM4-40GB
        - NVIDIA A100-SXM4-40GB
    - available: True
    - version: 11.7
* Lightning:
    - lightning-utilities: 0.8.0
    - pytorch-lightning: 1.9.4
    - torch: 1.13.1
    - torchaudio: 0.12.1
    - torchdata: 0.5.1
    - torchmetrics: 0.9.3
* Packages:
    - absl-py: 1.2.0 - aiobotocore: 2.4.2 - aiohttp: 3.8.3 - aioitertools: 0.11.0 - aiosignal: 1.2.0 - appdirs: 1.4.4 - async-timeout: 4.0.2 - attrs: 22.1.0 - audioread: 3.0.0 - botocore: 1.27.59 - braceexpand: 0.1.7 - cachetools: 5.2.0 - certifi: 2022.9.24 - cffi: 1.15.1 - charset-normalizer: 2.1.1 - click: 8.1.3 - contourpy: 1.0.5 - cycler: 0.11.0 - decorator: 5.1.1 - deepspeed: 0.8.2 - docker-pycreds: 0.4.0 - filelock: 3.8.0 - fonttools: 4.37.4 - frozenlist: 1.3.1 - fsspec: 2023.3.0 - gitdb: 4.0.10 - gitpython: 3.1.31 - google-auth: 2.12.0 - google-auth-oauthlib: 0.4.6 - grpcio: 1.49.1 - hjson: 3.1.0 - huggingface-hub: 0.10.0 - idna: 3.4 - importlib-metadata: 5.0.0 - inflect: 6.0.0 - jmespath: 1.0.1 - joblib: 1.2.0 - kiwisolver: 1.4.4 - librosa: 0.9.2 - lightning-utilities: 0.8.0 - llvmlite: 0.39.1 - markdown: 3.4.1 - markupsafe: 2.1.1 - matplotlib: 3.6.0 - mkl-fft: 1.3.1 - mkl-random: 1.2.2 - mkl-service: 2.4.0 - more-itertools: 8.14.0 - multidict: 6.0.2 - ninja: 1.11.1 - numba: 0.56.2 - numpy: 1.23.1 - nvidia-cublas-cu11: 11.10.3.66 - nvidia-cuda-nvrtc-cu11: 11.7.99 - nvidia-cuda-runtime-cu11: 11.7.99 - nvidia-cudnn-cu11: 8.5.0.96 - oauthlib: 3.2.1 - packaging: 21.3 - pathtools: 0.1.2 - pillow: 9.2.0 - pip: 22.2.2 - pooch: 1.6.0 - portalocker: 2.7.0 - protobuf: 3.19.6 - psutil: 5.9.4 - py-cpuinfo: 9.0.0 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pycparser: 2.21 - pydantic: 1.10.2 - pydeprecate: 0.3.2 - pyparsing: 3.0.9 - python-dateutil: 2.8.2 - pytorch-lightning: 1.9.4 - pyyaml: 6.0 - regex: 2022.9.13 - requests: 2.28.1 - requests-oauthlib: 1.3.1 - resampy: 0.4.2 - rsa: 4.9 - s3fs: 2023.3.0 - scikit-learn: 1.1.2 - scipy: 1.9.1 - sentry-sdk: 1.16.0 - setproctitle: 1.3.2 - setuptools: 59.8.0 - six: 1.16.0 - smmap: 5.0.0 - soundfile: 0.11.0 - tensorboard: 2.10.1 - tensorboard-data-server: 0.6.1 - tensorboard-plugin-wit: 1.8.1 - threadpoolctl: 3.1.0 - tokenizers: 0.12.1 - torch: 1.13.1 - torchaudio: 0.12.1 - torchdata: 0.5.1 - torchmetrics: 0.9.3 - tqdm: 4.64.1 - transformers: 4.22.2 - typing-extensions: 4.3.0 - unidecode: 1.3.6 - urllib3: 1.26.12 - wandb: 0.13.11 - webdataset: 0.2.26 - werkzeug: 2.2.2 - wheel: 0.37.1 - wrapt: 1.15.0 - yarl: 1.8.1 - zipp: 3.8.1
* System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.9.13
    - version: #23~20.04.1-Ubuntu SMP Thu Aug 18 03:20:14 UTC 2022
```

More info

The model is able to finish an epoch when `.sharding_filter()\` (line 51 of the script above) is removed, but this results in undesirable behaviour: with sharding turned off, each worker returns the same batches multiple times.

```python
def _create_pipeline(self, data_dir):
    datapipe = torchdata.datapipes.iter.IterableWrapper(data_dir)\
        .shuffle()\
        .open_files_by_fsspec(mode='rb')\
        .load_from_tar() \
        .batch(2) \
        .map(self.to_sampels)

    return datapipe
```

cc @justusschock @awaelchli

carmocca commented 1 year ago

Torchdata requires extra setup and shutdown calls that Lightning doesn't do for you at the moment: https://github.com/Lightning-AI/lightning/issues/16603. This might be what's causing the issue.
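
For context, here is a minimal sketch (outside Lightning, assuming torchdata's `DataLoader2` API; `run_one_epoch` is a hypothetical helper, not part of the repro) of the kind of setup/shutdown lifecycle referred to above:

```python
from torchdata.dataloader2 import DataLoader2, DistributedReadingService


def run_one_epoch(datapipe):
    # The reading service is what shards the datapipe across the DDP ranks.
    reading_service = DistributedReadingService()
    dataloader = DataLoader2(datapipe, reading_service=reading_service)
    try:
        for batch in dataloader:
            ...  # a training step would consume the batch here
    finally:
        # torchdata expects an explicit shutdown of the reading service;
        # this is the kind of call Lightning does not make automatically.
        dataloader.shutdown()
```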

So using torchdata with Lightning is currently unexplored territory. It would be great if you could find out what's wrong or contribute fixes to the integration.

knoriy commented 1 year ago

Thank you for the comment. I'll have a look into it. If I solve it or find anything meaningful, I'll open a pull request.

aleksmirosh commented 1 year ago

Hi @knoriy, did you solve this? I have had the same issue since March.

carmocca commented 1 year ago

I've seen issues that stem from using datapipes with the old `DataLoader` class. Maybe using `DataLoader2` from torchdata helps.
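
A hedged sketch of what that swap could look like in the repro's DataModule (this assumes the Trainer will simply iterate whatever `train_dataloader` returns, which, per the comments above, is not an officially supported path; `TorchDataModule` is a hypothetical name):

```python
import pytorch_lightning as pl
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService


class TorchDataModule(pl.LightningDataModule):
    """Variant of the repro's DataModule that returns DataLoader2 objects."""

    def __init__(self, datapipe, num_workers: int = 6):
        super().__init__()
        self.datapipe = datapipe  # e.g. the pipeline built by get_datapipe()
        self.num_workers = num_workers

    def train_dataloader(self):
        # MultiProcessingReadingService plays the role of DataLoader's num_workers.
        # Collation has to move into the datapipe (e.g. a .map step), since
        # DataLoader2 takes no collate_fn argument.
        rs = MultiProcessingReadingService(num_workers=self.num_workers)
        return DataLoader2(self.datapipe, reading_service=rs)
```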

aleksmirosh commented 1 year ago

@carmocca For me it crashed randomly after saving a checkpoint: sometimes it crashed, sometimes it didn't.

knoriy commented 1 year ago

```python
.sharding_filter()\
        .open_files_by_fsspec(mode='rb')\
        .load_from_tar() \
```

A workaround that's worked for me is to move `sharding_filter` below `load_from_tar`. It's not ideal, because the data is loaded before it is sharded, but it fixed most of the issues.

Try this:

```python
def _create_pipeline(self, data_dir):
    datapipe = torchdata.datapipes.iter.IterableWrapper(data_dir)\
        .shuffle()\
        .open_files_by_fsspec(mode='rb')\
        .load_from_tar() \
        .sharding_filter() \
        .batch(2) \
        .map(self.to_sampels)

    return datapipe
```

knoriy commented 1 year ago

> I've seen issues that stem from using datapipes with the old `DataLoader` class. Maybe using `DataLoader2` from torchdata helps.

For me, `DataLoader2` causes issues when using reading services; it leads to freezing and worse performance. The classic `DataLoader` worked best for me when using PL and torchdata.

carmocca commented 1 year ago

cc @ejguan

ejguan commented 1 year ago

I think the main problem is unbalanced data sharding across distributed ranks, which causes hanging. You can always attach a fullsync DataPipe at the end of your pipeline.
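
A hedged sketch of what that could look like for the repro's pipeline (assuming the `FullSync` datapipe is available under the `fullsync` functional name in the installed torchdata version; `get_datapipe_with_fullsync` is a hypothetical helper):

```python
import torchdata


def get_datapipe_with_fullsync(urls, to_samples):
    """Same pipeline as the repro, terminated with fullsync."""
    datapipe = (
        torchdata.datapipes.iter.IterableWrapper(urls)
        .shuffle()
        .sharding_filter()
        .open_files_by_fsspec(mode="rb")
        .load_from_tar()
        .batch(2)
        .map(to_samples)
        # FullSync makes every rank wait until all ranks can yield a batch, so no
        # rank hangs alone inside a DDP collective at the end of an epoch.
        .fullsync()
    )
    return datapipe
```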

> For me, `DataLoader2` causes issues when using reading services; it leads to freezing and worse performance. The classic `DataLoader` worked best for me when using PL and torchdata.

Can you please shed more light on this? In theory, and based on our benchmarking, `DataLoader2` should perform better than `DataLoader`.

knoriy commented 1 year ago

> Can you please shed more light on this? In theory, and based on our benchmarking, `DataLoader2` should perform better than `DataLoader`.

Thank you, I'll try adding fullsync with dataloader2.

Feel free to ask about anything I miss here:

The cluster manager is SLURM, using Open MPI; the PL version is 1.9.x. The data is streamed from cloud storage using fsspec. With `DataLoader2` I have used both `DistributedReadingService` and `MultiProcessingReadingService`. I haven't tested these extensively, but from my observations `DataLoader` is about 1.5x to 2x faster, it seems to play better with PL, and scaling is more consistent. Adding more GPUs when using `DataLoader2` was slower for me.
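
For reference, a hedged sketch of composing the two reading services mentioned above (this assumes a torchdata version that ships `SequentialReadingService`, which may be newer than the 0.5.1 listed in the environment here):

```python
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)

# Hypothetical composition: the distributed service shards across ranks first,
# then worker processes split the load within each rank.
mp_rs = MultiProcessingReadingService(num_workers=6)
dist_rs = DistributedReadingService()
reading_service = SequentialReadingService(dist_rs, mp_rs)

# `datapipe` would be the pipeline built by get_datapipe() in the repro above:
# dataloader = DataLoader2(datapipe, reading_service=reading_service)
```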

Other observations:

@ejguan Does the order of reading services matter?

HarmanDotpy commented 1 year ago

I am having an issue with very slow training after something on the cluster I am using was updated, which I am trying to figure out with the admins, but I can see some differences in the logs I am getting.

In particular, I am receiving logs very similar to the ones in this post. My NCCL logs are:

NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
NCCL INFO NET/OFI Configuring AWS-specific options
NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
NCCL INFO NET/OFI Running on p4d.24xlarge platform, NCCL_TOPO_FILE environment variable is already set to /usr/local/cuda-11.3/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)

However, previously I was getting these logs:

NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-11.3/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
NCCL INFO NET/OFI Selected Provider is efa

Can the change in aws-ofi-nccl from 1.4.0aws to 1.5.0aws have caused the issue? Also, what does "(found 4 nics)" in the last line of the new logs mean? It is not present in the old logs.

knoriy commented 1 year ago

Update:

I've been stepping through the PL code; the freeze looks to happen in the `Closure` class (`pytorch_lightning.loops.optimization.automatic.Closure`), more specifically at lines 137 and 141, when `self._result.loss` is accessed.

Further notes and things that may help isolate this issue: logging with `sync_dist=True` also causes the freezing.
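
As an aside, here is a minimal, standalone illustration (not the repro script; it assumes a two-process launch and uses the gloo backend for simplicity) of how uneven sharding across ranks makes a collective call hang, which is consistent with both the `sync_dist=True` observation and the checkpoint broadcast in the traceback above:

```python
# Hypothetical file hang_demo.py; launch with: torchrun --nproc_per_node=2 hang_demo.py
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Simulate unbalanced shards: rank 0 "sees" 3 batches, rank 1 only 2.
    n_batches = 3 if rank == 0 else 2
    for step in range(n_batches):
        t = torch.ones(1)
        # On the third iteration, rank 0 waits here forever because rank 1
        # has already left the loop and never joins the collective.
        dist.all_reduce(t)
        print(f"rank {rank} finished step {step}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```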