Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

torchdata iterator not closed / shutdown not called --> process never exits #17642

Closed · falckt closed this 10 months ago

falckt commented 1 year ago

Bug description

When using torchdata DataPipes with DataLoader2, the iterator is neither closed nor shut down. As a result, the parent process never exits when a parallel reading service is used.

An MWE that never exits is given below.

I believe for the classic PyTorch DataLoader this is handled in https://github.com/Lightning-AI/lightning/blob/682d7ef6b4f8ab412e523f161d95d3e88fbe58cf/src/lightning/pytorch/utilities/combined_loader.py#L317
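
For reference, roughly what that cleanup amounts to for the classic DataLoader (a paraphrased sketch, not the linked source verbatim; the function name is mine, and `_MultiProcessingDataLoaderIter._shutdown_workers` is a private PyTorch API):

from torch.utils.data import DataLoader
from torch.utils.data.dataloader import _MultiProcessingDataLoaderIter

def shutdown_dataloader_workers(dataloader: DataLoader) -> None:
    # If a multiprocessing iterator is still alive, join its worker
    # processes, then drop the reference so it can be garbage collected.
    it = getattr(dataloader, "_iterator", None)
    if isinstance(it, _MultiProcessingDataLoaderIter):
        it._shutdown_workers()  # private API: signals workers to stop and joins them
    dataloader._iterator = None

DataLoader2 never goes through an equivalent path, so its reading-service workers are left running.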

What version are you seeing the problem on?

v2.0

How to reproduce the bug

from typing import Any

import torch
from pytorch_lightning import LightningModule, Trainer
from torchdata import datapipes as dp
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

class DummyModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.mean = torch.nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x - self.mean

    def training_step(self, x):
        # squared distance to the learned mean serves as the loss
        res = self(x)
        return res * res

    def configure_optimizers(self) -> Any:
        return torch.optim.SGD(self.parameters(), 0.1)

model = DummyModule()
# DataLoader2 with a multiprocessing reading service (2 worker processes);
# these workers are what keep the parent process alive after training
dl = DataLoader2(
    dp.iter.IterableWrapper(range(100)).map(torch.tensor),
    reading_service=MultiProcessingReadingService(2),
)

trainer = Trainer(max_epochs=2)
trainer.fit(model, dl)
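
A workaround to try (unverified here, and a comment below reports it may still hang): shut the loader down explicitly once training finishes, via torchdata's `DataLoader2.shutdown()`:

try:
    trainer.fit(model, dl)
finally:
    # stop the reading service so its worker processes no longer
    # keep the parent process alive
    dl.shutdown()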

Error messages and logs

Training runs normally, but the process never exits.

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(

  | Name | Type | Params
------------------------------
------------------------------
1         Trainable params
0         Non-trainable params
1         Total params
0.000     Total estimated model params size (MB)
Epoch 1: : 200it [00:00, 453.68it/s, v_num=7]`Trainer.fit` stopped: `max_epochs=2` reached.
Epoch 1: : 200it [00:00, 451.55it/s, v_num=7]
[rank: 0] Received SIGTERM: 15
[rank: 0] Received SIGTERM: 15

Environment

Test executed in a Docker container (pytorch/pytorch:latest) followed by `pip install lightning`.

Current environment

* CUDA:
    - GPU: None
    - available: False
    - version: 11.7
* Lightning:
    - lightning: 2.0.2
    - lightning-cloud: 0.5.36
    - lightning-utilities: 0.8.0
    - pytorch-lightning: 2.0.2
    - torch: 2.0.1
    - torchaudio: 2.0.2
    - torchdata: 0.6.1
    - torchelastic: 0.2.2
    - torchmetrics: 0.11.4
    - torchtext: 0.15.2
    - torchvision: 0.15.2
* Packages:
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - anyio: 3.6.2
    - arrow: 1.2.3
    - asttokens: 2.0.5
    - astunparse: 1.6.3
    - async-timeout: 4.0.2
    - attrs: 23.1.0
    - backcall: 0.2.0
    - beautifulsoup4: 4.12.2
    - blessed: 1.20.0
    - boltons: 23.0.0
    - brotlipy: 0.7.0
    - certifi: 2023.5.7
    - cffi: 1.15.1
    - chardet: 4.0.0
    - charset-normalizer: 2.0.4
    - click: 8.1.3
    - conda: 23.3.1
    - conda-build: 3.24.0
    - conda-content-trust: 0.1.3
    - conda-package-handling: 2.0.2
    - conda-package-streaming: 0.7.0
    - croniter: 1.3.14
    - cryptography: 39.0.1
    - dateutils: 0.6.12
    - decorator: 5.1.1
    - deepdiff: 6.3.0
    - dnspython: 2.3.0
    - exceptiongroup: 1.1.1
    - executing: 0.8.3
    - expecttest: 0.1.4
    - fastapi: 0.88.0
    - filelock: 3.9.0
    - frozenlist: 1.3.3
    - fsspec: 2023.5.0
    - glob2: 0.7
    - gmpy2: 2.1.2
    - h11: 0.14.0
    - hypothesis: 6.75.2
    - idna: 3.4
    - inquirer: 3.1.3
    - ipython: 8.12.0
    - itsdangerous: 2.1.2
    - jedi: 0.18.1
    - jinja2: 3.1.2
    - jsonpatch: 1.32
    - jsonpointer: 2.1
    - libarchive-c: 2.9
    - lightning: 2.0.2
    - lightning-cloud: 0.5.36
    - lightning-utilities: 0.8.0
    - markdown-it-py: 2.2.0
    - markupsafe: 2.1.1
    - matplotlib-inline: 0.1.6
    - mdurl: 0.1.2
    - mkl-fft: 1.3.6
    - mkl-random: 1.2.2
    - mkl-service: 2.4.0
    - mpmath: 1.3.0
    - multidict: 6.0.4
    - networkx: 3.1
    - numpy: 1.24.3
    - ordered-set: 4.1.0
    - packaging: 23.0
    - parso: 0.8.3
    - pexpect: 4.8.0
    - pickleshare: 0.7.5
    - pillow: 9.4.0
    - pip: 23.0.1
    - pkginfo: 1.9.6
    - pluggy: 1.0.0
    - prompt-toolkit: 3.0.36
    - psutil: 5.9.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.2
    - pycosat: 0.6.4
    - pycparser: 2.21
    - pydantic: 1.10.7
    - pygments: 2.15.1
    - pyjwt: 2.7.0
    - pyopenssl: 23.0.0
    - pysocks: 1.7.1
    - python-dateutil: 2.8.2
    - python-editor: 1.0.4
    - python-etcd: 0.4.5
    - python-multipart: 0.0.6
    - pytorch-lightning: 2.0.2
    - pytz: 2022.7
    - pyyaml: 6.0
    - readchar: 4.0.5
    - requests: 2.29.0
    - rich: 13.3.5
    - ruamel.yaml: 0.17.21
    - ruamel.yaml.clib: 0.2.6
    - setuptools: 65.6.3
    - six: 1.16.0
    - sniffio: 1.3.0
    - sortedcontainers: 2.4.0
    - soupsieve: 2.4
    - stack-data: 0.2.0
    - starlette: 0.22.0
    - starsessions: 1.3.0
    - sympy: 1.12
    - tomli: 2.0.1
    - toolz: 0.12.0
    - torch: 2.0.1
    - torchaudio: 2.0.2
    - torchdata: 0.6.1
    - torchelastic: 0.2.2
    - torchmetrics: 0.11.4
    - torchtext: 0.15.2
    - torchvision: 0.15.2
    - tqdm: 4.65.0
    - traitlets: 5.7.1
    - triton: 2.0.0
    - types-dataclasses: 0.6.6
    - typing-extensions: 4.5.0
    - urllib3: 1.26.15
    - uvicorn: 0.22.0
    - wcwidth: 0.2.5
    - websocket-client: 1.5.1
    - websockets: 11.0.3
    - wheel: 0.38.4
    - yarl: 1.9.2
    - zstandard: 0.19.0
* System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.10.11
    - release: 5.19.0-1024-aws
    - version: #25~22.04.1-Ubuntu SMP Tue Apr 18 23:41:58 UTC 2023

More info

No response

zhengyanhe commented 1 year ago

I have the same issue. Is it related to the DataLoader2 shutdown? I tried shutting it down explicitly, but it still hangs...

class DummyModule(LightningModule):
    def on_train_end(self):
        # shut down both the train and valid DataLoader2 at the end of training
        # (self.train_dl / self.val_dl are assumed to hold references to them)
        self.train_dl.shutdown()
        self.val_dl.shutdown()

# or use a datamodule to wrap the DataLoader2 instances

class MyDataModule(LightningDataModule):
    def teardown(self, stage):
        if stage == "fit":
            # shut down both the train and valid DataLoader2
            self.train_dl.shutdown()
            self.val_dl.shutdown()
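
Another variant (also unverified): register the shutdown with `atexit` so it runs at interpreter exit even if `on_train_end` is never reached. A self-contained sketch, reusing the pipeline from the MWE above:

import atexit

from torchdata import datapipes as dp
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

dl = DataLoader2(
    dp.iter.IterableWrapper(range(100)),
    reading_service=MultiProcessingReadingService(2),
)
# best-effort cleanup at interpreter exit; whether this unblocks the hang
# depends on where the worker join actually blocks
atexit.register(dl.shutdown)
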
awaelchli commented 10 months ago

Closing, since torchdata has stopped development, so it is unlikely that Lightning will add further support here.