Closed falckt closed 10 months ago
I have the same issue. Is it related to dataloader shutdown? I tried to shut it down, but it still hangs...
from lightning.pytorch import LightningDataModule, LightningModule

class DummyModule(LightningModule):
    def on_train_end(self):
        # shut down both train/valid DataLoader2 instances here
        ...

# or use a datamodule to wrap DataLoader2
class MyDataModule(LightningDataModule):
    def teardown(self, stage):
        if stage == 'fit':
            # shut down both train/valid DataLoader2 instances here
            ...
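The teardown idea above can be sketched without Lightning or torchdata installed. This is only an illustration under stated assumptions: `ShutdownOnTeardown`, `track`, and `FakeLoader` are hypothetical names, and `FakeLoader` stands in for `DataLoader2`, which exposes a `shutdown()` method. In real code the first class would presumably subclass `LightningDataModule`. Whether this resolves the hang is not confirmed in this thread.

```python
class ShutdownOnTeardown:
    """Hypothetical helper: remembers loaders and shuts them down after fit."""

    def __init__(self):
        self._loaders = []

    def track(self, loader):
        # Remember a loader so that teardown() can shut it down later.
        self._loaders.append(loader)
        return loader

    def teardown(self, stage):
        # Mirrors the `if stage == 'fit'` check from the snippet above.
        if stage == "fit":
            for loader in self._loaders:
                loader.shutdown()
            self._loaders.clear()


class FakeLoader:
    """Stand-in for DataLoader2 so the sketch runs without torchdata."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


dm = ShutdownOnTeardown()
train = dm.track(FakeLoader())
valid = dm.track(FakeLoader())
dm.teardown("fit")
print(train.closed, valid.closed)
```

Keeping references to the loaders on the datamodule is the design choice here: the hook can only shut down what it can still reach when it runs.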
Closing, since torchdata has stopped development, and so it is unlikely that Lightning will work on further support here.
Bug description
When using torchdata
the iterator is not closed / not shut down. As a result, the parent process never exits if a parallel reader is used. An MWE that never exits is given below.
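As a generic illustration of this failure mode (not the reporter's MWE, and not torchdata's actual internals), the sketch below uses only the Python standard library. It assumes a hypothetical `ParallelReader` backed by one non-daemon worker thread: until `shutdown()` is called, the worker stays alive, and a live non-daemon worker is the kind of thing that keeps an interpreter from exiting.

```python
import queue
import threading


class ParallelReader:
    """Toy reader backed by one non-daemon worker thread (hypothetical)."""

    _STOP = object()

    def __init__(self):
        self._tasks = queue.Queue()
        # daemon=False: the interpreter waits for this thread at exit.
        self._worker = threading.Thread(target=self._run, daemon=False)
        self._worker.start()

    def _run(self):
        # Block on the queue until the stop sentinel arrives.
        while self._tasks.get() is not self._STOP:
            pass

    def is_running(self):
        return self._worker.is_alive()

    def shutdown(self):
        # Without this call the worker would block forever on the queue.
        self._tasks.put(self._STOP)
        self._worker.join()


reader = ParallelReader()
running_before = reader.is_running()
reader.shutdown()
running_after = reader.is_running()
print(running_before, running_after)
```

If the `reader.shutdown()` line were removed, this script would finish its last statement and then wait indefinitely at exit, which resembles the symptom reported here.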
I believe for the classic PyTorch DataLoader this is handled in
What version are you seeing the problem on?
How to reproduce the bug
Error messages and logs
Training runs normally, but the process never exits.
Test executed in docker container pytorch/pytorch:latest + pip install lightning
Current environment
* CUDA:
    - GPU: None
    - available: False
    - version: 11.7
* Lightning:
    - lightning: 2.0.2
    - lightning-cloud: 0.5.36
    - lightning-utilities: 0.8.0
    - pytorch-lightning: 2.0.2
    - torch: 2.0.1
    - torchaudio: 2.0.2
    - torchdata: 0.6.1
    - torchelastic: 0.2.2
    - torchmetrics: 0.11.4
    - torchtext: 0.15.2
    - torchvision: 0.15.2
* Packages:
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - anyio: 3.6.2
    - arrow: 1.2.3
    - asttokens: 2.0.5
    - astunparse: 1.6.3
    - async-timeout: 4.0.2
    - attrs: 23.1.0
    - backcall: 0.2.0
    - beautifulsoup4: 4.12.2
    - blessed: 1.20.0
    - boltons: 23.0.0
    - brotlipy: 0.7.0
    - certifi: 2023.5.7
    - cffi: 1.15.1
    - chardet: 4.0.0
    - charset-normalizer: 2.0.4
    - click: 8.1.3
    - conda: 23.3.1
    - conda-build: 3.24.0
    - conda-content-trust: 0.1.3
    - conda-package-handling: 2.0.2
    - conda-package-streaming: 0.7.0
    - croniter: 1.3.14
    - cryptography: 39.0.1
    - dateutils: 0.6.12
    - decorator: 5.1.1
    - deepdiff: 6.3.0
    - dnspython: 2.3.0
    - exceptiongroup: 1.1.1
    - executing: 0.8.3
    - expecttest: 0.1.4
    - fastapi: 0.88.0
    - filelock: 3.9.0
    - frozenlist: 1.3.3
    - fsspec: 2023.5.0
    - glob2: 0.7
    - gmpy2: 2.1.2
    - h11: 0.14.0
    - hypothesis: 6.75.2
    - idna: 3.4
    - inquirer: 3.1.3
    - ipython: 8.12.0
    - itsdangerous: 2.1.2
    - jedi: 0.18.1
    - jinja2: 3.1.2
    - jsonpatch: 1.32
    - jsonpointer: 2.1
    - libarchive-c: 2.9
    - lightning: 2.0.2
    - lightning-cloud: 0.5.36
    - lightning-utilities: 0.8.0
    - markdown-it-py: 2.2.0
    - markupsafe: 2.1.1
    - matplotlib-inline: 0.1.6
    - mdurl: 0.1.2
    - mkl-fft: 1.3.6
    - mkl-random: 1.2.2
    - mkl-service: 2.4.0
    - mpmath: 1.3.0
    - multidict: 6.0.4
    - networkx: 3.1
    - numpy: 1.24.3
    - ordered-set: 4.1.0
    - packaging: 23.0
    - parso: 0.8.3
    - pexpect: 4.8.0
    - pickleshare: 0.7.5
    - pillow: 9.4.0
    - pip: 23.0.1
    - pkginfo: 1.9.6
    - pluggy: 1.0.0
    - prompt-toolkit: 3.0.36
    - psutil: 5.9.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.2
    - pycosat: 0.6.4
    - pycparser: 2.21
    - pydantic: 1.10.7
    - pygments: 2.15.1
    - pyjwt: 2.7.0
    - pyopenssl: 23.0.0
    - pysocks: 1.7.1
    - python-dateutil: 2.8.2
    - python-editor: 1.0.4
    - python-etcd: 0.4.5
    - python-multipart: 0.0.6
    - pytorch-lightning: 2.0.2
    - pytz: 2022.7
    - pyyaml: 6.0
    - readchar: 4.0.5
    - requests: 2.29.0
    - rich: 13.3.5
    - ruamel.yaml: 0.17.21
    - ruamel.yaml.clib: 0.2.6
    - setuptools: 65.6.3
    - six: 1.16.0
    - sniffio: 1.3.0
    - sortedcontainers: 2.4.0
    - soupsieve: 2.4
    - stack-data: 0.2.0
    - starlette: 0.22.0
    - starsessions: 1.3.0
    - sympy: 1.12
    - tomli: 2.0.1
    - toolz: 0.12.0
    - torch: 2.0.1
    - torchaudio: 2.0.2
    - torchdata: 0.6.1
    - torchelastic: 0.2.2
    - torchmetrics: 0.11.4
    - torchtext: 0.15.2
    - torchvision: 0.15.2
    - tqdm: 4.65.0
    - traitlets: 5.7.1
    - triton: 2.0.0
    - types-dataclasses: 0.6.6
    - typing-extensions: 4.5.0
    - urllib3: 1.26.15
    - uvicorn: 0.22.0
    - wcwidth: 0.2.5
    - websocket-client: 1.5.1
    - websockets: 11.0.3
    - wheel: 0.38.4
    - yarl: 1.9.2
    - zstandard: 0.19.0
* System:
    - OS: Linux
    - architecture:
        - 64bit
    - processor: x86_64
    - python: 3.10.11
    - release: 5.19.0-1024-aws
    - version: #25~22.04.1-Ubuntu SMP Tue Apr 18 23:41:58 UTC 2023
More info
No response