Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

SLURM resubmission crashes because of multiprocessing error #20280

Open antonzub99 opened 1 month ago

antonzub99 commented 1 month ago

Bug description

Hello,

I am using the SLURMEnvironment plugin to resubmit jobs automatically. So far it has worked seamlessly on my academic cluster, but recently, when the auto-requeue signal is sent, the Python script fails with a multiprocessing error.

It appears to me that the dataloader workers are not shut down correctly. Setting num_workers=0 does not solve the issue; the same problem persists.

I couldn't really find anything online that addresses a similar issue, so I'd be glad to hear any tips on how to overcome this. Thanks!
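
For readers unfamiliar with the setup, auto-requeue is usually enabled by passing a SLURMEnvironment plugin to the Trainer. The following is a minimal sketch of such a configuration, not the reporter's actual script: the function name `build_trainer`, the `max_epochs` value, and the choice of `SIGUSR1` are assumptions for illustration (the log in this report shows signal number 1, so the reporter's signal may differ).

```python
import signal


def build_trainer(max_epochs: int = 2000):
    """Return a Trainer that requeues the SLURM job when the chosen signal arrives.

    Hypothetical helper: the name and the default max_epochs are illustrative only.
    """
    # Imported lazily so this sketch can be imported without Lightning installed.
    from lightning.pytorch import Trainer
    from lightning.pytorch.plugins.environments import SLURMEnvironment

    # auto_requeue=True is the default; requeue_signal selects the signal to listen for.
    # SIGUSR1 is an assumption here, not taken from the report.
    slurm_env = SLURMEnvironment(auto_requeue=True, requeue_signal=signal.SIGUSR1)
    return Trainer(max_epochs=max_epochs, plugins=[slurm_env])
```

With a configuration like this, the batch script also has to ask SLURM to deliver the signal before the time limit, for example with `#SBATCH --signal=SIGUSR1@90`. On receipt, Lightning saves a checkpoint and requeues the job, which is the step that fails in the traceback below.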

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

Epoch 1235: 100Handling auto-requeue signal: 1
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/path/to/env/envs/vhg-torch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
    fd = df.detach()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pymp-l44wc1y0/listener-ha7esg4_'
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pymp-mou7am87/listener-k3a3wu_k'
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pymp-x0mbsgom/listener-k8nrogyx'
...
Traceback (most recent call last):
  File "/my/dir/train_twin.py", line 146, in <module>
    main(args)
  File "/my/dir/train_twin.py", line 140, in main
    trainer.fit(lightning_model, train_dataloader, val_dataloader, ckpt_path="last")
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance
    batch, _, __ = next(data_fetcher)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__
    out[i] = next(self.iterators[i])
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data
    success, data = self._try_get_data()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/signal_connector.py", line 33, in __call__
    signal_handler(signum, frame)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/signal_connector.py", line 75, in _slurm_sigusr_handler_fn
    self.trainer.save_checkpoint(hpc_save_path)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1365, in save_checkpoint
    self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 490, in save_checkpoint
    self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/fabric/plugins/io/torch_io.py", line 58, in save_checkpoint
    _atomic_save(checkpoint, path)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/fabric/utilities/cloud_io.py", line 89, in _atomic_save
    with fs.transaction, fs.open(urlpath, "wb") as f:
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/fsspec/implementations/local.py", line 184, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/fsspec/implementations/local.py", line 306, in __init__
    self._open()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/fsspec/implementations/local.py", line 317, in _open
    i, name = tempfile.mkstemp()
  File "/path/to/env/envs/vhg-torch/lib/python3.9/tempfile.py", line 352, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/path/to/env/envs/vhg-torch/lib/python3.9/tempfile.py", line 255, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpuneur_2o'
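
Every failure in the log above is a "No such file or directory" error for a path under /tmp, including the final `tempfile.mkstemp()` call made while saving the checkpoint. One possible reading, not confirmed anywhere in this thread, is that the temporary directory is no longer available when the signal handler runs. The sketch below is a diagnostic under that assumption: `probe_tmpdir` and `use_tmpdir` are hypothetical helper names, and the only library calls used are from the Python standard library.

```python
import os
import tempfile


def probe_tmpdir() -> dict:
    """Report whether the directory Python uses for temp files is usable."""
    path = tempfile.gettempdir()
    exists = os.path.isdir(path)
    writable = False
    if exists:
        try:
            # mkstemp is the call that fails at the bottom of the traceback.
            fd, name = tempfile.mkstemp(dir=path)
            os.close(fd)
            os.remove(name)
            writable = True
        except OSError:
            writable = False
    return {"path": path, "exists": exists, "writable": writable}


def use_tmpdir(path: str) -> str:
    """Point tempfile (and future child processes) at `path`, creating it if needed."""
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path
    # tempfile caches its directory; clearing the cache makes gettempdir() re-read TMPDIR.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

Calling `probe_tmpdir()` from a custom signal handler or at the start of the requeue path would show whether /tmp is still present at that moment. If it is not, calling `use_tmpdir(...)` with a directory that outlives the signal, early in the training script and before any dataloader workers start, is one workaround worth trying. Whether it helps on a given cluster is an open question.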

Environment

Current environment

```
* CUDA:
    - GPU:
        - NVIDIA A40
    - available: True
    - version: 12.1
* Lightning:
    - lightning: 2.4.0
    - lightning-utilities: 0.11.7
    - pytorch-lightning: 2.4.0
    - torch: 2.3.1+cu121
    - torchaudio: 2.3.1+cu121
    - torchmetrics: 1.4.1
    - torchvision: 0.18.1+cu121
* Packages:
    - absl-py: 2.1.0
    - accelerate: 0.34.0
    - addict: 2.4.0
    - aiohappyeyeballs: 2.4.0
    - aiohttp: 3.10.5
    - aiosignal: 1.3.1
    - antlr4-python3-runtime: 4.9.3
    - anyio: 4.4.0
    - argon2-cffi: 23.1.0
    - argon2-cffi-bindings: 21.2.0
    - arrow: 1.3.0
    - asttokens: 2.4.1
    - async-lru: 2.0.4
    - async-timeout: 4.0.3
    - attrs: 24.2.0
    - autocommand: 2.2.2
    - babel: 2.16.0
    - backports.tarfile: 1.2.0
    - beautifulsoup4: 4.12.3
    - bleach: 6.1.0
    - blinker: 1.8.2
    - certifi: 2024.8.30
    - cffi: 1.17.1
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - comm: 0.2.2
    - configargparse: 1.7
    - contourpy: 1.3.0
    - cycler: 0.12.1
    - dash: 2.18.0
    - dash-core-components: 2.0.0
    - dash-html-components: 2.0.0
    - dash-table: 5.0.0
    - datasets: 2.21.0
    - debugpy: 1.8.5
    - decorator: 5.1.1
    - defusedxml: 0.7.1
    - diffusers: 0.30.2
    - dill: 0.3.8
    - docker-pycreds: 0.4.0
    - einops: 0.8.0
    - exceptiongroup: 1.2.2
    - executing: 2.1.0
    - fastjsonschema: 2.20.0
    - filelock: 3.13.1
    - flask: 3.0.3
    - fonttools: 4.53.1
    - fqdn: 1.5.1
    - frozenlist: 1.4.1
    - fsspec: 2024.2.0
    - gitdb: 4.0.11
    - gitpython: 3.1.43
    - grpcio: 1.66.1
    - gsplat: 1.3.0
    - h11: 0.14.0
    - h5py: 3.11.0
    - httpcore: 1.0.5
    - httpx: 0.27.2
    - huggingface-hub: 0.24.6
    - idna: 3.8
    - importlib-metadata: 8.4.0
    - importlib-resources: 6.4.4
    - inflect: 7.3.1
    - ipykernel: 6.29.5
    - ipython: 8.18.1
    - ipywidgets: 8.1.5
    - isoduration: 20.11.0
    - itsdangerous: 2.2.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jaxtyping: 0.2.34
    - jedi: 0.19.1
    - jinja2: 3.1.3
    - joblib: 1.4.2
    - json5: 0.9.25
    - jsonpointer: 3.0.0
    - jsonschema: 4.23.0
    - jsonschema-specifications: 2023.12.1
    - jupyter: 1.1.1
    - jupyter-client: 8.6.2
    - jupyter-console: 6.6.3
    - jupyter-core: 5.7.2
    - jupyter-events: 0.10.0
    - jupyter-lsp: 2.2.5
    - jupyter-server: 2.14.2
    - jupyter-server-terminals: 0.5.3
    - jupyterlab: 4.2.5
    - jupyterlab-pygments: 0.3.0
    - jupyterlab-server: 2.27.3
    - jupyterlab-widgets: 3.0.13
    - kiwisolver: 1.4.7
    - lightning: 2.4.0
    - lightning-utilities: 0.11.7
    - markdown: 3.7
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.5
    - matplotlib: 3.9.2
    - matplotlib-inline: 0.1.7
    - mdurl: 0.1.2
    - mistune: 3.0.2
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - multidict: 6.0.5
    - multiprocess: 0.70.16
    - natsort: 8.4.0
    - nbclient: 0.10.0
    - nbconvert: 7.16.4
    - nbformat: 5.10.4
    - nest-asyncio: 1.6.0
    - networkx: 3.2.1
    - ninja: 1.11.1.1
    - notebook: 7.2.2
    - notebook-shim: 0.2.4
    - numpy: 1.26.3
    - nvdiffrast: 0.3.1
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.1.105
    - nvidia-nvtx-cu12: 12.1.105
    - omegaconf: 2.3.0
    - open3d: 0.18.0
    - opencv-python: 4.10.0.84
    - overrides: 7.7.0
    - packaging: 24.1
    - pandas: 2.2.2
    - pandocfilters: 1.5.1
    - parso: 0.8.4
    - peft: 0.12.0
    - pexpect: 4.9.0
    - pillow: 10.2.0
    - pip: 24.2
    - platformdirs: 4.2.2
    - plotly: 5.24.0
    - plyfile: 1.1
    - prometheus-client: 0.20.0
    - prompt-toolkit: 3.0.47
    - protobuf: 3.20.3
    - psutil: 6.0.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.3
    - pyarrow: 17.0.0
    - pycparser: 2.22
    - pygments: 2.18.0
    - pyparsing: 3.1.4
    - pyquaternion: 0.9.9
    - python-dateutil: 2.9.0.post0
    - python-json-logger: 2.0.7
    - pytorch-lightning: 2.4.0
    - pytz: 2024.1
    - pyyaml: 6.0.2
    - pyzmq: 26.2.0
    - referencing: 0.35.1
    - regex: 2024.7.24
    - requests: 2.32.3
    - retrying: 1.3.4
    - rfc3339-validator: 0.1.4
    - rfc3986-validator: 0.1.1
    - rich: 13.8.0
    - roma: 1.5.0
    - rpds-py: 0.20.0
    - safetensors: 0.4.4
    - scikit-learn: 1.5.1
    - scipy: 1.13.1
    - send2trash: 1.8.3
    - sentry-sdk: 2.13.0
    - setproctitle: 1.3.3
    - setuptools: 73.0.1
    - six: 1.16.0
    - smmap: 5.0.1
    - sniffio: 1.3.1
    - soupsieve: 2.6
    - stack-data: 0.6.3
    - sympy: 1.12
    - tenacity: 9.0.0
    - tensorboard: 2.17.1
    - tensorboard-data-server: 0.7.2
    - terminado: 0.18.1
    - threadpoolctl: 3.5.0
    - timm: 1.0.9
    - tinycss2: 1.3.0
    - tokenizers: 0.19.1
    - tomli: 2.0.1
    - torch: 2.3.1+cu121
    - torchaudio: 2.3.1+cu121
    - torchmetrics: 1.4.1
    - torchvision: 0.18.1+cu121
    - tornado: 6.4.1
    - tqdm: 4.66.5
    - traitlets: 5.14.3
    - transformers: 4.44.2
    - trimesh: 4.4.9
    - triton: 2.3.1
    - typeguard: 2.13.3
    - types-python-dateutil: 2.9.0.20240821
    - typing-extensions: 4.9.0
    - tzdata: 2024.1
    - uri-template: 1.3.0
    - urllib3: 2.2.2
    - virtualhumangen: 1.0.0
    - wandb: 0.17.9
    - wcwidth: 0.2.13
    - webcolors: 24.8.0
    - webencodings: 0.5.1
    - websocket-client: 1.8.0
    - werkzeug: 3.0.4
    - wheel: 0.44.0
    - widgetsnbextension: 4.0.13
    - xxhash: 3.5.0
    - yarl: 1.9.11
    - zipp: 3.20.1
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor:
    - python: 3.9.19
    - release: 6.1.107.1.amd64-smp
    - version: #1 SMP PREEMPT_DYNAMIC Mon Sep 2 09:32:21 CEST 2024
```

More info

No response

NiccoloCavagnero commented 5 days ago

Exactly the same error here.

Any updates?