I am using the `SLURMEnvironment` plugin to resubmit jobs automatically. So far it has worked seamlessly
on my academic cluster, but recently, when the auto-requeue signal is sent, the Python script fails with a multiprocessing error.
It appears that the dataloader workers are not shut down correctly.
Setting `num_workers=0` does not solve the issue; the same problem persists.
I couldn't find anything online that addresses a similar issue, so I'd be glad to hear any tips on how to overcome this.
Thanks!
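For context, here is a minimal sketch of the kind of setup described above, assuming the plugin is passed to the `Trainer` explicitly. The `build_trainer` helper, the `max_epochs` value, and the choice of `SIGUSR1` are illustrative assumptions, not copied from the actual training script:

```python
import signal


def build_trainer(max_epochs: int = 2000):
    """Hypothetical helper: build a Trainer that auto-requeues under SLURM."""
    # Imported lazily so this module can be loaded without Lightning installed.
    from lightning.pytorch import Trainer
    from lightning.pytorch.plugins.environments import SLURMEnvironment

    # auto_requeue=True asks Lightning to save a checkpoint and requeue the job
    # when the requeue signal arrives (SIGUSR1 is assumed here).
    env = SLURMEnvironment(auto_requeue=True, requeue_signal=signal.SIGUSR1)
    return Trainer(max_epochs=max_epochs, plugins=[env])
```

For the handler to fire, the batch script also has to ask SLURM to deliver that signal before the time limit, for example with a line such as `#SBATCH --signal=SIGUSR1@90`; the 90-second lead time is an assumed value.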
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
Epoch 1235: 100
Handling auto-requeue signal: 1
Exception in thread Thread-3:
Traceback (most recent call last):
File "/path/to/env/envs/vhg-torch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/path/to/env/envs/vhg-torch/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
do_one_step()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
fd = df.detach()
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/connection.py", line 502, in Client
c = SocketClient(address)
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/path/to/env/envs/vhg-torch/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pymp-l44wc1y0/listener-ha7esg4_'
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pymp-mou7am87/listener-k3a3wu_k'
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pymp-x0mbsgom/listener-k8nrogyx'
...
Traceback (most recent call last):
File "/my/dir/train_twin.py", line 146, in <module>
main(args)
File "/my/dir/train_twin.py", line 140, in main
trainer.fit(lightning_model, train_dataloader, val_dataloader, ckpt_path="last")
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
call._call_and_handle_interrupt(
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
results = self._run_stage()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
self.fit_loop.run()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
self.advance()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
self.advance(data_fetcher)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance
batch, _, __ = next(data_fetcher)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
batch = super().__next__()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
batch = next(self.iterator)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__
out[i] = next(self.iterators[i])
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data
success, data = self._try_get_data()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/path/to/env/envs/vhg-torch/lib/python3.9/queue.py", line 180, in get
self.not_empty.wait(remaining)
File "/path/to/env/envs/vhg-torch/lib/python3.9/threading.py", line 316, in wait
gotit = waiter.acquire(True, timeout)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/signal_connector.py", line 33, in __call__
signal_handler(signum, frame)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/signal_connector.py", line 75, in _slurm_sigusr_handler_fn
self.trainer.save_checkpoint(hpc_save_path)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1365, in save_checkpoint
self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 490, in save_checkpoint
self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/fabric/plugins/io/torch_io.py", line 58, in save_checkpoint
_atomic_save(checkpoint, path)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/lightning/fabric/utilities/cloud_io.py", line 89, in _atomic_save
with fs.transaction, fs.open(urlpath, "wb") as f:
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/fsspec/spec.py", line 1293, in open
f = self._open(
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/fsspec/implementations/local.py", line 184, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/fsspec/implementations/local.py", line 306, in __init__
self._open()
File "/path/to/env/envs/vhg-torch/lib/python3.9/site-packages/fsspec/implementations/local.py", line 317, in _open
i, name = tempfile.mkstemp()
File "/path/to/env/envs/vhg-torch/lib/python3.9/tempfile.py", line 352, in mkstemp
return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/path/to/env/envs/vhg-torch/lib/python3.9/tempfile.py", line 255, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpuneur_2o'
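The final frames fail inside `tempfile.mkstemp()`, and the finalizer errors point at missing `/tmp/pymp-*` paths. One possibility, which is an assumption and not confirmed, is that the temporary directory is no longer usable by the time the signal handler runs. A small standard-library check like the sketch below could be called before training and again just before checkpointing to test this; `check_tmpdir` is a hypothetical helper name:

```python
import os
import tempfile


def check_tmpdir(path=None):
    """Return (directory, ok); ok is True if a temp file can be created there."""
    # tempfile.gettempdir() caches its answer, so it can keep returning a
    # directory that has since been removed.
    directory = path or tempfile.gettempdir()
    try:
        fd, name = tempfile.mkstemp(dir=directory)
    except OSError:
        # Covers FileNotFoundError (directory gone) and permission problems.
        return directory, False
    os.close(fd)
    os.remove(name)
    return directory, True


if __name__ == "__main__":
    directory, ok = check_tmpdir()
    print(f"temp dir {directory!r} usable: {ok}")
```

If the check reports the directory as unusable at requeue time, pointing `TMPDIR` at a location that outlives the signal would be one thing to try.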
Environment
Current environment
```
* CUDA:
    - GPU:
        - NVIDIA A40
    - available: True
    - version: 12.1
* Lightning:
    - lightning: 2.4.0
    - lightning-utilities: 0.11.7
    - pytorch-lightning: 2.4.0
    - torch: 2.3.1+cu121
    - torchaudio: 2.3.1+cu121
    - torchmetrics: 1.4.1
    - torchvision: 0.18.1+cu121
* Packages:
    - absl-py: 2.1.0
    - accelerate: 0.34.0
    - addict: 2.4.0
    - aiohappyeyeballs: 2.4.0
    - aiohttp: 3.10.5
    - aiosignal: 1.3.1
    - antlr4-python3-runtime: 4.9.3
    - anyio: 4.4.0
    - argon2-cffi: 23.1.0
    - argon2-cffi-bindings: 21.2.0
    - arrow: 1.3.0
    - asttokens: 2.4.1
    - async-lru: 2.0.4
    - async-timeout: 4.0.3
    - attrs: 24.2.0
    - autocommand: 2.2.2
    - babel: 2.16.0
    - backports.tarfile: 1.2.0
    - beautifulsoup4: 4.12.3
    - bleach: 6.1.0
    - blinker: 1.8.2
    - certifi: 2024.8.30
    - cffi: 1.17.1
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - comm: 0.2.2
    - configargparse: 1.7
    - contourpy: 1.3.0
    - cycler: 0.12.1
    - dash: 2.18.0
    - dash-core-components: 2.0.0
    - dash-html-components: 2.0.0
    - dash-table: 5.0.0
    - datasets: 2.21.0
    - debugpy: 1.8.5
    - decorator: 5.1.1
    - defusedxml: 0.7.1
    - diffusers: 0.30.2
    - dill: 0.3.8
    - docker-pycreds: 0.4.0
    - einops: 0.8.0
    - exceptiongroup: 1.2.2
    - executing: 2.1.0
    - fastjsonschema: 2.20.0
    - filelock: 3.13.1
    - flask: 3.0.3
    - fonttools: 4.53.1
    - fqdn: 1.5.1
    - frozenlist: 1.4.1
    - fsspec: 2024.2.0
    - gitdb: 4.0.11
    - gitpython: 3.1.43
    - grpcio: 1.66.1
    - gsplat: 1.3.0
    - h11: 0.14.0
    - h5py: 3.11.0
    - httpcore: 1.0.5
    - httpx: 0.27.2
    - huggingface-hub: 0.24.6
    - idna: 3.8
    - importlib-metadata: 8.4.0
    - importlib-resources: 6.4.4
    - inflect: 7.3.1
    - ipykernel: 6.29.5
    - ipython: 8.18.1
    - ipywidgets: 8.1.5
    - isoduration: 20.11.0
    - itsdangerous: 2.2.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jaxtyping: 0.2.34
    - jedi: 0.19.1
    - jinja2: 3.1.3
    - joblib: 1.4.2
    - json5: 0.9.25
    - jsonpointer: 3.0.0
    - jsonschema: 4.23.0
    - jsonschema-specifications: 2023.12.1
    - jupyter: 1.1.1
    - jupyter-client: 8.6.2
    - jupyter-console: 6.6.3
    - jupyter-core: 5.7.2
    - jupyter-events: 0.10.0
    - jupyter-lsp: 2.2.5
    - jupyter-server: 2.14.2
    - jupyter-server-terminals: 0.5.3
    - jupyterlab: 4.2.5
    - jupyterlab-pygments: 0.3.0
    - jupyterlab-server: 2.27.3
    - jupyterlab-widgets: 3.0.13
    - kiwisolver: 1.4.7
    - lightning: 2.4.0
    - lightning-utilities: 0.11.7
    - markdown: 3.7
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.5
    - matplotlib: 3.9.2
    - matplotlib-inline: 0.1.7
    - mdurl: 0.1.2
    - mistune: 3.0.2
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - multidict: 6.0.5
    - multiprocess: 0.70.16
    - natsort: 8.4.0
    - nbclient: 0.10.0
    - nbconvert: 7.16.4
    - nbformat: 5.10.4
    - nest-asyncio: 1.6.0
    - networkx: 3.2.1
    - ninja: 1.11.1.1
    - notebook: 7.2.2
    - notebook-shim: 0.2.4
    - numpy: 1.26.3
    - nvdiffrast: 0.3.1
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.1.105
    - nvidia-nvtx-cu12: 12.1.105
    - omegaconf: 2.3.0
    - open3d: 0.18.0
    - opencv-python: 4.10.0.84
    - overrides: 7.7.0
    - packaging: 24.1
    - pandas: 2.2.2
    - pandocfilters: 1.5.1
    - parso: 0.8.4
    - peft: 0.12.0
    - pexpect: 4.9.0
    - pillow: 10.2.0
    - pip: 24.2
    - platformdirs: 4.2.2
    - plotly: 5.24.0
    - plyfile: 1.1
    - prometheus-client: 0.20.0
    - prompt-toolkit: 3.0.47
    - protobuf: 3.20.3
    - psutil: 6.0.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.3
    - pyarrow: 17.0.0
    - pycparser: 2.22
    - pygments: 2.18.0
    - pyparsing: 3.1.4
    - pyquaternion: 0.9.9
    - python-dateutil: 2.9.0.post0
    - python-json-logger: 2.0.7
    - pytorch-lightning: 2.4.0
    - pytz: 2024.1
    - pyyaml: 6.0.2
    - pyzmq: 26.2.0
    - referencing: 0.35.1
    - regex: 2024.7.24
    - requests: 2.32.3
    - retrying: 1.3.4
    - rfc3339-validator: 0.1.4
    - rfc3986-validator: 0.1.1
    - rich: 13.8.0
    - roma: 1.5.0
    - rpds-py: 0.20.0
    - safetensors: 0.4.4
    - scikit-learn: 1.5.1
    - scipy: 1.13.1
    - send2trash: 1.8.3
    - sentry-sdk: 2.13.0
    - setproctitle: 1.3.3
    - setuptools: 73.0.1
    - six: 1.16.0
    - smmap: 5.0.1
    - sniffio: 1.3.1
    - soupsieve: 2.6
    - stack-data: 0.6.3
    - sympy: 1.12
    - tenacity: 9.0.0
    - tensorboard: 2.17.1
    - tensorboard-data-server: 0.7.2
    - terminado: 0.18.1
    - threadpoolctl: 3.5.0
    - timm: 1.0.9
    - tinycss2: 1.3.0
    - tokenizers: 0.19.1
    - tomli: 2.0.1
    - torch: 2.3.1+cu121
    - torchaudio: 2.3.1+cu121
    - torchmetrics: 1.4.1
    - torchvision: 0.18.1+cu121
    - tornado: 6.4.1
    - tqdm: 4.66.5
    - traitlets: 5.14.3
    - transformers: 4.44.2
    - trimesh: 4.4.9
    - triton: 2.3.1
    - typeguard: 2.13.3
    - types-python-dateutil: 2.9.0.20240821
    - typing-extensions: 4.9.0
    - tzdata: 2024.1
    - uri-template: 1.3.0
    - urllib3: 2.2.2
    - virtualhumangen: 1.0.0
    - wandb: 0.17.9
    - wcwidth: 0.2.13
    - webcolors: 24.8.0
    - webencodings: 0.5.1
    - websocket-client: 1.8.0
    - werkzeug: 3.0.4
    - wheel: 0.44.0
    - widgetsnbextension: 4.0.13
    - xxhash: 3.5.0
    - yarl: 1.9.11
    - zipp: 3.20.1
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor:
    - python: 3.9.19
    - release: 6.1.107.1.amd64-smp
    - version: #1 SMP PREEMPT_DYNAMIC Mon Sep 2 09:32:21 CEST 2024
```
More info
No response