Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Lightning v2.1.1 and above raises MultiProcessing RuntimeError: DataLoader worker (pid #) is killed by signal: Aborted during training #19302

Closed: jponnetCytomine closed this 9 months ago

jponnetCytomine commented 9 months ago

Bug description

When I have lightning v2.1.1 installed and try to train a RetinaNet model, the following error is raised:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 801) is killed by signal: Aborted. 

I am using a Docker environment based on pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel with the flag --ipc=host, and I checked that I have enough shared memory (32 GB in /dev/shm); it never saturates. I cannot set the number of workers to 0, as training would then take far too long.
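For illustration, a minimal sketch of the kind of multi-worker DataLoader setup involved (the dataset, batch size, and worker count are placeholders, not my actual training code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the real detection dataset.
train_dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 2, (64,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    num_workers=4,   # worker subprocesses; num_workers=0 avoids multiprocessing but is far too slow here
    pin_memory=True,
)
```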

After hours of research, I finally found a solution: downgrading the lightning version to v2.1.0. Here is the environment that is not working:

Package                   Version
------------------------- ---------
absl-py                   2.0.0
aiohttp                   3.9.1
aiosignal                 1.3.1
alembic                   1.13.1
astroid                   3.0.2
asttokens                 2.0.5
astunparse                1.6.3
async-timeout             4.0.3
attrs                     23.1.0
backcall                  0.2.0
beautifulsoup4            4.12.2
blinker                   1.7.0
boltons                   23.0.0
brotlipy                  0.7.0
cachetools                5.3.2
certifi                   2023.5.7
cffi                      1.15.1
cfgv                      3.4.0
chardet                   4.0.0
charset-normalizer        2.0.4
click                     8.1.7
cloudpickle               2.2.1
conda                     23.3.1
conda-build               3.24.0
conda-content-trust       0.1.3
conda-package-handling    2.0.2
conda_package_streaming   0.7.0
contourpy                 1.2.0
cryptography              39.0.1
cycler                    0.12.1
databricks-cli            0.18.0
decorator                 5.1.1
dill                      0.3.7
distlib                   0.3.8
dnspython                 2.3.0
docker                    5.0.3
entrypoints               0.4
exceptiongroup            1.1.1
executing                 0.8.3
expecttest                0.1.4
filelock                  3.13.1
Flask                     2.3.3
fonttools                 4.47.2
frozenlist                1.4.1
fsspec                    2023.12.2
gitdb                     4.0.11
GitPython                 3.1.41
glob2                     0.7
gmpy2                     2.1.2
google-auth               2.26.2
google-auth-oauthlib      1.2.0
greenlet                  3.0.3
grpcio                    1.60.0
gunicorn                  20.1.0
hypothesis                6.75.2
identify                  2.5.33
idna                      3.4
importlib-metadata        4.13.0
ipython                   8.12.0
isort                     5.13.2
itsdangerous              2.1.2
jedi                      0.18.1
Jinja2                    3.1.2
jsonpatch                 1.32
jsonpointer               2.1
kiwisolver                1.4.5
libarchive-c              2.9
lightning                 2.1.2
lightning-utilities       0.10.0
Mako                      1.3.0
Markdown                  3.5.2
MarkupSafe                2.1.1
matplotlib                3.8.2
matplotlib-inline         0.1.6
mccabe                    0.7.0
mkl-fft                   1.3.6
mkl-random                1.2.2
mkl-service               2.4.0
mlflow                    1.28.0
mpmath                    1.3.0
multidict                 6.0.4
networkx                  2.8.4
nodeenv                   1.8.0
numpy                     1.24.3
oauthlib                  3.2.2
packaging                 21.3
pandas                    1.5.3
parso                     0.8.3
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    9.4.0
pip                       23.3.1
pkginfo                   1.9.6
platformdirs              4.1.0
pluggy                    1.0.0
pre-commit                3.6.0
prometheus-client         0.19.0
prometheus-flask-exporter 0.23.0
prompt-toolkit            3.0.36
protobuf                  4.23.4
psutil                    5.9.0
ptyprocess                0.7.0
pure-eval                 0.2.2
pyasn1                    0.5.1
pyasn1-modules            0.3.0
pycosat                   0.6.4
pycparser                 2.21
Pygments                  2.15.1
PyJWT                     2.8.0
pylint                    3.0.3
pyOpenSSL                 23.0.0
pyparsing                 3.1.1
PySocks                   1.7.1
python-dateutil           2.8.2
python-dotenv             1.0.0
python-etcd               0.4.5
pytorch-lightning         2.1.3
pytz                      2022.7
PyYAML                    6.0
querystring-parser        1.2.4
requests                  2.29.0
requests-oauthlib         1.3.1
rsa                       4.9
ruamel.yaml               0.17.21
ruamel.yaml.clib          0.2.6
scipy                     1.11.4
setuptools                65.6.3
shapely                   2.0.2
six                       1.16.0
smmap                     5.0.1
sortedcontainers          2.4.0
soupsieve                 2.4
SQLAlchemy                1.4.51
sqlparse                  0.4.4
stack-data                0.2.0
sympy                     1.11.1
tabulate                  0.9.0
tensorboard               2.15.1
tensorboard-data-server   0.7.2
tomli                     2.0.1
tomlkit                   0.12.3
toolz                     0.12.0
torch                     2.0.1
torchaudio                2.0.2
torchdata                 0.6.1
torchelastic              0.2.2
torchmetrics              1.3.0
torchtext                 0.15.2
torchvision               0.15.2
tqdm                      4.65.0
traitlets                 5.7.1
triton                    2.0.0
types-dataclasses         0.6.6
typing_extensions         4.5.0
urllib3                   1.26.15
virtualenv                20.25.0
wcwidth                   0.2.5
websocket-client          1.7.0
Werkzeug                  3.0.1
wheel                     0.38.4
yarl                      1.9.4
zipp                      3.17.0
zstandard                 0.19.0

If I run pip install lightning==2.1.0, this error is no longer raised and I can train my model. Here is how I run my Docker container: docker run --gpus "all" --ipc=host --ulimit memlock=-1 --rm -it my_image:tag

What version are you seeing the problem on?

v2.1, master

How to reproduce the bug

No response

Error messages and logs


Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 801) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/wp2/src/train.py", line 277, in <module>
    main(sys.argv[1:])
  File "/workspace/wp2/src/train.py", line 272, in main
    trainer.fit(retinanet, train_dataloaders=train_loader, val_dataloaders=val_loaders)
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
  File "/opt/conda/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1284, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 801) exited unexpectedly
Exception in thread Thread-8 (_pin_memory_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/opt/conda/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 508, in Client
    answer_challenge(c, authkey)
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 752, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
^CException ignored in atexit callback: <function _exit_function at 0x7efdf85056c0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): LightningApp
#- PyTorch Lightning Version (e.g., 1.5.0): 2.1.3
#- Lightning App Version (e.g., 0.5.2): 2.1.2
#- PyTorch Version (e.g., 2.0): 2.0.1
#- Python version (e.g., 3.9): 3.10
#- OS (e.g., Linux): Linux Ubuntu
#- How you installed Lightning (`conda`, `pip`, source): pip lightning==2.1.2
#- Running environment of LightningApp (e.g. local, cloud): docker env based on `pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel`
```

More info

No response

cc @justusschock @awaelchli

awaelchli commented 9 months ago

@jponnetCytomine I can't find any commits related to dataloading in Lightning Trainer for 2.1.1.

"Connection reset by peer" means the dataloader worker was killed by something external. Have you allocated enough shared memory in your Docker container? I suspect that "downgrading to 2.1.0 is a solution" might just be a red herring.

jponnetCytomine commented 9 months ago

@awaelchli Yes, I even tried with 256 GB of shared memory and it was still not working. Here is what I get inside the container when running df -h with 32 GB of shared memory from --ipc=host:

root@14f601271aa6:/workspace# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         491G  132G  334G  29% /
tmpfs            64M     0   64M   0% /dev
/dev/vda1       491G  132G  334G  29% /workspace
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            16G   12K   16G   1% /proc/driver/nvidia
udev             16G     0   16G   0% /dev/nvidia0
tmpfs            16G     0   16G   0% /proc/asound
tmpfs            16G     0   16G   0% /proc/acpi

Note that I also tried running my docker run command with --shm-size instead of --ipc, but it did not fix anything.

awaelchli commented 9 months ago

Maybe you are hitting this ominous bug in PyTorch that nobody ever knew how to resolve: https://github.com/Lightning-AI/torchmetrics/issues/1560.

At this point I can only guess. So maybe try a few things like setting persistent_workers=True/False and pin_memory=True/False.
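A minimal sketch of what toggling those flags looks like on the DataLoader (the dataset, batch size, and worker count are placeholders, not your actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 224, 224))  # placeholder data

# Try the combinations of these two flags to see which one triggers the crash.
loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=4,
    persistent_workers=True,   # keep worker processes alive between epochs
    pin_memory=True,           # copy batches into page-locked (pinned) host memory
)
```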

awaelchli commented 9 months ago

Did changing any of these options help?

jponnetCytomine commented 9 months ago

Hello, yes, it works now when I set pin_memory=False, but is it a good idea to keep pin_memory set to False? Thank you!

awaelchli commented 9 months ago

The default in PyTorch is False. It's a bit of a mysterious feature and I don't know much about best practices for it; I've never seen enabling it make a big difference in practice.
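Roughly, pin_memory=True asks the DataLoader to place batches in page-locked host memory so that host-to-GPU copies can run asynchronously. A minimal sketch of the pattern it enables (shapes and batch sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 224, 224))  # illustrative shapes
loader = DataLoader(dataset, batch_size=8, num_workers=2, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for (batch,) in loader:
    # With a pinned source tensor, non_blocking=True lets the copy overlap with GPU compute.
    batch = batch.to(device, non_blocking=True)
```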

awaelchli commented 9 months ago

Ok, closing this now since this was unrelated to Lightning. Sorry I couldn't give a clear answer about pin_memory, but turning it off shouldn't impact you negatively.