Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Script freezes when Trainer is instantiated #19768

Closed PabloVD closed 3 months ago

PabloVD commented 5 months ago

Bug description

I can run a training script with pytorch-lightning once. However, after the training finishes, if I try to run it again, the code freezes when the L.Trainer is instantiated. There are no error messages.

Only after I shut down and restart can I run it once again, but then the problem persists on the next run.

This happens to me with different scripts, even with the "lightning in 15 minutes" example.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

# Based on https://lightning.ai/docs/pytorch/stable/starter/introduction.html

import os
import torch
from torch import optim, nn, utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import pytorch_lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, train=True, transform=ToTensor())
# use 20% of training data for validation
train_set_size = int(len(dataset) * 0.8)
valid_set_size = len(dataset) - train_set_size
seed = torch.Generator().manual_seed(42)
train_set, val_set = utils.data.random_split(dataset, [train_set_size, valid_set_size], generator=seed)
train_loader = utils.data.DataLoader(train_set, num_workers=15)
valid_loader = utils.data.DataLoader(val_set, num_workers=15)

print("Before instantiate Trainer")
# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=10, check_val_every_n_epoch=10, accelerator="gpu")
print("After instantiate Trainer")

Error messages and logs

There are no error messages

Environment

Current environment * CUDA: - GPU: - NVIDIA GeForce RTX 3080 Laptop GPU - available: True - version: 12.1 * Lightning: - denoising-diffusion-pytorch: 1.5.4 - ema-pytorch: 0.2.1 - lightning-utilities: 0.11.2 - pytorch-fid: 0.3.0 - pytorch-lightning: 2.2.2 - torch: 2.2.2 - torchaudio: 2.2.2 - torchmetrics: 1.0.0 - torchvision: 0.17.2 * Packages: - absl-py: 1.4.0 - accelerate: 0.17.1 - addict: 2.4.0 - aiohttp: 3.8.3 - aiosignal: 1.2.0 - antlr4-python3-runtime: 4.9.3 - anyio: 3.6.1 - appdirs: 1.4.4 - argon2-cffi: 21.3.0 - argon2-cffi-bindings: 21.2.0 - array-record: 0.4.0 - arrow: 1.2.3 - astropy: 5.2.1 - asttokens: 2.0.8 - astunparse: 1.6.3 - async-timeout: 4.0.2 - attrs: 23.1.0 - auditwheel: 5.4.0 - babel: 2.10.3 - backcall: 0.2.0 - beautifulsoup4: 4.11.1 - bleach: 5.0.1 - blinker: 1.6.2 - bqplot: 0.12.40 - branca: 0.6.0 - build: 1.2.1 - cachetools: 5.2.0 - carla: 0.9.14 - certifi: 2024.2.2 - cffi: 1.15.1 - chardet: 5.1.0 - charset-normalizer: 2.1.1 - click: 8.1.3 - click-plugins: 1.1.1 - cligj: 0.7.2 - cloudpickle: 3.0.0 - cmake: 3.26.1 - colossus: 1.3.1 - colour: 0.1.5 - contourpy: 1.0.7 - cycler: 0.11.0 - cython: 0.29.32 - dacite: 1.8.1 - dask: 2023.3.1 - dataclass-array: 1.4.1 - debugpy: 1.6.3 - decorator: 4.4.2 - deepspeed: 0.7.2 - defusedxml: 0.7.1 - denoising-diffusion-pytorch: 1.5.4 - deprecation: 2.1.0 - dill: 0.3.6 - distlib: 0.3.6 - dm-tree: 0.1.8 - docker-pycreds: 0.4.0 - docstring-parser: 0.15 - einops: 0.6.0 - einsum: 0.3.0 - ema-pytorch: 0.2.1 - etils: 1.3.0 - exceptiongroup: 1.2.0 - executing: 1.0.0 - farama-notifications: 0.0.4 - fastjsonschema: 2.16.1 - filelock: 3.8.0 - fiona: 1.9.3 - flask: 2.3.3 - flatbuffers: 24.3.25 - folium: 0.14.0 - fonttools: 4.37.1 - frozenlist: 1.3.1 - fsspec: 2022.8.2 - future: 1.0.0 - fvcore: 0.1.5.post20221221 - gast: 0.4.0 - gdown: 4.7.1 - geojson: 3.0.1 - geopandas: 0.12.2 - gitdb: 4.0.11 - gitpython: 3.1.43 - google-auth: 2.16.2 - google-auth-oauthlib: 0.4.6 - google-pasta: 0.2.0 - googleapis-common-protos: 1.63.0 - 
googledrivedownloader: 0.4 - gputil: 1.4.0 - gpxpy: 1.5.0 - grpcio: 1.62.1 - gunicorn: 20.0.4 - gym: 0.26.2 - gym-notices: 0.0.8 - gymnasium: 0.28.1 - h5py: 3.7.0 - haversine: 2.8.0 - hdf5plugin: 4.1.1 - hjson: 3.1.0 - humanfriendly: 10.0 - idna: 3.6 - imageio: 2.31.3 - imageio-ffmpeg: 0.4.7 - immutabledict: 2.2.0 - importlib-metadata: 4.12.0 - importlib-resources: 6.1.0 - imutils: 0.5.4 - invertedai: 0.0.8.post1 - iopath: 0.1.10 - ipyevents: 2.0.2 - ipyfilechooser: 0.6.0 - ipykernel: 6.15.3 - ipyleaflet: 0.17.4 - ipython: 8.5.0 - ipython-genutils: 0.2.0 - ipytree: 0.2.2 - ipywidgets: 8.0.2 - itsdangerous: 2.1.2 - jax-jumpy: 1.0.0 - jedi: 0.18.1 - jinja2: 3.1.2 - joblib: 1.4.0 - jplephem: 2.19 - json5: 0.9.10 - jsonargparse: 4.15.0 - jsonschema: 4.19.1 - jsonschema-specifications: 2023.7.1 - jstyleson: 0.0.2 - julia: 0.6.1 - jupyter: 1.0.0 - jupyter-client: 7.3.5 - jupyter-console: 6.4.4 - jupyter-core: 4.11.1 - jupyter-packaging: 0.12.3 - jupyter-server: 1.18.1 - jupyterlab: 3.4.7 - jupyterlab-pygments: 0.2.2 - jupyterlab-server: 2.15.1 - jupyterlab-widgets: 3.0.3 - keras: 2.11.0 - kiwisolver: 1.4.4 - lanelet2: 1.2.1 - lark: 1.1.9 - lazy-loader: 0.2 - leafmap: 0.27.0 - libclang: 14.0.6 - lightning-utilities: 0.11.2 - lit: 16.0.0 - llvmlite: 0.39.1 - locket: 1.0.0 - lunarsky: 0.2.1 - lxml: 4.9.1 - lz4: 4.3.3 - markdown: 3.4.1 - markdown-it-py: 2.2.0 - markupsafe: 2.1.1 - matplotlib: 3.6.1 - matplotlib-inline: 0.1.6 - mdurl: 0.1.2 - mistune: 2.0.4 - moviepy: 1.0.3 - mpi4py: 3.1.3 - mpmath: 1.3.0 - msgpack: 1.0.8 - multidict: 6.0.2 - munch: 2.5.0 - natsort: 8.2.0 - nbclassic: 0.4.3 - nbclient: 0.6.8 - nbconvert: 7.0.0 - nbformat: 5.5.0 - nest-asyncio: 1.5.5 - networkx: 2.8.6 - ninja: 1.10.2.3 - notebook: 6.4.12 - notebook-shim: 0.1.0 - numba: 0.56.4 - numpy: 1.24.4 - nvidia-cublas-cu11: 11.10.3.66 - nvidia-cublas-cu12: 12.1.3.1 - nvidia-cuda-cupti-cu11: 11.7.101 - nvidia-cuda-cupti-cu12: 12.1.105 - nvidia-cuda-nvrtc-cu11: 11.7.99 - nvidia-cuda-nvrtc-cu12: 12.1.105 - 
nvidia-cuda-runtime-cu11: 11.7.99 - nvidia-cuda-runtime-cu12: 12.1.105 - nvidia-cudnn-cu11: 8.5.0.96 - nvidia-cudnn-cu12: 8.9.2.26 - nvidia-cufft-cu11: 10.9.0.58 - nvidia-cufft-cu12: 11.0.2.54 - nvidia-curand-cu11: 10.2.10.91 - nvidia-curand-cu12: 10.3.2.106 - nvidia-cusolver-cu11: 11.4.0.1 - nvidia-cusolver-cu12: 11.4.5.107 - nvidia-cusparse-cu11: 11.7.4.91 - nvidia-cusparse-cu12: 12.1.0.106 - nvidia-nccl-cu11: 2.14.3 - nvidia-nccl-cu12: 2.19.3 - nvidia-nvjitlink-cu12: 12.4.127 - nvidia-nvtx-cu11: 11.7.91 - nvidia-nvtx-cu12: 12.1.105 - oauthlib: 3.2.2 - omegaconf: 2.3.0 - open-humans-api: 0.2.9 - opencv-python: 4.6.0.66 - openexr: 1.3.9 - opt-einsum: 3.3.0 - osmnx: 1.2.2 - p5py: 1.0.0 - packaging: 21.3 - pandas: 1.5.3 - pandocfilters: 1.5.0 - parso: 0.8.3 - partd: 1.4.1 - pep517: 0.13.0 - pickleshare: 0.7.5 - pillow: 9.2.0 - pint: 0.21.1 - pip: 24.0 - pkgconfig: 1.5.5 - pkgutil-resolve-name: 1.3.10 - platformdirs: 2.5.2 - plotly: 5.13.1 - plyfile: 0.8.1 - portalocker: 2.8.2 - powerbox: 0.7.1 - prettymapp: 0.1.0 - proglog: 0.1.10 - prometheus-client: 0.14.1 - promise: 2.3 - prompt-toolkit: 3.0.31 - protobuf: 3.19.6 - psutil: 5.9.2 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - py-cpuinfo: 8.0.0 - pyarrow: 10.0.0 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pycocotools: 2.0 - pycosat: 0.6.3 - pycparser: 2.21 - pydantic: 1.10.9 - pydeprecate: 0.3.1 - pydub: 0.25.1 - pyelftools: 0.30 - pyerfa: 2.0.0.1 - pyfftw: 0.13.1 - pygame: 2.1.2 - pygments: 2.13.0 - pylians: 0.7 - pyparsing: 3.0.9 - pyproj: 3.5.0 - pyproject-hooks: 1.0.0 - pyquaternion: 0.9.9 - pyrsistent: 0.18.1 - pyshp: 2.3.1 - pysocks: 1.7.1 - pysr: 0.16.3 - pystac: 1.8.4 - pystac-client: 0.7.5 - python-box: 7.1.1 - python-dateutil: 2.8.2 - pytorch-fid: 0.3.0 - pytorch-lightning: 2.2.2 - pytz: 2022.2.1 - pywavelets: 1.4.1 - pyyaml: 6.0 - pyzmq: 23.2.1 - qtconsole: 5.3.2 - qtpy: 2.2.0 - ray: 2.10.0 - referencing: 0.30.2 - requests: 2.31.0 - requests-oauthlib: 1.3.1 - rich: 13.3.4 - rpds-py: 0.10.3 - rsa: 4.9 - rtree: 
1.0.1 - ruamel.yaml: 0.17.21 - ruamel.yaml.clib: 0.2.7 - scikit-build-core: 0.8.2 - scikit-image: 0.20.0 - scikit-learn: 1.2.2 - scipy: 1.8.1 - scooby: 0.7.4 - seaborn: 0.12.2 - send2trash: 1.8.0 - sentry-sdk: 1.44.1 - setproctitle: 1.3.3 - setuptools: 67.6.0 - shapely: 1.8.0 - shellingham: 1.5.4 - six: 1.16.0 - sklearn: 0.0.post1 - smmap: 5.0.1 - sniffio: 1.3.0 - soupsieve: 2.3.2.post1 - spiceypy: 6.0.0 - stack-data: 0.5.0 - stravalib: 1.4 - swagger-client: 1.0.0 - sympy: 1.11.1 - tabulate: 0.9.0 - taichi: 1.5.0 - tenacity: 8.2.3 - tensorboard: 2.11.2 - tensorboard-data-server: 0.6.1 - tensorboard-plugin-wit: 1.8.1 - tensorboardx: 2.6.2.2 - tensorflow: 2.11.0 - tensorflow-addons: 0.21.0 - tensorflow-datasets: 4.9.0 - tensorflow-estimator: 2.11.0 - tensorflow-graphics: 2021.12.3 - tensorflow-io-gcs-filesystem: 0.29.0 - tensorflow-metadata: 1.13.0 - tensorflow-probability: 0.19.0 - termcolor: 2.1.1 - terminado: 0.15.0 - threadpoolctl: 3.1.0 - tifffile: 2023.3.21 - timm: 0.4.12 - tinycss2: 1.1.1 - toml: 0.10.2 - tomli: 2.0.1 - tomlkit: 0.11.4 - toolz: 0.12.1 - torch: 2.2.2 - torchaudio: 2.2.2 - torchmetrics: 1.0.0 - torchvision: 0.17.2 - tornado: 6.2 - tqdm: 4.66.2 - tr: 1.0.0.2 - trafficgen: 0.0.0 - traitlets: 5.4.0 - traittypes: 0.2.1 - trimesh: 4.3.0 - triton: 2.2.0 - typeguard: 2.13.3 - typer: 0.12.2 - typing-extensions: 4.11.0 - urllib3: 1.26.15 - virtualenv: 20.16.5 - visu3d: 1.5.1 - wandb: 0.16.5 - waymo-open-dataset-tf-2-11-0: 1.6.1 - wcwidth: 0.2.5 - webencodings: 0.5.1 - websocket-client: 1.4.1 - werkzeug: 2.3.7 - wheel: 0.37.1 - whitebox: 2.3.1 - whiteboxgui: 2.3.0 - widgetsnbextension: 4.0.3 - wrapt: 1.14.1 - xyzservices: 2023.7.0 - yacs: 0.1.8 - yapf: 0.30.0 - yarl: 1.8.1 - zipp: 3.8.1 * System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.8.19 - release: 5.15.0-102-generic - version: #112~20.04.1-Ubuntu SMP Thu Mar 14 14:28:24 UTC 2024

More info

No response

PabloVD commented 4 months ago

A couple of updates on my issue:

print("Before instantiate Trainer")
trainer = L.Trainer()
print("After instantiate Trainer")

- The same issue also occurs on a different machine, a remote server running Ubuntu 20.04, even with the super simple example above. I have tried different versions of torch and lightning, and the same thing happens with all of them.

Does anybody know what is going on?

PabloVD commented 4 months ago

Another update: the program never prints the info about the available GPU, TPU, etc., so it freezes before that point. To find out exactly where, I put some prints inside the lightning.Trainer init and found that it gets stuck on the line self._accelerator_connector = _AcceleratorConnector, so that may be what causes the issue, but I am not sure what exactly the problem is.
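
Manual prints work, but the standard library's faulthandler module can also report where a hung call is stuck without editing Lightning's source. The following is a minimal sketch, not something taken from this thread: `run_with_watchdog` and `slow_call` are hypothetical names, and `slow_call` merely stands in for the `L.Trainer(...)` call from the report.

```python
import faulthandler
import sys
import time


def slow_call(delay):
    # Hypothetical stand-in for the call that hangs (L.Trainer(...) in the report).
    time.sleep(delay)
    return "done"


def run_with_watchdog(func, *args, timeout=5.0, file=sys.stderr, **kwargs):
    """Run func(*args, **kwargs); if it is still running after `timeout`
    seconds, dump the traceback of every thread to `file` and keep waiting."""
    faulthandler.dump_traceback_later(timeout, exit=False, file=file)
    try:
        return func(*args, **kwargs)
    finally:
        # Disarm the watchdog once the call has returned (or raised).
        faulthandler.cancel_dump_traceback_later()
```

In the reproduction script this would amount to something like `trainer = run_with_watchdog(L.Trainer, accelerator="gpu", timeout=30)`. If the constructor hangs, the dumped frames show which line inside it is blocking.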

v-ngangarapu commented 4 months ago

I faced a similar issue. It actually freezes waiting for the release of a thread lock that doesn't exist. After downgrading Python from 3.11 to 3.9 or 3.10, the Trainer stopped freezing.

Check if this helps.

PabloVD commented 4 months ago

Yes, it seems that with Python 3.10 it does not freeze anymore. Thanks for the answer!

awaelchli commented 3 months ago

The code freezes (and then should crash) because it uses num_workers>0 for multiprocessing, but the script does not guard the entry point with if __name__ == "__main__", which is a requirement for multiprocessing here.
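
To illustrate why the guard matters: when worker processes are started with the `spawn` start method, each child re-imports the main script under the module name `__mp_main__`, so unguarded top-level code runs again in every worker while guarded code does not. The sketch below simulates that re-import with `runpy` instead of starting real processes. The script body, the file name `train.py`, and the helper `run_as` are made up for illustration. The thread does not say which start method was in use, so treat this as a picture of the mechanism rather than a diagnosis.

```python
import os
import runpy
import tempfile
import textwrap

# Hypothetical stand-in for a training script.
SCRIPT = textwrap.dedent(
    """
    events = ["top-level code ran"]

    def main():
        # DataLoader(num_workers=...) and Trainer(...) would be created here.
        events.append("main() ran")

    if __name__ == "__main__":
        main()
    """
)


def run_as(run_name):
    """Execute SCRIPT with __name__ set to run_name and return its events."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        return runpy.run_path(path, run_name=run_name)["events"]


as_script = run_as("__main__")     # like `python train.py`
as_worker = run_as("__mp_main__")  # like a spawned worker re-importing it

print(as_script)  # ['top-level code ran', 'main() ran']
print(as_worker)  # ['top-level code ran']
```

Applied to the reproduction script above, this would mean moving the dataset, DataLoader, and Trainer setup into a `main()` function that is called only from under the guard.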