Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

"FileExistsError: [Errno 17] File exists: '/000000_epoch_shape'" using the ddp_notebook strategy with data stored in MDS (mosaic streaming) format #20226

Open elbamos opened 3 weeks ago

elbamos commented 3 weeks ago

Bug description

When training with the ddp_notebook strategy on data stored in MDS (Mosaic Streaming) format, I get the error above; the full stack trace is below.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

# `pretrainer` is my LightningModule; the dataloaders are built with
# mosaicml-streaming (see "More info" below).
trainer = pl.Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp_notebook',
    max_epochs=10,
    num_sanity_val_steps=0,
    callbacks=[
        EarlyStopping(monitor="pretrain_val_loss", patience=2, mode="min")
    ]
)

trainer.fit(pretrainer, train_dataloader, val_dataloaders=eval_dataloader)

Error messages and logs

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/databricks/python/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 173, in _wrapping_function
    results = function(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 212, in advance
    batch, _, __ = next(data_fetcher)
                   ^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/pytorch_lightning/utilities/combined_loader.py", line 78, in __next__
    out[i] = next(self.iterators[i])
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/streaming/base/dataloader.py", line 58, in __iter__
    for batch in super().__iter__():
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/databricks/python/lib/python3.11/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
FileExistsError: Caught FileExistsError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/streaming/base/dataset.py", line 1501, in __iter__
    sample_ids = self._get_work(epoch, sample_in_epoch)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/streaming/base/dataset.py", line 1038, in _get_work
    shape_shm, data_shm = self._share_work(epoch_sample_ids)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/streaming/base/dataset.py", line 953, in _share_work
    shape_shm = SharedMemory(name=name, create=True, size=size, auto_cleanup=False)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-db77e642-d9a8-44b4-bf30-526e1d89150e/lib/python3.11/site-packages/streaming/base/shared/memory.py", line 41, in __init__
    shm = BuiltinSharedMemory(name, create, size)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/shared_memory.py", line 104, in __init__
    self._fd = _posixshmem.shm_open(
               ^^^^^^^^^^^^^^^^^^^^^
FileExistsError: [Errno 17] File exists: '/000000_epoch_shape'

Environment

Current environment * CUDA: - GPU: - Tesla T4 - Tesla T4 - Tesla T4 - Tesla T4 - available: True - version: 12.1 * Lightning: - torch: 2.3.0+cu121 - torcheval: 0.0.7 - torchvision: 0.18.0+cu121 * Packages: - absl-py: 1.0.0 - accelerate: 0.30.1 - aiohttp: 3.8.5 - aiohttp-cors: 0.7.0 - aiosignal: 1.2.0 - anyio: 3.5.0 - argon2-cffi: 21.3.0 - argon2-cffi-bindings: 21.2.0 - astor: 0.8.1 - asttokens: 2.0.5 - astunparse: 1.6.3 - async-timeout: 4.0.2 - attrs: 22.1.0 - audioread: 3.0.1 - azure-core: 1.30.1 - azure-cosmos: 4.3.1 - azure-identity: 1.16.0 - azure-storage-blob: 12.19.1 - azure-storage-file-datalake: 12.14.0 - backcall: 0.2.0 - bcrypt: 3.2.0 - beautifulsoup4: 4.12.2 - black: 23.3.0 - bleach: 4.1.0 - blinker: 1.4 - blis: 0.7.11 - boto3: 1.34.39 - botocore: 1.34.39 - brotli: 1.0.9 - cachetools: 5.3.3 - catalogue: 2.0.10 - category-encoders: 2.6.3 - certifi: 2023.7.22 - cffi: 1.15.1 - chardet: 4.0.0 - charset-normalizer: 2.0.4 - circuitbreaker: 1.4.0 - click: 8.0.4 - cloudpathlib: 0.16.0 - cloudpickle: 2.2.1 - cmdstanpy: 1.2.2 - colorful: 0.5.6 - comm: 0.1.2 - confection: 0.1.4 - configparser: 5.2.0 - contourpy: 1.0.5 - cryptography: 41.0.3 - cycler: 0.11.0 - cymem: 2.0.8 - cython: 0.29.32 - dacite: 1.8.1 - databricks-automl-runtime: 0.2.21 - databricks-feature-engineering: 0.5.0 - databricks-sdk: 0.20.0 - dataclasses-json: 0.6.6 - datasets: 2.19.1 - dbl-tempo: 0.1.26 - dbus-python: 1.2.18 - debugpy: 1.6.7 - decorator: 5.1.1 - deepspeed: 0.14.0 - defusedxml: 0.7.1 - dill: 0.3.6 - diskcache: 5.6.3 - distlib: 0.3.8 - distro: 1.7.0 - distro-info: 1.1+ubuntu0.2 - dm-tree: 0.1.8 - einops: 0.8.0 - entrypoints: 0.4 - evaluate: 0.4.2 - executing: 0.8.3 - facets-overview: 1.1.1 - farama-notifications: 0.0.4 - fastjsonschema: 2.19.1 - fasttext: 0.9.2 - filelock: 3.13.4 - flash-attn: 2.5.8 - flask: 2.2.5 - flatbuffers: 24.3.25 - fonttools: 4.25.0 - frozenlist: 1.3.3 - fsspec: 2023.5.0 - future: 0.18.3 - gast: 0.4.0 - gitdb: 4.0.11 - gitpython: 3.1.27 - google-api-core: 2.18.0 - google-auth: 2.21.0 - google-auth-oauthlib: 1.0.0 - google-cloud-core: 2.4.1 - google-cloud-storage: 2.10.0 - google-crc32c: 1.5.0 - google-pasta: 0.2.0 - google-resumable-media: 2.7.0 - googleapis-common-protos: 1.63.0 - greenlet: 2.0.1 - grpcio: 1.60.0 - grpcio-status: 1.60.0 - gunicorn: 20.1.0 - gviz-api: 1.10.0 - gymnasium: 0.28.1 - h11: 0.14.0 - h5py: 3.10.0 - hjson: 3.1.0 - holidays: 0.45 - horovod: 0.28.1+db1 - htmlmin: 0.1.12 - httpcore: 1.0.5 - httplib2: 0.20.2 - httpx: 0.27.0 - huggingface-hub: 0.21.2 - idna: 3.4 - imagehash: 4.3.1 - imageio: 2.31.1 - imbalanced-learn: 0.11.0 - importlib-metadata: 6.0.0 - importlib-resources: 6.4.0 - ipyflow-core: 0.0.198 - ipykernel: 6.25.1 - ipython: 8.15.0 - ipython-genutils: 0.2.0 - ipywidgets: 7.7.2 - isodate: 0.6.1 - itsdangerous: 2.0.1 - jax-jumpy: 1.0.0 - jedi: 0.18.1 - jeepney: 0.7.1 - jinja2: 3.1.2 - jmespath: 0.10.0 - joblib: 1.2.0 - joblibspark: 0.5.1 - jsonpatch: 1.33 - jsonpointer: 2.4 - jsonschema: 4.17.3 - jupyter-client: 7.4.9 - jupyter-core: 5.3.0 - jupyter-server: 1.23.4 - jupyterlab-pygments: 0.1.2 - keras: 3.1.1 - keyring: 23.5.0 - kiwisolver: 1.4.4 - langchain: 0.1.20 - langchain-community: 0.0.38 - langchain-core: 0.1.52 - langchain-text-splitters: 0.0.2 - langcodes: 3.4.0 - langsmith: 0.1.63 - language-data: 1.2.0 - launchpadlib: 1.10.16 - lazr.restfulclient: 0.14.4 - lazr.uri: 1.0.6 - lazy-loader: 0.2 - libclang: 15.0.6.1 - librosa: 0.10.1 - lightgbm: 4.3.0 - linkify-it-py: 2.0.0 - llvmlite: 0.40.0 - lxml: 4.9.2 - lz4: 4.3.2 - mako: 1.2.0 - marisa-trie: 1.1.1 
- markdown: 3.4.1 - markdown-it-py: 2.2.0 - markupsafe: 2.1.1 - marshmallow: 3.21.2 - matplotlib: 3.7.2 - matplotlib-inline: 0.1.6 - mdit-py-plugins: 0.3.0 - mdurl: 0.1.0 - memray: 1.12.0 - mistune: 0.8.4 - ml-dtypes: 0.3.2 - mlflow-skinny: 2.11.3 - more-itertools: 8.10.0 - mosaicml-streaming: 0.7.4 - mpmath: 1.3.0 - msal: 1.28.0 - msal-extensions: 1.1.0 - msgpack: 1.0.8 - multidict: 6.0.2 - multimethod: 1.11.2 - multiprocess: 0.70.14 - murmurhash: 1.0.10 - mypy-extensions: 0.4.3 - namex: 0.0.8 - nbclassic: 0.5.5 - nbclient: 0.5.13 - nbconvert: 6.5.4 - nbformat: 5.7.0 - nest-asyncio: 1.5.6 - networkx: 3.1 - ninja: 1.11.1.1 - nltk: 3.8.1 - notebook: 6.5.4 - notebook-shim: 0.2.2 - numba: 0.57.1 - numpy: 1.23.5 - nvidia-cublas-cu12: 12.1.3.1 - nvidia-cuda-cupti-cu12: 12.1.105 - nvidia-cuda-nvrtc-cu12: 12.1.105 - nvidia-cuda-runtime-cu12: 12.1.105 - nvidia-cudnn-cu12: 8.9.2.26 - nvidia-cufft-cu12: 11.0.2.54 - nvidia-curand-cu12: 10.3.2.106 - nvidia-cusolver-cu12: 11.4.5.107 - nvidia-cusparse-cu12: 12.1.0.106 - nvidia-nccl-cu12: 2.20.5 - nvidia-nvjitlink-cu12: 12.5.40 - nvidia-nvtx-cu12: 12.1.105 - oauthlib: 3.2.0 - oci: 2.126.4 - openai: 1.29.0 - opencensus: 0.11.4 - opencensus-context: 0.1.3 - opt-einsum: 3.3.0 - optree: 0.11.0 - orjson: 3.10.3 - packaging: 23.2 - pandas: 1.5.3 - pandocfilters: 1.5.0 - paramiko: 3.4.0 - parso: 0.8.3 - pathspec: 0.10.3 - patsy: 0.5.3 - petastorm: 0.12.1 - pexpect: 4.8.0 - phik: 0.12.4 - pickleshare: 0.7.5 - pillow: 9.4.0 - pip: 23.2.1 - platformdirs: 3.10.0 - plotly: 5.9.0 - pmdarima: 2.0.4 - pooch: 1.8.1 - portalocker: 2.8.2 - preshed: 3.0.9 - prometheus-client: 0.14.1 - prompt-toolkit: 3.0.36 - prophet: 1.1.5 - proto-plus: 1.23.0 - protobuf: 4.24.1 - psutil: 5.9.0 - psycopg2: 2.9.3 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - py-cpuinfo: 8.0.0 - py-spy: 0.3.14 - pyarrow: 14.0.1 - pyarrow-hotfix: 0.6 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pybind11: 2.12.0 - pyccolo: 0.0.52 - pycparser: 2.21 - pydantic: 1.10.6 - pygments: 2.15.1 - pygobject: 3.42.1 - pyjwt: 2.3.0 - pynacl: 1.5.0 - pynvml: 11.5.0 - pyodbc: 4.0.38 - pyopenssl: 23.2.0 - pyparsing: 3.0.9 - pyrsistent: 0.18.0 - pytesseract: 0.3.10 - python-apt: 2.4.0+ubuntu3 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-lsp-jsonrpc: 1.1.1 - python-snappy: 0.6.1 - pytz: 2022.7 - pywavelets: 1.4.1 - pyyaml: 6.0 - pyzmq: 23.2.0 - ray: 2.12.0 - regex: 2022.7.9 - requests: 2.31.0 - requests-oauthlib: 1.3.1 - rich: 13.7.1 - rsa: 4.9 - s3transfer: 0.10.1 - safetensors: 0.4.2 - scikit-image: 0.20.0 - scikit-learn: 1.3.0 - scipy: 1.11.1 - seaborn: 0.12.2 - secretstorage: 3.3.1 - send2trash: 1.8.0 - sentence-transformers: 2.7.0 - sentencepiece: 0.1.99 - setuptools: 68.0.0 - shap: 0.44.0 - simplejson: 3.17.6 - six: 1.16.0 - slicer: 0.0.7 - smart-open: 5.2.1 - smmap: 5.0.0 - sniffio: 1.2.0 - soundfile: 0.12.1 - soupsieve: 2.4 - soxr: 0.3.7 - spacy: 3.7.2 - spacy-legacy: 3.0.12 - spacy-loggers: 1.0.5 - spark-tensorflow-distributor: 1.0.0 - sqlalchemy: 1.4.39 - sqlparse: 0.4.2 - srsly: 2.4.8 - ssh-import-id: 5.11 - stack-data: 0.2.0 - stanio: 0.5.0 - statsmodels: 0.14.0 - sympy: 1.11.1 - tangled-up-in-unicode: 0.2.0 - tenacity: 8.2.2 - tensorboard: 2.16.2 - tensorboard-data-server: 0.7.2 - tensorboard-plugin-profile: 2.15.1 - tensorboardx: 2.6.2.2 - tensorflow: 2.16.1 - tensorflow-estimator: 2.15.0 - tensorflow-io-gcs-filesystem: 0.37.0 - termcolor: 2.4.0 - terminado: 0.17.1 - textual: 0.63.3 - tf-keras: 2.16.0 - thinc: 8.2.3 - threadpoolctl: 2.2.0 - tifffile: 2021.7.2 - tiktoken: 0.5.2 - tinycss2: 1.2.1 - tokenize-rt: 
4.2.1 - tokenizers: 0.19.0 - torch: 2.3.0+cu121 - torcheval: 0.0.7 - torchvision: 0.18.0+cu121 - tornado: 6.3.2 - tqdm: 4.65.0 - traitlets: 5.7.1 - transformers: 4.40.2 - triton: 2.3.0 - typeguard: 2.13.3 - typer: 0.9.4 - typing-extensions: 4.10.0 - typing-inspect: 0.9.0 - tzdata: 2022.1 - uc-micro-py: 1.0.1 - ujson: 5.4.0 - unattended-upgrades: 0.1 - urllib3: 1.26.16 - virtualenv: 20.24.2 - visions: 0.7.5 - wadllib: 1.3.6 - wasabi: 1.1.2 - wcwidth: 0.2.5 - weasel: 0.3.4 - webencodings: 0.5.1 - websocket-client: 0.58.0 - werkzeug: 2.2.3 - wheel: 0.38.4 - wordcloud: 1.9.3 - wrapt: 1.14.1 - xgboost: 2.0.3 - xxhash: 3.4.1 - yarl: 1.8.1 - ydata-profiling: 4.5.1 - zipp: 3.11.0 - zstd: 1.5.5.1 * System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.11.0rc1 - release: 5.15.0-1065-aws - version: #71~20.04.1-Ubuntu SMP Fri Jun 28 19:58:04 UTC 2024

More info

I suspect this is an incompatibility between PyTorch Lightning and Mosaic Streaming (mosaicml-streaming). The Mosaic code that loads the datasets is:

from streaming.base.util import clean_stale_shared_memory
from streaming import StreamingDataset, StreamingDataLoader

data_storage_location = ...
experiment_name = ...

def get_dataloader_with_mosaic(path, batch_size, shuffle=False):
    # Utility function to clean up stale shared memory during distributed training
    clean_stale_shared_memory()

    # Create the `StreamingDataset` and wrap it in a `StreamingDataLoader`.
    dataset = StreamingDataset(local=path, shuffle=shuffle, batch_size=batch_size)
    return StreamingDataLoader(dataset, batch_size=batch_size, num_workers=31, drop_last=True, persistent_workers=True), dataset

eval_dataloader, eval_dataset = get_dataloader_with_mosaic(f"{data_storage_location}/mds_{experiment_name}_val", batch_size=256, shuffle=False)
train_dataloader, train_dataset = get_dataloader_with_mosaic(f"{data_storage_location}/mds_{experiment_name}_train", batch_size=32, shuffle=True)

elbamos commented 3 weeks ago

Having traced through the code, I suspect that the ddp_notebook strategy is not setting, in the forked processes, the environment variables that mosaicml-streaming expects. The StreamingDataset in every forked process therefore thinks it is rank 0, so they all try to create the same shared-memory file (a quick check is sketched below).
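
One hypothetical way to confirm this, assuming the rank really comes from these environment variables: print them from inside each DDP process (e.g. from the LightningModule's setup() hook) and see which ones ddp_notebook actually populates.

import os

# Print the distributed env vars inside each spawned/forked process.
for var in ("RANK", "LOCAL_RANK", "NODE_RANK", "WORLD_SIZE", "LOCAL_WORLD_SIZE"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")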

elbamos commented 3 weeks ago

It appears to me that PyTorch Lightning is setting LOCAL_RANK and NODE_RANK but not RANK, which mosaicml-streaming expects (and which PyTorch itself sets). https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel Is there any hope of changing this on the Lightning side of things? As a stopgap, something like the callback sketched below might work on my side.
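
A minimal sketch of that stopgap, assuming the failure really is just the missing RANK variable; SetRankEnv is a made-up name and this is not a confirmed fix:

import os
import pytorch_lightning as pl

class SetRankEnv(pl.Callback):
    # Hypothetical workaround: copy Lightning's rank information into the
    # environment variables that mosaicml-streaming appears to read, before
    # the StreamingDataLoader workers start iterating.
    def setup(self, trainer, pl_module, stage):
        # setdefault keeps any value a proper launcher has already set.
        os.environ.setdefault("RANK", str(trainer.global_rank))
        os.environ.setdefault("WORLD_SIZE", str(trainer.world_size))

It would be passed in the Trainer's callbacks list alongside the EarlyStopping callback shown in the reproduction above.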