Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

StreamingDataset not working in multi-GPU environment #20140

Open davidpicard opened 4 months ago

davidpicard commented 4 months ago

Bug description

I'm trying to use the Streaming library by MosaicML as described in the docs (https://lightning.ai/docs/pytorch/stable/data/alternatives.html), but it doesn't work as seamlessly as expected.

When trying the default approach as described in the documentation with 4 GPUs on a single node:

import lightning as L
from torch.utils.data import DataLoader

# YourDataset is a StreamingDataset subclass as in the linked docs;
# model and batch_size are defined elsewhere.
train_dataset = YourDataset()
train_dataloader = DataLoader(train_dataset, batch_size=batch_size)
model = ...
trainer = L.Trainer()
trainer.fit(model, train_dataloader)

results in the following error:

FileExistsError: [Errno 17] File exists: '/000001_locals'

My understanding is that the environment variables (WORLD_SIZE, LOCAL_WORLD_SIZE, RANK, etc.) are not set in time, so the StreamingDataset does not realize it is running in a multi-GPU/multi-node setup and the different processes are unaware of each other.

Creating the dataset after the Trainer has been instantiated does not change the outcome either.
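For reference, the kind of quick check I have in mind (just a sketch; where exactly to call it, e.g. at the top of setup(), is arbitrary) is to print the torchrun-style variables at the point where the StreamingDataset is constructed:

import os

# Sketch: dump the distributed-launch variables right before constructing the
# StreamingDataset. If they are missing, each process presumably believes it is
# rank 0 of a world of size 1 and they all collide on the same local state.
for name in ("WORLD_SIZE", "LOCAL_WORLD_SIZE", "RANK", "LOCAL_RANK", "NODE_RANK"):
    print(name, "=", os.environ.get(name, "<not set>"))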

Has anyone been successful in running Streaming with Lightning in a multi-node/multi-GPU setup?

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

No response

Environment

Current environment

* CUDA:
  - GPU: None
  - available: False
  - version: 12.1
* Lightning:
  - ema-pytorch: 0.4.2
  - lightning: 2.2.4
  - lightning-utilities: 0.10.1
  - pytorch-lightning: 2.2.1
  - torch: 2.2.1
  - torchinfo: 1.8.0
  - torchmetrics: 1.3.0.post0
  - torchvision: 0.17.1
* Packages: absl-py: 2.1.0, accelerate: 0.28.0, aiohttp: 3.9.1, aiosignal: 1.3.1, antlr4-python3-runtime: 4.9.3, appdirs: 1.4.4, async-timeout: 4.0.3, attrs: 23.2.0, azure-core: 1.30.2, azure-identity: 1.17.1, azure-storage-blob: 12.21.0, azure-storage-file-datalake: 12.16.0, bcrypt: 4.2.0, beartype: 0.17.2, boto3: 1.34.149, botocore: 1.34.149, brotli: 1.1.0, cachetools: 5.3.2, certifi: 2023.11.17, cffi: 1.16.0, charset-normalizer: 3.3.2, circuitbreaker: 1.4.0, click: 8.1.7, configparser: 6.0.0, consistencydecoder: 1.0, contourpy: 1.2.1, cramjam: 2.8.3, cryptography: 42.0.8, cycler: 0.12.1, diffusers: 0.27.0, docker-pycreds: 0.4.0, einops: 0.7.0, ema-pytorch: 0.4.2, filelock: 3.13.1, fonttools: 4.53.1, frozenlist: 1.4.1, fsspec: 2023.12.2, gitdb: 4.0.11, gitpython: 3.1.41, google-api-core: 2.19.1, google-auth: 2.26.2, google-auth-oauthlib: 1.2.0, google-cloud-core: 2.4.1, google-cloud-storage: 2.10.0, google-crc32c: 1.5.0, google-resumable-media: 2.7.1, googleapis-common-protos: 1.63.2, grpcio: 1.60.0, huggingface-hub: 0.21.4, hydra-core: 1.3.2, idna: 3.6, importlib-metadata: 7.0.2, isodate: 0.6.1, jinja2: 3.1.3, jmespath: 1.0.1, kiwisolver: 1.4.5, lightning: 2.2.4, lightning-utilities: 0.10.1, lsuv: 0.2.2, markdown: 3.5.2, markupsafe: 2.1.3, matplotlib: 3.9.1, mosaicml-streaming: 0.7.6, mpmath: 1.3.0, msal: 1.30.0, msal-extensions: 1.2.0, multidict: 6.0.4, networkx: 3.2.1, numpy: 1.26.3, nvidia-cublas-cu12: 12.1.3.1, nvidia-cuda-cupti-cu12: 12.1.105, nvidia-cuda-nvrtc-cu12: 12.1.105, nvidia-cuda-runtime-cu12: 12.1.105, nvidia-cudnn-cu12: 8.9.2.26, nvidia-cufft-cu12: 11.0.2.54, nvidia-curand-cu12: 10.3.2.106, nvidia-cusolver-cu12: 11.4.5.107, nvidia-cusparse-cu12: 12.1.0.106, nvidia-nccl-cu12: 2.19.3, nvidia-nvjitlink-cu12: 12.3.101, nvidia-nvtx-cu12: 12.1.105, oauthlib: 3.2.2, oci: 2.129.4, omegaconf: 2.3.0, packaging: 23.2, paramiko: 3.4.0, pathtools: 0.1.2, pillow: 10.2.0, pip: 24.1.2, portalocker: 2.10.1, promise: 2.3, proto-plus: 1.24.0, protobuf: 4.23.4, psutil: 5.9.7, pyasn1: 0.5.1, pyasn1-modules: 0.3.0, pycparser: 2.22, pyjwt: 2.8.0, pynacl: 1.5.0, pyopenssl: 24.2.1, pyparsing: 3.1.2, python-dateutil: 2.8.2, python-snappy: 0.7.2, pytorch-lightning: 2.2.1, pytz: 2024.1, pyyaml: 6.0.1, regex: 2023.12.25, requests: 2.31.0, requests-oauthlib: 1.3.1, rsa: 4.9, s3transfer: 0.10.2, safetensors: 0.4.2, scipy: 1.12.0, sentry-sdk: 1.39.2, setproctitle: 1.3.3, setuptools: 58.1.0, shortuuid: 1.0.11, six: 1.16.0, smmap: 5.0.1, subprocess32: 3.5.4, sympy: 1.12, tensorboard: 2.15.1, tensorboard-data-server: 0.7.2, termcolor: 2.4.0, timm: 0.9.16, tokenizers: 0.19.1, torch: 2.2.1, torchinfo: 1.8.0, torchmetrics: 1.3.0.post0, torchvision: 0.17.1, tqdm: 4.66.1, transformers: 4.40.2, triton: 2.2.0, typing-extensions: 4.9.0, urllib3: 2.1.0, wandb: 0.16.6, werkzeug: 3.0.1, xxhash: 3.4.1, yarl: 1.9.4, yaspin: 3.0.1, zipp: 3.18.1, zstd: 1.5.5.1
* System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.10.4
  - release: 5.14.0-284.55.1.el9_2.x86_64
  - version: #1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024

More info

No response

awaelchli commented 4 months ago

@davidpicard Did you try putting the dataloader into the designated hooks, e.g., LightningModule.train_dataloader? The hooks are meant for exactly this, so that the dataloader only gets set up after the Trainer has launched the processes.
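Roughly like this (a minimal sketch; the dataset arguments and the cache path are placeholders, not a tested configuration):

import lightning as L
from torch.utils.data import DataLoader
from streaming import StreamingDataset


class LitModel(L.LightningModule):
    # ... training_step / configure_optimizers omitted ...

    def train_dataloader(self):
        # Build the StreamingDataset here, after the Trainer has spawned the
        # processes and the distributed environment variables are in place.
        dataset = StreamingDataset(local="/tmp/streaming_cache", shuffle=True, batch_size=32)
        return DataLoader(dataset, batch_size=32)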

davidpicard commented 4 months ago

I'm currently using a LightningDataModule, and the datasets are created in its setup() hook. The docs indicate that setup() runs on every GPU, so it should be the right place, no? If anything, I can try creating the dataset in the train_dataloader() method of the data module instead.
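i.e., something along these lines (a rough sketch of what I would try; local_dir and batch_size are placeholders for my actual config):

import lightning as L
from torch.utils.data import DataLoader
from streaming import StreamingDataset


class StreamingDataModule(L.LightningDataModule):
    def __init__(self, local_dir: str, batch_size: int = 32):
        super().__init__()
        self.local_dir = local_dir
        self.batch_size = batch_size

    def train_dataloader(self):
        # Moved here from setup(), so the dataset is only instantiated once
        # the per-GPU processes are fully up.
        dataset = StreamingDataset(local=self.local_dir, shuffle=True, batch_size=self.batch_size)
        return DataLoader(dataset, batch_size=self.batch_size)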

tedfeng424 commented 3 months ago

I ran into similar problems. Have you found a fix for this?