Open davidpicard opened 4 months ago
@davidpicard Did you try putting the dataloader into the designated hooks, e.g., `LightningModule.train_dataloader`?
The hooks are meant for exactly this, so that the dataloader only gets set up once the Trainer has launched.
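Roughly like this minimal sketch (the remote bucket, local cache path, and batch size are placeholders, not values from your setup):

```python
import lightning as L
from torch.utils.data import DataLoader
from streaming import StreamingDataset


class MyModel(L.LightningModule):
    # training_step / configure_optimizers omitted for brevity

    def train_dataloader(self):
        # This hook runs inside each launched process, after the Trainer has set
        # up distributed training, so WORLD_SIZE, LOCAL_WORLD_SIZE and RANK are
        # already populated when StreamingDataset reads them.
        dataset = StreamingDataset(
            remote="s3://my-bucket/train",  # placeholder remote path
            local="/tmp/streaming_cache",   # placeholder local cache
            shuffle=True,
            batch_size=32,
        )
        return DataLoader(dataset, batch_size=32, num_workers=4)
```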
I'm currently using a `LightningDataModule`, and the datasets are created in its `setup()` hook. The docs indicate that `setup()` is called on every GPU, so it should be the right place, no? I can try creating the dataset in the `train_dataloader()` method of the data module instead, if anything.
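Something along these lines, i.e. a minimal sketch where the dataset is only instantiated inside `train_dataloader()` (the remote/local paths and batch size are placeholders):

```python
import lightning as L
from torch.utils.data import DataLoader
from streaming import StreamingDataset


class StreamingDataModule(L.LightningDataModule):
    def __init__(self, remote: str, local: str, batch_size: int = 32):
        super().__init__()
        self.remote = remote
        self.local = local
        self.batch_size = batch_size

    def train_dataloader(self):
        # Deferring StreamingDataset construction to this hook means it only
        # happens once the per-rank processes (and their env vars) exist.
        dataset = StreamingDataset(
            remote=self.remote,
            local=self.local,
            shuffle=True,
            batch_size=self.batch_size,
        )
        return DataLoader(dataset, batch_size=self.batch_size, num_workers=4)
```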
I ran into similar problems. Have you found a fix for this?
Bug description
I'm trying to use the Streaming library by MosaicML as described in the docs (https://lightning.ai/docs/pytorch/stable/data/alternatives.html), but it doesn't work as seamlessly as expected.
Using the default approach described in the documentation with 4 GPUs on a single node results in the following error:
My understanding is that the environment variables (`WORLD_SIZE`, `LOCAL_WORLD_SIZE`, `RANK`, etc.) are not set in time, so the `StreamingDataset` does not know it is running in a multi-GPU/multi-node job, and the different processes are therefore unaware of each other.
Creating the dataset after the Trainer does not change the outcome.
Has anyone been successful in running Streaming with Lightning in a multi-node/multi-GPU setup?
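For reference, a minimal sketch of the kind of setup that hits this (the paths, batch size, and `MyLightningModule` are placeholders, not my actual code):

```python
import lightning as L
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# The dataset is built in the main process, before Trainer(devices=4) launches
# the per-GPU processes, so WORLD_SIZE / LOCAL_WORLD_SIZE / RANK are still
# unset when StreamingDataset decides how to partition its shards.
dataset = StreamingDataset(
    remote="s3://my-bucket/train",  # placeholder remote path
    local="/tmp/streaming_cache",   # placeholder local cache
    shuffle=True,
    batch_size=32,
)
train_loader = DataLoader(dataset, batch_size=32, num_workers=4)

trainer = L.Trainer(accelerator="gpu", devices=4, strategy="ddp")
trainer.fit(MyLightningModule(), train_dataloaders=train_loader)  # MyLightningModule is a hypothetical module
```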
What version are you seeing the problem on?
v2.2
How to reproduce the bug
No response
Error messages and logs
No response
Environment
Current environment
* CUDA:
  - GPU: None
  - available: False
  - version: 12.1
* Lightning:
  - ema-pytorch: 0.4.2
  - lightning: 2.2.4
  - lightning-utilities: 0.10.1
  - pytorch-lightning: 2.2.1
  - torch: 2.2.1
  - torchinfo: 1.8.0
  - torchmetrics: 1.3.0.post0
  - torchvision: 0.17.1
* Packages: - absl-py: 2.1.0 - accelerate: 0.28.0 - aiohttp: 3.9.1 - aiosignal: 1.3.1 - antlr4-python3-runtime: 4.9.3 - appdirs: 1.4.4 - async-timeout: 4.0.3 - attrs: 23.2.0 - azure-core: 1.30.2 - azure-identity: 1.17.1 - azure-storage-blob: 12.21.0 - azure-storage-file-datalake: 12.16.0 - bcrypt: 4.2.0 - beartype: 0.17.2 - boto3: 1.34.149 - botocore: 1.34.149 - brotli: 1.1.0 - cachetools: 5.3.2 - certifi: 2023.11.17 - cffi: 1.16.0 - charset-normalizer: 3.3.2 - circuitbreaker: 1.4.0 - click: 8.1.7 - configparser: 6.0.0 - consistencydecoder: 1.0 - contourpy: 1.2.1 - cramjam: 2.8.3 - cryptography: 42.0.8 - cycler: 0.12.1 - diffusers: 0.27.0 - docker-pycreds: 0.4.0 - einops: 0.7.0 - ema-pytorch: 0.4.2 - filelock: 3.13.1 - fonttools: 4.53.1 - frozenlist: 1.4.1 - fsspec: 2023.12.2 - gitdb: 4.0.11 - gitpython: 3.1.41 - google-api-core: 2.19.1 - google-auth: 2.26.2 - google-auth-oauthlib: 1.2.0 - google-cloud-core: 2.4.1 - google-cloud-storage: 2.10.0 - google-crc32c: 1.5.0 - google-resumable-media: 2.7.1 - googleapis-common-protos: 1.63.2 - grpcio: 1.60.0 - huggingface-hub: 0.21.4 - hydra-core: 1.3.2 - idna: 3.6 - importlib-metadata: 7.0.2 - isodate: 0.6.1 - jinja2: 3.1.3 - jmespath: 1.0.1 - kiwisolver: 1.4.5 - lightning: 2.2.4 - lightning-utilities: 0.10.1 - lsuv: 0.2.2 - markdown: 3.5.2 - markupsafe: 2.1.3 - matplotlib: 3.9.1 - mosaicml-streaming: 0.7.6 - mpmath: 1.3.0 - msal: 1.30.0 - msal-extensions: 1.2.0 - multidict: 6.0.4 - networkx: 3.2.1 - numpy: 1.26.3 - nvidia-cublas-cu12: 12.1.3.1 - nvidia-cuda-cupti-cu12: 12.1.105 - nvidia-cuda-nvrtc-cu12: 12.1.105 - nvidia-cuda-runtime-cu12: 12.1.105 - nvidia-cudnn-cu12: 8.9.2.26 - nvidia-cufft-cu12: 11.0.2.54 - nvidia-curand-cu12: 10.3.2.106 - nvidia-cusolver-cu12: 11.4.5.107 - nvidia-cusparse-cu12: 12.1.0.106 - nvidia-nccl-cu12: 2.19.3 - nvidia-nvjitlink-cu12: 12.3.101 - nvidia-nvtx-cu12: 12.1.105 - oauthlib: 3.2.2 - oci: 2.129.4 - omegaconf: 2.3.0 - packaging: 23.2 - paramiko: 3.4.0 - pathtools: 0.1.2 - pillow: 10.2.0 - pip: 24.1.2 - portalocker: 2.10.1 - promise: 2.3 - proto-plus: 1.24.0 - protobuf: 4.23.4 - psutil: 5.9.7 - pyasn1: 0.5.1 - pyasn1-modules: 0.3.0 - pycparser: 2.22 - pyjwt: 2.8.0 - pynacl: 1.5.0 - pyopenssl: 24.2.1 - pyparsing: 3.1.2 - python-dateutil: 2.8.2 - python-snappy: 0.7.2 - pytorch-lightning: 2.2.1 - pytz: 2024.1 - pyyaml: 6.0.1 - regex: 2023.12.25 - requests: 2.31.0 - requests-oauthlib: 1.3.1 - rsa: 4.9 - s3transfer: 0.10.2 - safetensors: 0.4.2 - scipy: 1.12.0 - sentry-sdk: 1.39.2 - setproctitle: 1.3.3 - setuptools: 58.1.0 - shortuuid: 1.0.11 - six: 1.16.0 - smmap: 5.0.1 - subprocess32: 3.5.4 - sympy: 1.12 - tensorboard: 2.15.1 - tensorboard-data-server: 0.7.2 - termcolor: 2.4.0 - timm: 0.9.16 - tokenizers: 0.19.1 - torch: 2.2.1 - torchinfo: 1.8.0 - torchmetrics: 1.3.0.post0 - torchvision: 0.17.1 - tqdm: 4.66.1 - transformers: 4.40.2 - triton: 2.2.0 - typing-extensions: 4.9.0 - urllib3: 2.1.0 - wandb: 0.16.6 - werkzeug: 3.0.1 - xxhash: 3.4.1 - yarl: 1.9.4 - yaspin: 3.0.1 - zipp: 3.18.1 - zstd: 1.5.5.1
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.10.4
  - release: 5.14.0-284.55.1.el9_2.x86_64
  - version: #1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024

More info
No response