Open knoriy opened 1 year ago
Torchdata requires extra setup and shutdown calls that Lightning doesn't do for you at the moment: https://github.com/Lightning-AI/lightning/issues/16603. This might be what's causing the issue.
So using torchdata with lightning is currently unexplored territory. It would be welcome if you find out what's wrong or want to contribute fixes to the integration.
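For context, a minimal sketch of the kind of lifecycle handling meant here, assuming DataLoader2 is driven by hand outside Lightning; the pipeline is a placeholder, and the point is the explicit shutdown() call that Lightning's loops do not issue for you:

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

if __name__ == "__main__":
    # Placeholder pipeline; any sharded IterDataPipe would do here.
    datapipe = IterableWrapper(range(10)).sharding_filter()
    dl = DataLoader2(datapipe, reading_service=MultiProcessingReadingService(num_workers=2))
    try:
        for batch in dl:
            pass  # training/validation work would happen here
    finally:
        dl.shutdown()  # teardown call that Lightning does not make for you at the moment
```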
Thank you for the comment. I'll look into it. If I solve it or find anything meaningful, I'll open a pull request.
Hi @knoriy, did you solve this? I have had the same issue since March.
I've seen issues that stem from using datapipes with the old DataLoader class. Maybe using DataLoader2 from torchdata helps.
@carmocca for me it crashed randomly after saving a checkpoint; sometimes it crashed, sometimes it didn't.
> .sharding_filter()\
> .open_files_by_fsspec(mode='rb')\
> .load_from_tar() \

A workaround that's worked for me is to move sharding_filter below load_from_tar. It's not ideal because you are loading data without sharding, but it fixed most of the issues.
Try this:
```python
def _create_pipeline(self, data_dir):
    datapipe = torchdata.datapipes.iter.IterableWrapper(data_dir)\
        .shuffle()\
        .open_files_by_fsspec(mode='rb')\
        .load_from_tar()\
        .sharding_filter()\
        .batch(2)\
        .map(self.to_sampels)
    return datapipe
```
> I've seen issues that stem from using datapipes with the old DataLoader class. Maybe using DataLoader2 from torchdata helps.
For me, dataloader2 causes issues when using Reading Services; it leads to freezing and worse performance. The classic dataloader worked best for me when using PL and TorchData.
cc @ejguan
I think the main problem is unbalanced data sharding across distributed ranks, which causes hanging.
You can always attach a fullsync DataPipe at the end of your pipeline.
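For reference, a minimal sketch of what that could look like, reusing the pipeline shape from the snippet above and appending torchdata's FullSync datapipe via its functional form .fullsync():

```python
import torchdata

def _create_pipeline(self, data_dir):
    # Same pipeline as above, with fullsync appended so that every distributed
    # rank stops after the same number of batches instead of hanging in a collective.
    datapipe = torchdata.datapipes.iter.IterableWrapper(data_dir)\
        .shuffle()\
        .open_files_by_fsspec(mode='rb')\
        .load_from_tar()\
        .sharding_filter()\
        .batch(2)\
        .map(self.to_sampels)\
        .fullsync()
    return datapipe
```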
> For me, dataloader2 causes issues when using Reading Services; it leads to freezing and worse performance. The classic dataloader worked best for me when using PL and TorchData.
Can you please shed more light on this? In theory, and based on our benchmarking, DataLoader2 should perform better than DataLoader.
> Can you please shed more light on this? In theory, and based on our benchmarking, DataLoader2 should perform better than DataLoader.
Thank you, I'll try adding fullsync with dataloader2.
Feel free to ask for anything I miss here: The cluster manager is Slurm, using openmpi; PL version 1.9.x. The data is streamed from cloud storage using fsspec. dataloader2 uses both DistributedReadingService and MultiProcessingReadingService (a sketch of this combination follows below). I haven't tested these extensively, but from my observations DataLoader is about ~1.5x to ~2x faster, it seems to play better with PL, and scaling is more consistent. Adding more GPUs when using DataLoader2 was slower for me.
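For reference, a sketch of how those two reading services can be combined, assuming a torchdata release that ships SequentialReadingService (newer than the 0.5.1 listed below); make_dataloader2 is a hypothetical helper, and the distributed service is listed first so sharding happens before per-rank worker multiprocessing:

```python
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)

def make_dataloader2(datapipe, num_workers: int = 4) -> DataLoader2:
    # Distributed sharding across ranks first, then worker multiprocessing per rank.
    reading_service = SequentialReadingService(
        DistributedReadingService(),
        MultiProcessingReadingService(num_workers=num_workers),
    )
    return DataLoader2(datapipe, reading_service=reading_service)
```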
Other observations: .shuffle() placed after .load_from_tar() is extremely slow; reducing buffer_size helps (see the snippet below).

@ejguan Does the order of reading services matter?
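A hedged illustration of the buffer_size point: shuffling after .load_from_tar() buffers whole decoded samples (the shuffler's default buffer holds 10000 elements), so shrinking it trades shuffle quality for throughput and memory. The shard path is a placeholder:

```python
import torchdata.datapipes.iter as dpi

datapipe = (
    dpi.IterableWrapper(["data/shard-000.tar"])  # placeholder local shard
    .open_files_by_fsspec(mode="rb")
    .load_from_tar()
    .shuffle(buffer_size=256)  # much smaller than the default buffer
    .sharding_filter()
)
```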
I am having an issue of very slow training after something on the cluster I am using got updated, which I am trying to figure out with the admins, but I can see there are some differences in the logs I am getting. In particular, I am receiving logs very similar to those in this post; my NCCL logs are:
NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
NCCL INFO NET/OFI Configuring AWS-specific options
NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
NCCL INFO NET/OFI Running on p4d.24xlarge platform, NCCL_TOPO_FILE environment variable is already set to /usr/local/cuda-11.3/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
However, previously I was getting these logs:
NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-11.3/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
NCCL INFO NET/OFI Selected Provider is efa
Can the change in aws-ofi-nccl version from 1.4.0aws to 1.5.0aws have caused the issue? Also, what does "(found 4 nics)" in the last line of the new logs mean? It is not present in the old logs.
Update:
I've been stepping through the PL code; the freeze looks to happen in the Closure class (pytorch_lightning.loops.optimization.automatic.Closure), more specifically at line 137 and line 141, when self._result.loss is accessed.
Further notes and things that may help isolate this issue: sync_dist=True also causes the freezing (a minimal example of this usage is sketched below).
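For clarity, a minimal hypothetical module showing the sync_dist usage referred to here; with DDP, sync_dist=True makes the log call reduce the value across ranks, which can hang if one rank runs fewer steps than the others:

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()
        # sync_dist=True triggers a collective reduction of the logged value;
        # if ranks see unequal numbers of batches, this call can block forever.
        self.log("train_loss", loss, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)
```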
Bug description
Training freezes when using ddp on a Slurm cluster (dp runs as expected). The dataset is loaded via torchdata from an S3 bucket; similar behaviour also arises when using webdataset (a sketch of this setup is shown below). Possibly a linked issue: https://github.com/Lightning-AI/lightning/issues/16893#issue-1602261381
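A hedged sketch of the setup being described; the bucket path and batch handling are placeholders, not the actual project code:

```python
import pytorch_lightning as pl
import torchdata
from torch.utils.data import DataLoader

datapipe = (
    torchdata.datapipes.iter.IterableWrapper(["s3://my-bucket/shards/"])  # placeholder bucket
    .list_files_by_fsspec()
    .shuffle()
    .open_files_by_fsspec(mode="rb")
    .load_from_tar()
    .sharding_filter()
    .batch(2)
)
loader = DataLoader(datapipe, batch_size=None, num_workers=4)

trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(model, train_dataloaders=loader)  # `model` is the user's LightningModule
```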
Error: No error is thrown.
UPDATE: Removing val_step and test_step from pl.LightningModule gives us the following:
How to reproduce the bug
Error messages and logs
Environment
Current environment
```
* CUDA: - GPU: - NVIDIA A100-SXM4-40GB - NVIDIA A100-SXM4-40GB - available: True - version: 11.7
* Lightning: - lightning-utilities: 0.8.0 - pytorch-lightning: 1.9.4 - torch: 1.13.1 - torchaudio: 0.12.1 - torchdata: 0.5.1 - torchmetrics: 0.9.3
* Packages: - absl-py: 1.2.0 - aiobotocore: 2.4.2 - aiohttp: 3.8.3 - aioitertools: 0.11.0 - aiosignal: 1.2.0 - appdirs: 1.4.4 - async-timeout: 4.0.2 - attrs: 22.1.0 - audioread: 3.0.0 - botocore: 1.27.59 - braceexpand: 0.1.7 - cachetools: 5.2.0 - certifi: 2022.9.24 - cffi: 1.15.1 - charset-normalizer: 2.1.1 - click: 8.1.3 - contourpy: 1.0.5 - cycler: 0.11.0 - decorator: 5.1.1 - deepspeed: 0.8.2 - docker-pycreds: 0.4.0 - filelock: 3.8.0 - fonttools: 4.37.4 - frozenlist: 1.3.1 - fsspec: 2023.3.0 - gitdb: 4.0.10 - gitpython: 3.1.31 - google-auth: 2.12.0 - google-auth-oauthlib: 0.4.6 - grpcio: 1.49.1 - hjson: 3.1.0 - huggingface-hub: 0.10.0 - idna: 3.4 - importlib-metadata: 5.0.0 - inflect: 6.0.0 - jmespath: 1.0.1 - joblib: 1.2.0 - kiwisolver: 1.4.4 - librosa: 0.9.2 - lightning-utilities: 0.8.0 - llvmlite: 0.39.1 - markdown: 3.4.1 - markupsafe: 2.1.1 - matplotlib: 3.6.0 - mkl-fft: 1.3.1 - mkl-random: 1.2.2 - mkl-service: 2.4.0 - more-itertools: 8.14.0 - multidict: 6.0.2 - ninja: 1.11.1 - numba: 0.56.2 - numpy: 1.23.1 - nvidia-cublas-cu11: 11.10.3.66 - nvidia-cuda-nvrtc-cu11: 11.7.99 - nvidia-cuda-runtime-cu11: 11.7.99 - nvidia-cudnn-cu11: 8.5.0.96 - oauthlib: 3.2.1 - packaging: 21.3 - pathtools: 0.1.2 - pillow: 9.2.0 - pip: 22.2.2 - pooch: 1.6.0 - portalocker: 2.7.0 - protobuf: 3.19.6 - psutil: 5.9.4 - py-cpuinfo: 9.0.0 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pycparser: 2.21 - pydantic: 1.10.2 - pydeprecate: 0.3.2 - pyparsing: 3.0.9 - python-dateutil: 2.8.2 - pytorch-lightning: 1.9.4 - pyyaml: 6.0 - regex: 2022.9.13 - requests: 2.28.1 - requests-oauthlib: 1.3.1 - resampy: 0.4.2 - rsa: 4.9 - s3fs: 2023.3.0 - scikit-learn: 1.1.2 - scipy: 1.9.1 - sentry-sdk: 1.16.0 - setproctitle: 1.3.2 - setuptools: 59.8.0 - six: 1.16.0 - smmap: 5.0.0 - soundfile: 0.11.0 - tensorboard: 2.10.1 - tensorboard-data-server: 0.6.1 - tensorboard-plugin-wit: 1.8.1 - threadpoolctl: 3.1.0 - tokenizers: 0.12.1 - torch: 1.13.1 - torchaudio: 0.12.1 - torchdata: 0.5.1 - torchmetrics: 0.9.3 - tqdm: 4.64.1 - transformers: 4.22.2 - typing-extensions: 4.3.0 - unidecode: 1.3.6 - urllib3: 1.26.12 - wandb: 0.13.11 - webdataset: 0.2.26 - werkzeug: 2.2.2 - wheel: 0.37.1 - wrapt: 1.15.0 - yarl: 1.8.1 - zipp: 3.8.1
* System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.9.13 - version: #23~20.04.1-Ubuntu SMP Thu Aug 18 03:20:14 UTC 2022
```
More info
The model is able to finish an epoch when .sharding_filter() (line 51) is removed, but this results in undesirable behavior: if sharding is turned off, workers will return the same batch multiple times (see the small illustration below).

cc @justusschock @awaelchli
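To illustrate the duplicate-batch point above, a small self-contained example (not the project pipeline): with an IterDataPipe and multiple DataLoader workers, omitting .sharding_filter() makes every worker iterate the full pipeline, so each element comes back once per worker:

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

if __name__ == "__main__":
    dp = IterableWrapper(range(4))  # no .sharding_filter() in the graph
    loader = DataLoader(dp, batch_size=None, num_workers=2)
    print(sorted(int(x) for x in loader))  # [0, 0, 1, 1, 2, 2, 3, 3]
```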