alexanderswerdlow opened 1 week ago
Hi @alexanderswerdlow
Can you show an example that produces this error? I am not aware of any serious issue regarding dataloading in Lightning. From the error message, we can see that the dataloader worker failed and that it hit a CUDA initialization error. Using CUDA operations in your dataloading workers is neither supported nor recommended by PyTorch, so we would naturally expect to see issues with or without Lightning involved.
To be able to help you, I would need to see some evidence that the issue is caused by Lightning, and some code to work with to isolate the cause of it.
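One quick way to test that assumption is a worker_init_fn that fails loudly if a worker process starts with CUDA already initialized, which is the classic fork-after-CUDA-init failure mode. A minimal sketch, with a toy dataset standing in for the real one:

import torch
from torch.utils.data import DataLoader, TensorDataset

def assert_no_cuda(worker_id: int) -> None:
    # Runs once inside each worker process. With the default "fork" start
    # method on Linux, a CUDA context created in the parent before the
    # workers start is inherited here and cannot be used safely.
    assert not torch.cuda.is_initialized(), (
        f"worker {worker_id} started with CUDA already initialized"
    )

dataset = TensorDataset(torch.randn(16, 3))  # toy stand-in
loader = DataLoader(dataset, batch_size=4, num_workers=2,
                    worker_init_fn=assert_no_cuda)

for _ in loader:  # iterating forces the workers to start and run the check
    pass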
Thanks for responding! I don't have the time to keep debugging and provide a full repro at the moment, but switching to the following works [only needed for the val dataloader]. There are no explicit CUDA operations in my dataloaders [and this bug happens with and without pin_memory=True]. I can confidently say it has happened with a simple torchvision ImageNet dataset.
I should note it also happens on two different machines and in a freshly installed conda env. This strongly suggests it's a Lightning issue. I spent a while digging into how Lightning wraps dataloaders and it errors out around here, but again, not during the sanity check.
These issues [#19763, #17378, #19598] also seem to be discussing the same problem, specifically the last one [#19598], which mentions it only occurring when passing a val dataloader. I should note I am not using torch.compile, and this behavior occurs even when I removed all usages of torchmetrics.
Working [no worker] dataloader:
from torch.utils.data import default_collate

class SimpleDataLoader:
    """Minimal single-process replacement for torch.utils.data.DataLoader."""

    def __init__(self, dataset, batch_size=1, collate_fn=default_collate, **kwargs):
        self.dataset = dataset
        self.batch_size = batch_size
        self.collate_fn = collate_fn
        self.idx = 0

    def __iter__(self):
        # Reset the cursor so the loader can be iterated more than once
        # (e.g. for validation on every epoch).
        self.idx = 0
        return self

    def __next__(self):
        if self.idx >= len(self.dataset):
            raise StopIteration
        batch = []
        for _ in range(self.batch_size):
            if self.idx >= len(self.dataset):
                break
            batch.append(self.dataset[self.idx])
            self.idx += 1
        return self.collate_fn(batch)

    def __len__(self):
        # Number of batches, rounding up for a final partial batch.
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size
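For context, I only swap this in for validation, along these lines (val_dataset and the batch size here are placeholders):

def val_dataloader(self):
    # Single-process fallback; avoids dataloader worker processes entirely.
    return SimpleDataLoader(self.val_dataset, batch_size=32)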
Lightning does not wrap the dataloaders. It only injects a distributed sampler when you are using a torch DataLoader, because that sampler is needed for distributed training. For dataloaders with iterable datasets, Lightning likewise does nothing, because the user has to take care of the implementation.
The issues you linked remain open for the same reason: the users were not able to provide code that reproduces the problem, and it's not possible to investigate based on the error message alone. If we have a reproduction, it will be possible for me or someone from the community to determine the root cause, which is the first step.
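For the record, the sampler injection is also easy to opt out of if you want to rule it out. A rough sketch, assuming it runs where a DDP process group is already initialized (e.g. inside a dataloader hook under strategy="ddp"), with a toy dataset:

import lightning as L
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(64, 3))  # toy stand-in

# Roughly what Lightning does to a map-style DataLoader under DDP:
loader = DataLoader(dataset, batch_size=8,
                    sampler=DistributedSampler(dataset, shuffle=True))

# To keep full control over samplers, disable the injection entirely:
trainer = L.Trainer(use_distributed_sampler=False)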
Bug description
Having a dataloader with >0 workers causes a crash. This behavior occurs with custom datasets, and even with standard Hugging Face and torchvision datasets.
The dataloaders work fine standalone with many workers, and also work fine with accelerate.
The run generally works until the first validation step, at which point it crashes. Interestingly, num_sanity_val_steps works fine [e.g., num_sanity_val_steps=10].

Working version:
Not working:
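A minimal skeleton of the failing shape described above might look like this (hypothetical; toy model and dataset, not the original code):

import lightning as L
import torch
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(3, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).mean()

    def validation_step(self, batch, batch_idx):
        (x,) = batch
        self.log("val_loss", self.layer(x).mean())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

ds = TensorDataset(torch.randn(64, 3))
# num_workers > 0 is the reported trigger; the crash reportedly appears at
# the first real validation pass, not during the sanity check.
train = DataLoader(ds, batch_size=8, num_workers=2)
val = DataLoader(ds, batch_size=8, num_workers=2)

trainer = L.Trainer(max_epochs=1, num_sanity_val_steps=10,
                    accelerator="gpu", devices=1)
trainer.fit(ToyModel(), train, val)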
What version are you seeing the problem on?
v2.2, master
How to reproduce the bug
No response
Error messages and logs
Traceback:
Environment
Current environment
* CUDA:
  - GPU:
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
  - available: True
  - version: 12.1
* Lightning:
  - lightning: 2.3.2
  - lightning-utilities: 0.11.2
  - pytorch-lightning: 2.3.1
  - torch: 2.3.1
  - torch-fidelity: 0.3.0
  - torch-tb-profiler: 0.4.3
  - torchaudio: 2.3.1
  - torchmetrics: 1.4.0.post0
  - torchvision: 0.18.1
  - torchx: 0.6.0
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.10.14
  - release: 4.18.0-372.32.1.el8_6.x86_64

More info
No response
cc @justusschock @awaelchli