Hi @awaelchli, I tried to take a stab at this over the last couple of hours. Here is what I found:
Setting num_workers=0 in both dataloaders makes the script work (without uncommenting the "reset" code).

Adding

for _ in test_loader:
    pass

also makes the script work (again without uncommenting the "reset" code). This leads me to believe that it is a datasets issue.

Replacing test_dataset with train_dataset in test_loader also makes the script work.

None of the above is really an answer, so I looked for the line in TM that causes this. Removing this line fixes the script: https://github.com/Lightning-AI/metrics/blob/78e9571e5e41e8ae924cd10c8200fa5d53d198e4/src/torchmetrics/utilities/distributed.py#L93-L94 However, that is the line that takes care of the distributed synchronization, so it is pretty essential, and there does not seem to be anything wrong with that particular function. That said, I can get the script to work if I manually cast to another dtype, like:
gathered_result = [torch.zeros_like(result).float() for _ in range(world_size)]
torch.distributed.all_gather(gathered_result, result.float(), group)
The default dtype for Accuracy metric states is torch.long, so maybe it is a problem between torch.distributed.all_gather and torch.long on CUDA?
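For reference, here is a minimal, self-contained sketch of that hypothesis in isolation (my own reduction, not the original repro script; it assumes two CUDA devices and the NCCL backend): an all_gather of a torch.long tensor on CUDA.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # torch.long tensor, mirroring the default dtype of the Accuracy states
    result = torch.tensor([rank], dtype=torch.long, device="cuda")
    gathered = [torch.zeros_like(result) for _ in range(world_size)]
    dist.all_gather(gathered, result)  # the call under suspicion
    print(rank, gathered)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)

If the dtype theory holds, swapping the torch.long tensor here for a float one (as in the cast above) should change the behaviour.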
When the script fails, the first part of the traceback I get contains:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fadfd03d86e in /home/nsde/.conda/envs/metrics/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7fadfd0083a8 in /home/nsde/.conda/envs/metrics/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7fae286ed584 in /home/nsde/.conda/envs/metrics/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1ebd5 (0x7fae286c5bd5 in /home/nsde/.conda/envs/metrics/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x265 (0x7fae286c80b5 in /home/nsde/.conda/envs/metrics/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
...
I think the important part is that it fails on a delete operation: c10::cuda::CUDACachingAllocator::raw_delete(void*). My best guess is that the reason why
for attr, default in train_acc._defaults.items():
current_val = getattr(train_acc, attr)
setattr(train_acc, attr, default.to(current_val.device))
fixes the script is that it manually overwrites/deletes the synchronized result (the output of the line that causes the problem) instead of relying on the deallocation that happens at the end of the script. I am not an expert in multiprocessing, so this may be complete gibberish. Based on this issue from torch: https://github.com/pytorch/pytorch/issues/67978 it seems that this error also exists for others, but there is no clarification of what causes it (though some are also indicating that it has to do with a specific combination of dtypes).
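For completeness, the three lines quoted above are essentially what the public Metric.reset() does for the state tensors, so an equivalent (and less intrusive) form of the same workaround should be to call reset() explicitly before the script exits. A sketch, assuming the metric object is named train_acc as in the snippet above:

# free the synchronized CUDA state tensors explicitly,
# instead of relying on deallocation at interpreter shutdown
train_acc.reset()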
Hello @SkafteNicki
Sorry for the (very) late reply. I couldn't spend more time on it, and so it got forgotten. Thank you for documenting this and digging deeper than I could. Nice find with the CUDA caching allocator. I am fine with dropping this investigation, as it is not a high priority and we don't really know what to fix or where. If this happens to more users, we can pick it up again.
Thanks again, your time is appreciated!
Could be related to grpcio 1.53 (https://github.com/ray-project/ray/issues/34194). I faced this same bug and downgrading grpcio to 1.51.3 seems to fix the problem.
I encountered this error as well, and it turned out that torchmetrics was the culprit too. I discovered that by attaching the metrics to the LightningModule and letting Lightning handle the movement to the GPU instead of manually managing it separately, the error magically disappeared! It seems that because torchmetrics.Metric is actually a torch.nn.Module, it needs to be treated as such.
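A rough sketch of that pattern (class and metric names here are illustrative, not taken from the original script): assigning the metric as an attribute of the LightningModule registers it as a submodule, so Lightning moves it to the right device together with the model.

import torch
import torchmetrics
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(32, 2)
        # torchmetrics.Metric is a torch.nn.Module, so attribute assignment
        # registers it and it follows the LightningModule across devices
        self.train_acc = torchmetrics.classification.MulticlassAccuracy(num_classes=2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.train_acc(logits, y)
        self.log("train_acc", self.train_acc, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())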
🐛 Bug
I have a very peculiar bug in which the torchmetrics Accuracy interacts with the tensors in such a way that the DataLoader iterator crashes. A .reset() call on the metric fixes this issue, but I don't understand why.

To Reproduce
I minimized the following code as much as possible. The dataset file needs to be in the CWD: test.csv
Run this script with:
to reproduce the error:
Expected behavior
No crash. In the code above, you will find commented lines for the metric reset.
Why do these lines (which are part of Metric.reset) prevent the dataloader crash?
Observations
The issue only occurs with the device on CUDA and in a distributed setting. I only observed this problem when combining this dataset with torchmetrics. The problem might very well be with HF transformers or datasets, but since torchmetrics is involved, I am not sure where the problem needs to be fixed. The code above is very stupid, but it is the result of minimizing a real training script as much as possible while still reproducing the error.
Environment
How TorchMetrics was installed (conda, pip, build from source): pip, 0.11.1
Additional context