Closed · jponnetCytomine closed this 9 months ago
@jponnetCytomine I can't find any commits related to dataloading in Lightning Trainer for 2.1.1.
"Connection reset by peer" means the dataloader worker was killed by something external. Have you allocated enough shared memory in your Docker container? I suspect that "downgrading to 2.1.0 is a solution" might just be a red herring.
@awaelchli Yes, I have even tried with 256GB of shared memory and it was still not working. Here is what I get inside the Docker container when running df -h with 32GB of shared memory from --ipc=host:
```
root@14f601271aa6:/workspace# df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
overlay         491G  132G  334G   29%   /
tmpfs           64M   0     64M    0%    /dev
/dev/vda1       491G  132G  334G   29%   /workspace
tmpfs           32G   0     32G    0%    /dev/shm
tmpfs           16G   12K   16G    1%    /proc/driver/nvidia
udev            16G   0     16G    0%    /dev/nvidia0
tmpfs           16G   0     16G    0%    /proc/asound
tmpfs           16G   0     16G    0%    /proc/acpi
```
Note that I also tried running my docker run command with --shm-size instead of --ipc, but it did not fix anything.
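For reference, a minimal way to double-check from inside the container that /dev/shm never actually saturates while the dataloader workers run is to poll its usage with shutil.disk_usage; a small sketch (the helper name and interval are illustrative, run it in a separate shell alongside training):

```python
# Poll /dev/shm usage so you can see whether shared memory fills up
# right before the dataloader workers crash.
import shutil
import time

def log_shm_usage(interval_s: float = 5.0, mount: str = "/dev/shm") -> None:
    while True:
        total, used, _free = shutil.disk_usage(mount)
        print(f"{mount}: used {used / 2**30:.2f} GiB / {total / 2**30:.2f} GiB")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_shm_usage()
```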
Maybe you are hitting this ominous bug in PyTorch that nobody ever knew how to resolve: https://github.com/Lightning-AI/torchmetrics/issues/1560.
At this point I can only guess. So maybe try a few things like setting persistent_workers=True/False and pin_memory=True/False.
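For concreteness, these are ordinary torch.utils.data.DataLoader keyword arguments; a minimal sketch of where they go (dummy dataset and illustrative values, not your actual setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset so the snippet runs standalone; replace with your own Dataset.
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 2, (64,)))

train_loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=4,             # > 0, so worker processes and shared memory are involved
    persistent_workers=True,   # toggle True/False when debugging
    pin_memory=True,           # toggle True/False when debugging
)
```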
Did changing any of these options help?
Hello, yes, when I set pin_memory=False it works now, but is it a good idea to keep pin_memory set to False? Thank you!
The default in PyTorch is False. It's a bit of a mysterious feature and I don't know much about best practices around it; I've never seen enabling it make a big difference in practice.
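For context, what pin_memory changes is whether the DataLoader returns batches in page-locked (pinned) host memory, which lets the host-to-device copy run asynchronously when combined with non_blocking=True; a short illustrative sketch (dummy dataset, single pass, not tied to this issue's model):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 32, 32)),
    batch_size=8,
    pin_memory=torch.cuda.is_available(),  # pinning only matters for GPU transfers
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for (batch,) in loader:
    # The copy is asynchronous only if the source tensor lives in pinned memory.
    batch = batch.to(device, non_blocking=True)
```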
OK, closing this now since this was unrelated to Lightning. Sorry I couldn't give a clear answer about pin_memory, but turning it off shouldn't impact you negatively.
Bug description
When I have lightning v2.1.1 installed and try to train a RetinaNet model, it raises the following error:
I am using a Docker environment based on pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel with the flag --ipc=host, and I checked that I have enough shared memory (32GB in /dev/shm), which never saturates. I cannot set the number of workers to 0, as training would then take too much time.

After hours of research, I finally found a solution: downgrading the lightning version to v2.1.0. Here is the environment that is not working:

If I do pip install lightning==2.1.0, this error is not raised anymore and I can train my model. Here is how I run my Docker container:

```
docker run --gpus "all" --ipc=host --ulimit memlock=-1 --rm -it my_image:tag
```
What version are you seeing the problem on?
v2.1, master
How to reproduce the bug
No response
Error messages and logs
No response
Environment
Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): LightningApp
#- PyTorch Lightning Version (e.g., 1.5.0): 2.1.3
#- Lightning App Version (e.g., 0.5.2): 2.1.2
#- PyTorch Version (e.g., 2.0): 2.0.1
#- Python version (e.g., 3.9): 3.10
#- OS (e.g., Linux): Linux Ubuntu
#- How you installed Lightning (`conda`, `pip`, source): pip lightning==2.1.2
#- Running environment of LightningApp (e.g. local, cloud): docker env based on `pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel`
```
More info
No response
cc @justusschock @awaelchli