facebookresearch / detr

End-to-End Object Detection with Transformers

continuously growing memory #602

Open anonymoussss opened 10 months ago

anonymoussss commented 10 months ago

Hi, I am training DETR on the COCO dataset with the default training script:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco

But after training for a few epochs, it reports the following error:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).  
... ...
RuntimeError: DataLoader worker (pid 8686) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
... ...
RuntimeError: DataLoader worker (pid(s) 8686) exited unexpectedly
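
For context on what the error refers to: each DataLoader worker hands finished batches back to the main process through shared memory (/dev/shm), so the shared-memory footprint scales with the number of workers and the amount of prefetching. The sketch below is illustrative only, not DETR's actual configuration; the file_system sharing strategy and the small num_workers value are assumptions that are commonly tried when this bus error shows up.

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def build_loader() -> DataLoader:
    # Tensors produced by worker processes are shared with the main process
    # via /dev/shm by default ('file_descriptor' strategy); 'file_system' is
    # a frequently used workaround when shared memory runs out.
    mp.set_sharing_strategy("file_system")

    dataset = TensorDataset(torch.randn(64, 3, 224, 224))  # stand-in dataset
    return DataLoader(
        dataset,
        batch_size=2,
        num_workers=2,             # fewer workers -> less shared memory in flight
        pin_memory=True,
        persistent_workers=False,  # workers are recreated each epoch
    )

if __name__ == "__main__":
    for (batch,) in build_loader():
        pass  # the training step would run here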

I checked the memory usage with free -h and found that it kept increasing during training until the crash. How can I solve this problem?
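
free -h only shows the machine-wide picture; a per-process view makes it easier to see whether the main training process or the DataLoader workers are the ones growing. A minimal logging sketch, assuming the psutil package is installed (it is not part of DETR or its requirements):

import psutil

def log_memory(tag: str) -> None:
    # RSS of the training process and of its DataLoader worker children,
    # plus the fill level of the shared-memory filesystem.
    main = psutil.Process()
    workers = main.children(recursive=True)
    main_gb = main.memory_info().rss / 1024**3
    worker_gb = sum(w.memory_info().rss for w in workers) / 1024**3
    shm = psutil.disk_usage("/dev/shm")
    print(f"[{tag}] main={main_gb:.2f} GiB  workers={worker_gb:.2f} GiB  "
          f"shm_used={shm.used / 1024**3:.2f} GiB")

# e.g. call log_memory(f"epoch {epoch}") at the end of every training epoch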

My machine has 256 GB of memory and 8 T4 GPUs. I run the training script in a Docker container with --shm-size 256G, CUDA 11.7, Python 3.8.5, torch 2.0.1, torchvision 0.15.2.
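
Since the error message points at shared memory, it may also be worth confirming how large /dev/shm actually is inside the container (that is the mount Docker's --shm-size flag controls). A small standard-library sketch:

import os

st = os.statvfs("/dev/shm")            # the shared-memory filesystem
total_gb = st.f_frsize * st.f_blocks / 1024**3
free_gb = st.f_frsize * st.f_bavail / 1024**3
print(f"/dev/shm: total={total_gb:.1f} GiB  free={free_gb:.1f} GiB")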