facebookresearch / detr

End-to-End Object Detection with Transformers

continuously growing memory #602

Open anonymoussss opened 10 months ago

anonymoussss commented 10 months ago

Hi, I am training DETR on the COCO dataset with the default training script:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco

But after training for a few epochs, it reports the following error:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).  
... ...
RuntimeError: DataLoader worker (pid 8686) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
... ...
RuntimeError: DataLoader worker (pid(s) 8686) exited unexpectedly
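
For context on what the error refers to: each DataLoader worker hands finished batches back to the main process through shared memory (/dev/shm), so the shared-memory footprint scales with the number of workers and the amount of prefetching. The sketch below is illustrative only, not DETR's actual configuration; the file_system sharing strategy and the small num_workers value are assumptions that are commonly tried when this bus error shows up.

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def build_loader() -> DataLoader:
    # Tensors produced by worker processes are shared with the main process
    # via /dev/shm by default ('file_descriptor' strategy); 'file_system' is
    # a frequently used workaround when shared memory runs out.
    mp.set_sharing_strategy("file_system")

    dataset = TensorDataset(torch.randn(64, 3, 224, 224))  # stand-in dataset
    return DataLoader(
        dataset,
        batch_size=2,
        num_workers=2,             # fewer workers -> less shared memory in flight
        pin_memory=True,
        persistent_workers=False,  # workers are recreated each epoch
    )

if __name__ == "__main__":
    for (batch,) in build_loader():
        pass  # the training step would run here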

I checked the memory usage with free -h and found that it kept increasing during training until the crash. How can I solve this problem?
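
free -h only shows the machine-wide picture; a per-process view makes it easier to see whether the main training process or the DataLoader workers are the ones growing. A minimal logging sketch, assuming the psutil package is installed (it is not part of DETR or its requirements):

import psutil

def log_memory(tag: str) -> None:
    # RSS of the training process and of its DataLoader worker children,
    # plus the fill level of the shared-memory filesystem.
    main = psutil.Process()
    workers = main.children(recursive=True)
    main_gb = main.memory_info().rss / 1024**3
    worker_gb = sum(w.memory_info().rss for w in workers) / 1024**3
    shm = psutil.disk_usage("/dev/shm")
    print(f"[{tag}] main={main_gb:.2f} GiB  workers={worker_gb:.2f} GiB  "
          f"shm_used={shm.used / 1024**3:.2f} GiB")

# e.g. call log_memory(f"epoch {epoch}") at the end of every training epoch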

My machine has 256 GB of memory and 8 T4 GPUs. I run the training script in a Docker container with --shm-size 256G, CUDA 11.7, Python 3.8.5, torch 2.0.1, torchvision 0.15.2.
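
Since the error message points at shared memory, it may also be worth confirming how large /dev/shm actually is inside the container (that is the mount Docker's --shm-size flag controls). A small standard-library sketch:

import os

st = os.statvfs("/dev/shm")            # the shared-memory filesystem
total_gb = st.f_frsize * st.f_blocks / 1024**3
free_gb = st.f_frsize * st.f_bavail / 1024**3
print(f"/dev/shm: total={total_gb:.1f} GiB  free={free_gb:.1f} GiB")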