WendongZh / SPL

[IJCAI'21] Code for Context-Aware Image Inpainting with Learned Semantic Priors

What is your GPU configuration #9


all1new commented 2 years ago

RuntimeError: CUDA out of memory. Tried to allocate 3.12 GiB (GPU 0; 10.75 GiB total capacity; 4.94 GiB already allocated; 2.89 GiB free; 6.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1585902) of binary: /home/wyy/anaconda3/envs/SPL/bin/python

Hi, when I train, the above error always occurs. What is your GPU configuration?
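For reference, the max_split_size_mb hint in the error message can be followed like this; a minimal sketch, not part of the SPL code, and the value 128 is just an example:

```python
# Minimal sketch of the error message's hint: PYTORCH_CUDA_ALLOC_CONF is read
# when the CUDA caching allocator is first used, so it must be set at the very
# top of the training script, before any CUDA tensor is created.
# Note: this only reduces fragmentation; it does not lower the model's memory needs.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")  # 128 is an example value

import torch  # imported after the env var so the allocator picks it up
print(torch.cuda.is_available())
```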

WendongZh commented 2 years ago

Thanks for your interest!

As we mentioned in our paper, we use two V100 GPU cards with 32 GB of memory each for training. I remember that a 3090 GPU with 24 GB of memory also works. If you do not have that much memory, I suggest using batchsize=2 or even 1 (a total batch size of 4 or 2 across the two cards) and removing the --with_test parameter (just delete it from the training command). In fact, training with a smaller batch size may even improve the final performance.
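For illustration, here is a generic PyTorch sketch of how the per-card and total batch sizes relate under torch.distributed; this is not the repository's actual data pipeline, and the dataset and sizes are placeholders:

```python
# Generic PyTorch sketch (not the repository's data pipeline): with
# torch.distributed every process/GPU builds its own DataLoader, so batch_size
# below is per card and the effective batch size is batch_size * number_of_GPUs
# (2 per card x 2 cards = 4 in total).
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(16, 3, 256, 256))  # placeholder for the image dataset
# num_replicas/rank are fixed here only so the snippet runs without init_process_group
sampler = DistributedSampler(dataset, num_replicas=2, rank=0)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)  # 2 per card -> 4 overall

for (batch,) in loader:
    print(batch.shape)  # torch.Size([2, 3, 256, 256])
    break
```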

The large memory requirement is mainly due to the pretrained ASL model. If you just want to try our model on your own datasets, you can replace it with a smaller pretrained model, such as VGG-19 (extra modifications are of course needed), or a model pretrained for segmentation, classification, or detection on your own datasets. The performance will drop, but you may still get better results than our baseline RN.
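As a rough illustration of such a swap, the sketch below wraps torchvision's ImageNet-pretrained VGG-19 as a frozen feature extractor; the wrapper name, the chosen cut-off layer (relu4_4), and the input size are assumptions for the example, not code from this repository:

```python
# Rough sketch of replacing the large pretrained prior with torchvision's
# ImageNet-pretrained VGG-19. The wrapper class, the cut-off layer (relu4_4),
# and the 256x256 input size are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision import models

class VGGPrior(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = vgg.features[:27].eval()  # layers up to relu4_4, used as a frozen extractor
        for p in self.features.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, x):
        return self.features(x)

prior = VGGPrior()
feats = prior(torch.randn(1, 3, 256, 256))
print(feats.shape)  # torch.Size([1, 512, 32, 32])
```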

Good luck!

all1new commented 2 years ago

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6585 closing signal SIGTERM ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6584) of binary:

Hi, when I train, the above error always occurs. I have no idea what is wrong and look forward to your reply.

WendongZh commented 2 years ago

Can you show your full training command and full error output? Which GPU are you using?

Reducing the batch size should address the out-of-memory problem.
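If even a batch size of 1 per card does not fit, gradient accumulation is a general PyTorch workaround (not something implemented in this repository) that trades extra steps for memory while keeping the effective batch size:

```python
# Generic gradient-accumulation pattern (not SPL-specific): only micro_batch
# samples are on the GPU at a time, while gradients are averaged over
# accum_steps micro-batches, keeping an effective batch size of
# micro_batch * accum_steps.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)              # stand-in for the real generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
accum_steps = 4                                    # 4 micro-batches of 1 ~= batch size 4

optimizer.zero_grad()
for step in range(accum_steps):
    images = torch.randn(1, 3, 256, 256)           # micro-batch of 1
    loss = nn.functional.l1_loss(model(images), images)
    (loss / accum_steps).backward()                # scale so the accumulated gradient is an average
optimizer.step()
optimizer.zero_grad()
```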

Another possible problem is that, as we mentioned in the README, you currently need at least two GPU cards for training.