Closed Farzin-Negahbani closed 4 years ago
Hi @Farzin-Negahbani, thanks for your interest. The default batch size is 4 on two GPUs (two samples per GPU), so in your case batch_size=8 with num_gpu=4 should work. Could you provide more information? During your training, is any other process already consuming GPU memory (when you check with nvidia-smi)? How many threads does your CPU support? Besides batch_size and num_gpu, are there any other modifications? Thanks,
I have four NVIDIA 1080 Ti GPUs and I'm running the training script with
python train.py configs/car_auto_T3_train_train_config configs/car_auto_T3_train_config --dataset_root_dir KITTI/
with `batch_size=16` and `NUM_GPU=4`. After a few minutes I get an OOM error with the following stats:

I tried `batch_size=8` and the same thing happens, but when I choose `batch_size=4` it works and allocates around 8.5 GB of memory on each GPU. I also checked that the GPUs were idle before running the script. Since the default batch size is 4 with 2 GPUs, I had the impression that with 4 GPUs I could use a batch size of 16. Also, the time cost at batch size 4 is 298.297750, which is slow for 1717 epochs and would take around 5 days. Is this normal behavior?