WeijingShi / Point-GNN

Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud, CVPR 2020.
MIT License

Facing OOM error for training and slow training #40

Closed Farzin-Negahbani closed 4 years ago

Farzin-Negahbani commented 4 years ago

I have four NVIDIA 1080 Ti GPUs and I'm running the training script with `python train.py configs/car_auto_T3_train_train_config configs/car_auto_T3_train_config --dataset_root_dir KITTI/`, using batch_size=16 and NUM_GPU=4. After a few minutes I get an OOM error with the following stats:

tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit:                 10983519028 
InUse:                  9621110016
MaxInUse:               9992921344
NumAllocs:                   31169
MaxAllocSize:            732806656

I tried batch_size=8 and the same thing happens, but when I choose batch_size=4 it works and allocates around 8.5 GB of memory on each GPU. I also checked that the GPUs are idle before running the script. Since the default batch size is 4 with 2 GPUs, I had the impression that with 4 GPUs I could use batch size 16. Also, with batch size 4 the reported time cost is 298.297750 seconds per epoch, which is slow given the 1717 epochs and will take around 5 days. Is this normal behavior?
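
For reference, the idle check I do before launching is essentially the following (a minimal sketch that just wraps an nvidia-smi query; it is not part of train.py):

```python
import subprocess

# Query per-GPU memory usage; every GPU should report close to 0 MiB used
# before the training script is started.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(out)
```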

WeijingShi commented 4 years ago

Hi @Farzin-Negahbani, thanks for your interest. The default batch size is 4 on two GPUs (two samples per GPU). In your case, batch_size=8 with num_gpu=4 should work. Could you provide more information? During your training, is there any other process already consuming GPU memory (when you check using nvidia-smi)? How many threads does your CPU support? Besides batch_size and num_gpu, are there any other modifications? Thanks,
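
If it helps, something like the snippet below reports both the CPU thread count and the processes currently holding GPU memory (just a quick sketch for gathering that information, not code from this repo):

```python
import os
import subprocess

# Logical CPU threads visible on this machine (relevant for the data-loading workers).
print("CPU threads:", os.cpu_count())

# Processes that currently hold GPU memory; apart from your own job,
# this list should be empty if nothing else is using the GPUs.
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
).stdout)
```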