WeijingShi / Point-GNN

Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud, CVPR 2020.
MIT License

Facing OOM error for training and slow training #40

Closed Farzin-Negahbani closed 4 years ago

Farzin-Negahbani commented 4 years ago

I have four NVIDIA 1080 Ti GPUs and I'm running the training script with `python train.py configs/car_auto_T3_train_train_config configs/car_auto_T3_train_config --dataset_root_dir KITTI/`, using batch_size=16 and NUM_GPU=4. After a few minutes I get an OOM error with the following stats:

tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit:                 10983519028 
InUse:                  9621110016
MaxInUse:               9992921344
NumAllocs:                   31169
MaxAllocSize:            732806656

I tried batch_size=8 and the same thing happens, but when I choose batch_size=4 it works and allocates around 8.5 GB of memory on each GPU. I also checked that the GPUs are idle before running the script. Since the default batch size is 4 with 2 GPUs, I had the impression that with 4 GPUs I could use batch size 16. Also, with batch size 4 the reported time cost is 298.297750 seconds per epoch, which is slow given the 1717 epochs and will take around 5 days. Is this normal behavior?
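
For reference, the idle check I do before launching is essentially the following (a minimal sketch that just wraps an nvidia-smi query; it is not part of train.py):

```python
import subprocess

# Query per-GPU memory usage; every GPU should report close to 0 MiB used
# before the training script is started.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(out)
```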

WeijingShi commented 4 years ago

Hi @Farzin-Negahbani, thanks for your interest. The default batch size is 4 on two GPUs (two samples per GPU). In your case, batch_size=8 with num_gpu=4 should work. Could you provide more information? During your training, is there any other process already consuming GPU memory (when you check using nvidia-smi)? How many threads does your CPU support? Besides batch_size and num_gpu, are there any other modifications? Thanks,
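
If it helps, something like the snippet below reports both the CPU thread count and the processes currently holding GPU memory (just a quick sketch for gathering that information, not code from this repo):

```python
import os
import subprocess

# Logical CPU threads visible on this machine (relevant for the data-loading workers).
print("CPU threads:", os.cpu_count())

# Processes that currently hold GPU memory; apart from your own job,
# this list should be empty if nothing else is using the GPUs.
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
).stdout)
```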