experiencor / keras-yolo3

Training and Detecting Objects with YOLO3
MIT License
1.6k stars 861 forks

Training on images with resolution 2288 × 1770 floods GPU memory even at batch_size=8 #175

Open hyejung-rachel opened 5 years ago

hyejung-rachel commented 5 years ago

```
2019-03-12 15:23:38.002612: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 14.42GiB
2019-03-12 15:23:38.002618: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit:        15866508084
InUse:        15487862272
MaxInUse:     15501256704
NumAllocs:           2093
MaxAllocSize:  2034522624

2019-03-12 15:23:38.002688: W tensorflow/core/common_runtime/bfc_allocator.cc:279] *******
2019-03-12 15:23:38.002714: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[4,32,1952,1952] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Using TensorFlow backend.
/anaconda/envs/py35/lib/python3.5/site-packages/keras/callbacks.py:999: UserWarning: epsilon argument is deprecated and will be removed, use min_delta instead.
  warnings.warn('epsilon argument is deprecated and '
Traceback (most recent call last):
  File "train.py", line 280, in <module>
    _main_(args)
  File "train.py", line 257, in _main_
    max_queue_size = 8
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 1415, in fit_generator
    initial_epoch=initial_epoch)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/training_generator.py", line 213, in fit_generator
    class_weight=class_weight)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 1215, in train_on_batch
    outputs = self.train_function(ins)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2666, in __call__
    return self._call(inputs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2636, in _call
    fetched = self._callable_fn(*array_vals)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1454, in __call__
    self._session._session, self._handle, args, status, None)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,32,1952,1952] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: replica_0/model_1/leaky_0/LeakyRelu/mul = Mul[T=DT_FLOAT, _class=["loc:@train.../Reshape_1"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/leaky_80/LeakyRelu/alpha, replica_0/model_1/bnorm_0/cond/Merge)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```

It is not possible to reduce the resolution of the images I am training on. Can I still train with this model? Reducing the batch size doesn't help, and neither does reducing the amount of data. Thank you!
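For scale: the single activation tensor the allocator fails on, shape `[4,32,1952,1952]` in float32 per the log above, already needs close to 2 GiB on its own, before weights, gradients, and every other activation. A quick back-of-the-envelope check (plain Python, no TensorFlow needed):

```python
# Memory for one activation tensor of shape [4, 32, 1952, 1952] in float32
# (4 bytes per element), as reported in the OOM message above.
shape = (4, 32, 1952, 1952)
num_elements = 1
for dim in shape:
    num_elements *= dim

bytes_needed = num_elements * 4           # float32 = 4 bytes per element
print(f"{bytes_needed} bytes = {bytes_needed / 2**30:.2f} GiB")
# -> 1950875648 bytes = 1.82 GiB
```

The `1952` spatial dimension suggests the network input is being set near the native image resolution, which is why the cost explodes regardless of batch size.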

sahilrider commented 5 years ago

What's your system configuration?

Alex-Naxitus commented 5 years ago

Hello, I'm facing the same problem when training on the kangaroo dataset. Reducing the training batch size from 16 to 8 and then to 4 has not changed anything. System: 8 GB RAM, Intel(R) UHD Graphics 630, GeForce GTX 1050 3 GB (shared memory up to 8 GB), Windows 10 + Anaconda.

Running

```python
import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
```

I get the response `GPU:0 with 2131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)` and:

```
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 18102239670215265869
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 2235275673
locality {
  bus_id: 1
  links {
  }
}
incarnation: 6041356209009565047
physical_device_desc: "device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
```

Thanks for your help and advice.
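One thing worth trying on a small GPU like the 1050 is letting TensorFlow allocate GPU memory on demand instead of grabbing it all up front. This is a sketch using the TF 1.x / standalone-Keras API that matches the versions in the logs above, not something provided by this repo, and it only helps with fragmentation and premature allocation, not a model that genuinely exceeds GPU memory:

```python
# TF 1.x-era sketch: enable incremental GPU memory allocation.
# Run this before building/compiling the model.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # grow allocation as needed
K.set_session(tf.Session(config=config))
```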

Spawnfile commented 4 years ago

I think your model is too heavy to train on your GPU. For a quick comparison: a 1050 Ti was tested with batch_size=1 on a dataset of 9,500 pictures; batch sizes of 2, 4, or 8 might work, maybe.
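If downscaling really is off the table, one common workaround is to train on fixed-size crops of the large images rather than the full frames. This is not part of this repo; it is a hypothetical NumPy sketch (the `tile_image` helper and the 416 tile size are assumptions, 416 being a typical YOLOv3 input size), and note that bounding-box labels would also have to be remapped into each tile's coordinates, which the sketch does not do:

```python
import numpy as np

def tile_image(img, tile=416):
    """Split an H x W x C image into (y, x, crop) tuples of tile x tile crops.
    Edge tiles are zero-padded so every crop has the same shape."""
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            crop = np.zeros((tile, tile) + img.shape[2:], dtype=img.dtype)
            patch = img[y:y + tile, x:x + tile]
            crop[:patch.shape[0], :patch.shape[1]] = patch
            tiles.append((y, x, crop))
    return tiles

# A 2288 x 1770 image (the resolution from this issue) yields 6 x 5 = 30 tiles.
img = np.zeros((2288, 1770, 3), dtype=np.uint8)
print(len(tile_image(img)))  # 30
```

Each 416-pixel crop then fits the network at its normal input size, so memory use no longer depends on the original resolution.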