Borda / keras-yolo3

A Keras implementation of YOLOv3 (TensorFlow backend), a successor of qqwweee/keras-yolo3
MIT License

training model on second GPU, too low memory #7

Closed Borda closed 5 years ago

Borda commented 5 years ago

Running training on a GPU machine with two physical graphics cards. The first card (index 0) is already ~99% in use by another process, so I set training to use the second one, but somehow this is ignored in the process and it still asks for the default GPU card 0:

export CUDA_VISIBLE_DEVICES=1
python3 scripts/training.py --path_dataset ~/Cache/Project_Video/DATASETS/ppl-detect-v2_temp/dataset.txt --path_weights ./model_data/tiny-yolo.h5 --path_anchors ./model_data/tiny-yolo_anchors.csv --path_output ./model_data --path_config ./model_data/train_tiny-yolo_ppl.yaml
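
Note: with CUDA_VISIBLE_DEVICES=1 exported, TensorFlow re-indexes the visible devices, so inside the process the remaining physical card 1 is reported as /device:GPU:0; the "GPU:0" in the log below therefore does not necessarily mean the busy card 0 is being used. As an alternative to the shell export, a minimal sketch of setting the variable from the script itself (it must be set before TensorFlow is imported):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # hide physical card 0, expose only card 1

import tensorflow as tf                    # import only after the variable is set

# the single visible card is re-indexed, so it shows up as GPU:0 inside the process
print(tf.test.gpu_device_name())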

Failure message:

2019-08-21 00:58:58.957373: W tensorflow/core/common_runtime/bfc_allocator.cc:319] *************************************************************************************************___
2019-08-21 00:58:58.957417: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_ops.cc:486 : Resource exhausted: OOM when allocating tensor with shape[128,104,104,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "scripts/training.py", line 208, in <module>
    _main(**arg_params)
  File "scripts/training.py", line 200, in _main
    callbacks=[tb_logging, checkpoint, reduce_lr, early_stopping])
  File "/home/jb/.local/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/jb/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/jb/.local/lib/python3.6/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/home/jb/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/jb/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/jb/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/jb/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[128,104,104,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node conv2d_3/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[loss_1/add_12/_1041]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[128,104,104,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node conv2d_3/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
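
One thing worth trying alongside a smaller batch is to stop the Keras TensorFlow backend from grabbing the whole card up front; a minimal sketch, assuming the TF 1.x plus standalone Keras combination visible in the traceback paths:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True        # allocate GPU memory on demand instead of all at once
# config.gpu_options.per_process_gpu_memory_fraction = 0.5   # or hard-cap the per-process fraction
K.set_session(tf.Session(config=config))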

The available GPUs according to nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
| 51%   57C    P8    39W / 260W |  10773MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
| 86%   78C    P2   124W / 260W |  10912MiB / 10986MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+
Borda commented 5 years ago

See also:

Borda commented 5 years ago

The first round of failures was solved by lowering the batch size... Later, the issue seems to be linked to unfreezing all layers, see https://github.com/Borda/keras-yolo3/blob/c3339b55d2654faff30f2b802b3dde3d9f41690a/scripts/training.py#L188 (a sketch of that stage follows below).
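
A hypothetical sketch of that stage (names such as model, train_generator, num_train and the callbacks are placeholders, not taken verbatim from the repo): once every layer is unfrozen, gradients and optimizer state exist for all weights, so the memory footprint jumps and the batch size usually has to drop well below the 128 seen in the failing tensor's batch dimension:

for layer in model.layers:
    layer.trainable = True                    # unfreeze the whole network

model.compile(optimizer='adam',               # recompile so the trainable flags take effect
              loss={'yolo_loss': lambda y_true, y_pred: y_pred})  # keras-yolo3-style pass-through loss

batch_size = 8                                # assumed value, far below the original 128
model.fit_generator(train_generator,
                    steps_per_epoch=max(1, num_train // batch_size),
                    epochs=50,
                    callbacks=[checkpoint, reduce_lr, early_stopping])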

Borda commented 5 years ago

In the end, decreasing the batch size helps; see https://github.com/qqwweee/keras-yolo3/issues/284#issuecomment-449748523
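
A back-of-the-envelope check of why the batch size is the right lever here: the failing tensor has shape [128, 104, 104, 64] in float32, i.e. 4 bytes per value.

batch, h, w, c = 128, 104, 104, 64
bytes_per_value = 4                           # float32
size_mib = batch * h * w * c * bytes_per_value / 1024 ** 2
print('single activation tensor: %.0f MiB' % size_mib)   # ~338 MiB at batch 128

# Halving the batch halves this tensor (and every other activation) to roughly 169 MiB,
# which is why a smaller batch fits on a card that nvidia-smi already shows as nearly full.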