610265158 / DSFD-tensorflow

A TensorFlow implementation of the DSFD face detector
112 stars, 31 forks

Insufficient memory #1

Closed HsiaoCH closed 5 years ago

HsiaoCH commented 5 years ago

Thank you for providing DSFD-tensorflow. I have a problem during training. I use WIDER for training: I run python prepare_wider_data.py to produce train.txt and val.txt, then run python train.py, but it fails with an out-of-memory error:

2019-05-22 09:33:16.310098: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 53.70M (56313088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

2019-05-22 09:33:16.311168: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 53.70M (56313088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-05-22 09:33:16.311568: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 53.70M (56313088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node tower_0/ssd/dual/mul_2-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node tower_0/gradients/AddN_339}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 7, in <module>
    trainner.train()
  File "/mnt/HDD_1/chinghua_shiaw/DSFD-tensorflow/net_work.py", line 430, in train
    self.train_loop()
  File "/mnt/HDD_1/chinghua_shiaw/DSFD-tensorflow/net_work.py", line 299, in train_loop
    self._train(train_ds,epoch)
  File "/mnt/HDD_1/chinghua_shiaw/DSFD-tensorflow/net_work.py", line 368, in _train
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node tower_0/ssd/dual/mul_2-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[node tower_0/gradients/AddN_339 (defined at /mnt/HDD_1/chinghua_shiaw/DSFD-tensorflow/net_work.py:204) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

How can I solve this error? I hope to hear from you as soon as possible. Thanks.

610265158 commented 5 years ago

Hi, I don't know how much memory you have. On a 1080 Ti with 11 GB of memory, ResNet-50 can run at batch_size 4. You can set config.TRAIN.batch_size in train_config to a smaller value, or try decreasing the input size by lowering these three values to something smaller, such as 320, 448, or 512:

config.DATA.hin = 640
config.DATA.win = 640
config.DATA.MAX_SIZE = 640
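
A minimal sketch of those edits, assuming train_config builds its settings with easydict as shown later in this thread; the values other than the ones named above are illustrative:

# Hypothetical train_config excerpt: shrink the batch size and input
# resolution so the model fits in a smaller GPU.
from easydict import EasyDict as edict

config = edict()
config.TRAIN = edict()
config.DATA = edict()

config.TRAIN.batch_size = 2      # smaller than the batch of 4 used on an 11 GB card
config.DATA.hin = 512            # input height: try 320, 448, or 512 instead of 640
config.DATA.win = 512            # input width
config.DATA.MAX_SIZE = 512       # maximum input size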

HsiaoCH commented 5 years ago

My GPU

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0  On |                  N/A |
| 35%   51C    P0    69W / 175W |    807MiB /  7949MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2050      G   /usr/lib/xorg/Xorg                            24MiB |
|    0      2095      G   /usr/bin/gnome-shell                          11MiB |
|    0      2391      G   /usr/lib/xorg/Xorg                           312MiB |
|    0      2504      G   /usr/bin/gnome-shell                         148MiB |
|    0      4823      G   /snap/pycharm-community/128/jre64/bin/java     3MiB |
|    0     15425      G   ...uest-channel-token=18270327324524076502    98MiB |
|    0     26227      G   /usr/lib/xorg/Xorg                            42MiB |
|    0     26358      G   /usr/bin/gnome-shell                         151MiB |
+-----------------------------------------------------------------------------+
610265158 commented 5 years ago

8 GB is insufficient for ResNet-50. I suggest you try MobileNet instead; use these lines in train_config:

config.MODEL = edict()
config.MODEL.continue_train = False          ### recover from a trained model
config.MODEL.model_path = './model/'         # save directory
config.MODEL.net_structure = 'MobilenetV1'   ###### options: 'resnet_v1_50', 'resnet_v1_101', 'MobilenetV1'
config.MODEL.pretrained_model = 'mobilenet_v1_0.5_160.ckpt'

Download the pretrained model from http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.5_160.tgz
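
A minimal Python sketch for fetching and unpacking that checkpoint, assuming the files just need to land in the working directory (adjust the destination to wherever train_config expects them):

# Hypothetical download helper for the pretrained MobileNet checkpoint.
import tarfile
import urllib.request

URL = 'http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.5_160.tgz'
ARCHIVE = 'mobilenet_v1_0.5_160.tgz'

urllib.request.urlretrieve(URL, ARCHIVE)    # download the archive
with tarfile.open(ARCHIVE, 'r:gz') as tar:
    tar.extractall('.')                     # extracts the mobilenet_v1_0.5_160.ckpt.* files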

HsiaoCH commented 5 years ago

Training started successfully with:

config.DATA.hin = 512
config.DATA.win = 512
config.DATA.MAX_SIZE = 512
config.MODEL.pretrained_model = 'resnet_v1_50.ckpt'

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0  On |                  N/A |
| 55%   68C    P2    74W / 175W |   7894MiB /  7949MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Training process:

[2019-05-22 11:16:18,535] [INFO] epoch 0: iter 770, total_loss=17.277601 reg_loss=4.938671 cla_loss=4.456257 l2_loss=7.882674 learning rate =1.000000e-05 (7.5 examples/sec; 0.532 sec/batch) fetch data time = 0.005221run time = 0.532418 
[2019-05-22 11:16:24,065] [INFO] epoch 0: iter 780, total_loss=16.717018 reg_loss=4.731086 cla_loss=4.103269 l2_loss=7.882665 learning rate =1.000000e-05 (7.4 examples/sec; 0.542 sec/batch) fetch data time = 0.005099run time = 0.542384 
[2019-05-22 11:16:29,428] [INFO] epoch 0: iter 790, total_loss=16.067827 reg_loss=4.626087 cla_loss=3.559085 l2_loss=7.882656 learning rate =1.000000e-05 (7.6 examples/sec; 0.529 sec/batch) fetch data time = 0.005166run time = 0.528970 
[2019-05-22 11:16:34,875] [INFO] epoch 0: iter 800, total_loss=18.018974 reg_loss=5.311465 cla_loss=4.824861 l2_loss=7.882649 learning rate =1.000000e-05 (7.7 examples/sec; 0.519 sec/batch) fetch data time = 0.005323run time = 0.518849 
[2019-05-22 11:16:41,888] [INFO] epoch 0: iter 810, total_loss=16.127144 reg_loss=4.542943 cla_loss=3.701558 l2_loss=7.882642 learning rate =1.000000e-05 (6.9 examples/sec; 0.579 sec/batch) fetch data time = 0.006016run time = 0.579158 

Is this the right process?

610265158 commented 5 years ago

It looks right. Did you change the tensorpack code? Otherwise it will break when the epoch ends.

HsiaoCH commented 5 years ago

Thank you for your reminder and guidance.