Closed: HsiaoCH closed this issue 5 years ago
Thank you for providing DSFD_tensorflow. I have some problems during training on WIDER FACE.

I ran
python prepare_wider_data.py
to produce train.txt and val.txt, and then ran python train.py,
but it failed with an out-of-memory error:

2019-05-22 09:33:16.310098: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 53.70M (56313088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-05-22 09:33:16.311168: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 53.70M (56313088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-05-22 09:33:16.311568: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 53.70M (56313088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node tower_0/ssd/dual/mul_2-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  [[{{node tower_0/gradients/AddN_339}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 7, in <module>
    trainner.train()
  File "/mnt/HDD_1/chinghua_shiaw/DSFD-tensorflow/net_work.py", line 430, in train
    self.train_loop()
  File "/mnt/HDD_1/chinghua_shiaw/DSFD-tensorflow/net_work.py", line 299, in train_loop
    self._train(train_ds,epoch)
  File "/mnt/HDD_1/chinghua_shiaw/DSFD-tensorflow/net_work.py", line 368, in _train
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node tower_0/ssd/dual/mul_2-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  [[node tower_0/gradients/AddN_339 (defined at /mnt/HDD_1/chinghua_shiaw/DSFD-tensorflow/net_work.py:204) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
How can I solve this error? I hope to get your reply as soon as possible. Thanks.
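(Side note on the hint in the traceback: the list of allocated tensors can be printed at OOM time by passing a RunOptions with report_tensor_allocations_upon_oom to session.run. A minimal sketch, assuming the sess.run call in net_work.py can be given an options argument; the session and fetch names here are placeholders, only feed_dict comes from the traceback above:)

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)  # TF 1.x RunOptions proto
sess.run(train_ops, feed_dict=feed_dict, options=run_options)         # pass it to the existing run call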
Hi, I don't know how much memory you have. On a 1080 Ti with 11 GB of memory, resnet50 runs at batch_size 4. You can set config.TRAIN.batch_size in train_config to a smaller value, or decrease the input size, i.e. the three values below (currently 640), to something smaller such as 320, 448, or 512 (see the sketch after the config lines):
config.DATA.hin = 640
config.DATA.win= 640
config.DATA.MAX_SIZE=640
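For example, a reduced configuration might look like this (a sketch only: batch size 2 and input size 512 are illustrative values, while the option names are the ones quoted from train_config above):

config.TRAIN.batch_size = 2   # smaller batch so the activations fit in GPU memory
config.DATA.hin = 512         # reduced input height
config.DATA.win = 512         # reduced input width
config.DATA.MAX_SIZE = 512    # reduced maximum image size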
My GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| 35% 51C P0 69W / 175W | 807MiB / 7949MiB | 18% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2050 G /usr/lib/xorg/Xorg 24MiB |
| 0 2095 G /usr/bin/gnome-shell 11MiB |
| 0 2391 G /usr/lib/xorg/Xorg 312MiB |
| 0 2504 G /usr/bin/gnome-shell 148MiB |
| 0 4823 G /snap/pycharm-community/128/jre64/bin/java 3MiB |
| 0 15425 G ...uest-channel-token=18270327324524076502 98MiB |
| 0 26227 G /usr/lib/xorg/Xorg 42MiB |
| 0 26358 G /usr/bin/gnome-shell 151MiB |
+-----------------------------------------------------------------------------+
8 GB is insufficient for resnet50. I suggest you try MobileNet instead: adjust the corresponding lines in train_config (roughly as sketched below) and download the pretrained model from http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.5_160.tgz
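A rough sketch of the kind of change this means in train_config (the checkpoint prefix assumes the downloaded .tgz is extracted in the project directory, and the backbone-switch key name is hypothetical, so check train_config for the actual option names):

# config.MODEL.net_structure = 'MobilenetV1'                  # hypothetical key; check train_config for the real backbone switch
config.MODEL.pretrained_model = 'mobilenet_v1_0.5_160.ckpt'   # checkpoint prefix from the extracted archive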
Training now starts successfully with these settings:
config.DATA.hin = 512
config.DATA.win= 512
config.DATA.MAX_SIZE=512
config.MODEL.pretrained_model='resnet_v1_50.ckpt'
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| 55% 68C P2 74W / 175W | 7894MiB / 7949MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Training process:
[2019-05-22 11:16:18,535] [INFO] epoch 0: iter 770, total_loss=17.277601 reg_loss=4.938671 cla_loss=4.456257 l2_loss=7.882674 learning rate =1.000000e-05 (7.5 examples/sec; 0.532 sec/batch) fetch data time = 0.005221run time = 0.532418
[2019-05-22 11:16:24,065] [INFO] epoch 0: iter 780, total_loss=16.717018 reg_loss=4.731086 cla_loss=4.103269 l2_loss=7.882665 learning rate =1.000000e-05 (7.4 examples/sec; 0.542 sec/batch) fetch data time = 0.005099run time = 0.542384
[2019-05-22 11:16:29,428] [INFO] epoch 0: iter 790, total_loss=16.067827 reg_loss=4.626087 cla_loss=3.559085 l2_loss=7.882656 learning rate =1.000000e-05 (7.6 examples/sec; 0.529 sec/batch) fetch data time = 0.005166run time = 0.528970
[2019-05-22 11:16:34,875] [INFO] epoch 0: iter 800, total_loss=18.018974 reg_loss=5.311465 cla_loss=4.824861 l2_loss=7.882649 learning rate =1.000000e-05 (7.7 examples/sec; 0.519 sec/batch) fetch data time = 0.005323run time = 0.518849
[2019-05-22 11:16:41,888] [INFO] epoch 0: iter 810, total_loss=16.127144 reg_loss=4.542943 cla_loss=3.701558 l2_loss=7.882642 learning rate =1.000000e-05 (6.9 examples/sec; 0.579 sec/batch) fetch data time = 0.006016run time = 0.579158
Is this the right process?
It looks right. Did you change the tensorpack code? Otherwise, it will break when the epoch ends.
Thank you for your reminder and guidance.