Anyone else?
Please help me!
@zhangke1997 Please run nvidia-smi in a terminal to check the GPU IDs. In my case it was either 0 or 1; in your case you have gpu:0 and gpu:1. Please update the --gpu flag to --gpu 0 or --gpu 1.
python keras_retinanet/bin/train.py --epochs 50 --step 4000 --batch-size 4 --gpu 0 pascal /home/lyp/disk3/zk/code/keras-retinanet/data/VOC2007 should work!
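If in doubt about which IDs TensorFlow can actually use, a quick check (a sketch assuming TF 2.1+, where tf.config.list_physical_devices is available):

```python
# Minimal sketch: list the GPUs TensorFlow has registered. The indices
# printed here are the values the --gpu flag should refer to.
import tensorflow as tf

for i, gpu in enumerate(tf.config.list_physical_devices("GPU")):
    print(f"GPU {i}: {gpu.name}")
```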
OK, thanks! I solved it. I found that tensorflow-gpu == 2.3.0 requires CUDA 10.1, but I had CUDA 10.0. Now it works, thanks!
However, I found that a single GPU's memory is not enough, so I tried to use two GPUs via --multi-gpu-force. I don't know what the potential problems are; when the two GPUs are working, it reports the following:
++++++++++++++++++++++++++++++++++++++++++++++++++
Epoch 1/50
2020-09-11 13:11:53.124561: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-11 13:11:55.405667: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
613/4000 [===>..........................] - ETA: 22:36 - regression_loss: 2.5564 - classification_loss: 0.8669 - loss: 3.4233
2020-09-11 13:16:03.680010: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.96GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-09-11 13:16:03.848352: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.67GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
1251/4000 [========>.....................] - ETA: 17:06 - regression_loss: 2.3190 - classification_loss: 0.7578 - loss: 3.0768
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 200000 batches). You may need to use the repeat() function when building your dataset.
Running network: 100% (2510 of 2510) |#############################################################################################################| Elapsed Time: 0:05:52 Time: 0:05:52
Parsing annotations: 100% (2510 of 2510) |#########################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00
175 instances of class aeroplane with average precision: 0.0107
216 instances of class bicycle with average precision: 0.0136
305 instances of class bird with average precision: 0.0125
190 instances of class boat with average precision: 0.0005
296 instances of class bottle with average precision: 0.0007
141 instances of class bus with average precision: 0.0134
818 instances of class car with average precision: 0.2729
198 instances of class cat with average precision: 0.0882
706 instances of class chair with average precision: 0.0397
171 instances of class cow with average precision: 0.0166
162 instances of class diningtable with average precision: 0.0028
267 instances of class dog with average precision: 0.0641
199 instances of class horse with average precision: 0.0655
197 instances of class motorbike with average precision: 0.0394
2742 instances of class person with average precision: 0.3257
320 instances of class pottedplant with average precision: 0.0006
162 instances of class sheep with average precision: 0.0031
207 instances of class sofa with average precision: 0.0354
170 instances of class train with average precision: 0.0163
176 instances of class tvmonitor with average precision: 0.0058
mAP: 0.0514
Epoch 00001: saving model to ./snapshots/resnet50_pascal_01.h5
1251/4000 [========>.....................] - 833s 666ms/step - regression_loss: 2.3190 - classification_loss: 0.7578 - loss: 3.0768
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Then it quit. My command: python keras_retinanet/bin/train.py --epochs 50 --step 4000 --batch-size 2 --multi-gpu-force --multi-gpu 2 pascal /home/lyp/disk3/zk/code/keras-retinanet/data/VOC2007
Here is the GPUs' state:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:15:00.0 Off |                  N/A |
| 100%   83C    P2  252W / 250W |  10776MiB / 11019MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:2D:00.0  On |                  N/A |
|  18%   51C    P8    2W / 250W |    562MiB / 11011MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:99:00.0 Off |                  N/A |
|  36%   37C    P8    1W / 250W |    310MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Please help me, thank you very much!
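A side note on the "ran out of data" warning in that log: with --step 4000 and --epochs 50, Keras expects steps_per_epoch * epochs = 4000 * 50 = 200000 batches, but the generator was exhausted after only 1251 steps, so training stopped right after saving the first snapshot. If you were feeding a tf.data pipeline directly (a sketch with placeholder arrays, not the keras-retinanet generator itself), repeat() is what the warning suggests:

```python
# Hedged sketch: repeat() makes a tf.data pipeline yield indefinitely so it
# can supply all steps_per_epoch * epochs batches. Dummy arrays stand in for
# real images and annotations.
import numpy as np
import tensorflow as tf

images = np.random.rand(100, 32, 32, 3).astype("float32")
labels = np.random.randint(0, 20, size=(100,)).astype("int64")

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(100)
    .batch(2)
    .repeat()  # without this, 50 batches would exhaust the pipeline mid-epoch
)
# model.fit(dataset, steps_per_epoch=4000, epochs=50) could then draw the
# full 200000 batches it expects.
```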
It seems GPU memory usage is not uniformly distributed: GPU 0 has ~10 GB in use, while GPU 1 and GPU 2 have only ~0.5 GB each.
No, I think even though --multi-gpu 2 is set, only GPU 0 is actually working; the others are not, and I don't know the reason.
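For context, the --multi-gpu path in keras-retinanet wraps the model with Keras's multi_gpu_model (the function named in the ValueError below), which replicates the model on each GPU and splits every batch between the replicas, so --batch-size 2 means one image per GPU. A minimal sketch of that pattern with a toy model (not the actual RetinaNet; the API exists up to TF 2.3 and was removed in TF 2.4):

```python
# Hedged sketch of the multi_gpu_model pattern behind --multi-gpu
# (tf.keras, TF <= 2.3; removed in TF 2.4). Toy model, not RetinaNet.
import tensorflow as tf
from tensorflow.keras.utils import multi_gpu_model

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])

# Replicates the model onto 2 GPUs and splits each incoming batch between
# them. Raises a ValueError if TensorFlow has registered fewer than 2 GPUs,
# which matches the symptom of only GPU 0 doing any work.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer="adam", loss="mse")
```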
This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
1.python keras_retinanet/bin/train.py --epochs 50 --step 4000 --batch-size 4 --gpu 2 pascal /home/lyp/disk3/zk/code/keras-retinanet/data/VOC2007
Training runs, but GPU utilization is 0%.
2.python keras_retinanet/bin/train.py --epochs 50 --step 4000 --batch-size 4 --multi-gpu 2 --multi-gpu-force pascal /home/lyp/disk3/zk/code/keras-retinanet/data/VOC2007/
Training does not run and raises this error:
ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0']. Try reducing `gpus`.
However, I have three 2080 Tis!
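That ValueError means TensorFlow itself registered no GPU devices at all, regardless of how many cards nvidia-smi shows; the usual causes are a CUDA/cuDNN mismatch (like the TF 2.3 / CUDA 10.0 one earlier in this thread) or a CUDA_VISIBLE_DEVICES setting hiding the cards. A diagnostic sketch, assuming TF 2.3 as in the logs above:

```python
# Diagnostic sketch: show which devices TensorFlow registered and which CUDA
# version it was built against. If no 'GPU' entries appear, multi_gpu_model
# will fail exactly as quoted above.
import os
import tensorflow as tf

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Registered devices:", tf.config.list_physical_devices())

# tf.sysconfig.get_build_info() is available from TF 2.3; exact keys can
# vary between versions.
info = tf.sysconfig.get_build_info()
print("Built for CUDA", info.get("cuda_version"), "/ cuDNN", info.get("cudnn_version"))
```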