Anyone else?
Please help me!
@zhangke1997 Please run nvidia-smi in a terminal to check the GPU IDs. In my case it was either 0 or 1; in your case you have gpu:0 and gpu:1. Please update the --gpu flag to --gpu 0 or --gpu 1.
python keras_retinanet/bin/train.py --epochs 50 --step 4000 --batch-size 4 --gpu 0 pascal /home/lyp/disk3/zk/code/keras-retinanet/data/VOC2007 should work!
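If in doubt about which IDs TensorFlow can actually use, a quick check (a sketch assuming TF 2.1+, where tf.config.list_physical_devices is available):

```python
# Minimal sketch: list the GPUs TensorFlow has registered. The indices
# printed here are the values the --gpu flag should refer to.
import tensorflow as tf

for i, gpu in enumerate(tf.config.list_physical_devices("GPU")):
    print(f"GPU {i}: {gpu.name}")
```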
OK, thanks! I solved it. I found that tensorflow-gpu == 2.3.0 requires CUDA 10.1, but I had CUDA 10.0. Now it works, thanks!
However, I found that a single GPU's memory is not enough, so I tried to use two GPUs via --multi-gpu-force. I don't know what the potential problems are; when the two GPUs are working, it reports the following:
++++++++++++++++++++++++++++++++++++++++++++++++++
Epoch 1/50
2020-09-11 13:11:53.124561: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-11 13:11:55.405667: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
613/4000 [===>..........................] - ETA: 22:36 - regression_loss: 2.5564 - classification_loss: 0.8669 - loss: 3.4233
2020-09-11 13:16:03.680010: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.96GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-09-11 13:16:03.848352: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.67GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
1251/4000 [========>.....................] - ETA: 17:06 - regression_loss: 2.3190 - classification_loss: 0.7578 - loss: 3.0768
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 200000 batches). You may need to use the repeat() function when building your dataset.
Running network: 100% (2510 of 2510) |#############################################################################################################| Elapsed Time: 0:05:52 Time: 0:05:52
Parsing annotations: 100% (2510 of 2510) |#########################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00
175 instances of class aeroplane with average precision: 0.0107
216 instances of class bicycle with average precision: 0.0136
305 instances of class bird with average precision: 0.0125
190 instances of class boat with average precision: 0.0005
296 instances of class bottle with average precision: 0.0007
141 instances of class bus with average precision: 0.0134
818 instances of class car with average precision: 0.2729
198 instances of class cat with average precision: 0.0882
706 instances of class chair with average precision: 0.0397
171 instances of class cow with average precision: 0.0166
162 instances of class diningtable with average precision: 0.0028
267 instances of class dog with average precision: 0.0641
199 instances of class horse with average precision: 0.0655
197 instances of class motorbike with average precision: 0.0394
2742 instances of class person with average precision: 0.3257
320 instances of class pottedplant with average precision: 0.0006
162 instances of class sheep with average precision: 0.0031
207 instances of class sofa with average precision: 0.0354
170 instances of class train with average precision: 0.0163
176 instances of class tvmonitor with average precision: 0.0058
mAP: 0.0514
Epoch 00001: saving model to ./snapshots/resnet50_pascal_01.h5
1251/4000 [========>.....................] - 833s 666ms/step - regression_loss: 2.3190 - classification_loss: 0.7578 - loss: 3.0768
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Then it quit. My command: python keras_retinanet/bin/train.py --epochs 50 --step 4000 --batch-size 2 --multi-gpu-force --multi-gpu 2 pascal /home/lyp/disk3/zk/code/keras-retinanet/data/VOC2007
Here is the GPUs' state:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:15:00.0 Off |                  N/A |
| 100%   83C    P2  252W / 250W |  10776MiB / 11019MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:2D:00.0  On |                  N/A |
|  18%   51C    P8    2W / 250W |    562MiB / 11011MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:99:00.0 Off |                  N/A |
|  36%   37C    P8    1W / 250W |    310MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Please help me, thank you very much!
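A side note on the "ran out of data" warning in that log: with --step 4000 and --epochs 50, Keras expects steps_per_epoch * epochs = 4000 * 50 = 200000 batches, but the generator was exhausted after only 1251 steps, so training stopped right after saving the first snapshot. If you were feeding a tf.data pipeline directly (a sketch with placeholder arrays, not the keras-retinanet generator itself), repeat() is what the warning suggests:

```python
# Hedged sketch: repeat() makes a tf.data pipeline yield indefinitely so it
# can supply all steps_per_epoch * epochs batches. Dummy arrays stand in for
# real images and annotations.
import numpy as np
import tensorflow as tf

images = np.random.rand(100, 32, 32, 3).astype("float32")
labels = np.random.randint(0, 20, size=(100,)).astype("int64")

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(100)
    .batch(2)
    .repeat()  # without this, 50 batches would exhaust the pipeline mid-epoch
)
# model.fit(dataset, steps_per_epoch=4000, epochs=50) could then draw the
# full 200000 batches it expects.
```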
It seems GPU memory usage is not uniformly distributed: GPU 0 has ~10 GB in use, while GPU 1 and GPU 2 have only ~0.5 GB each.
No, I think even though --multi-gpu 2 is set, only GPU 0 is actually working; the others are not, and I don't know the reason.
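For context, the --multi-gpu path in keras-retinanet wraps the model with Keras's multi_gpu_model (the function named in the ValueError below), which replicates the model on each GPU and splits every batch between the replicas, so --batch-size 2 means one image per GPU. A minimal sketch of that pattern with a toy model (not the actual RetinaNet; the API exists up to TF 2.3 and was removed in TF 2.4):

```python
# Hedged sketch of the multi_gpu_model pattern behind --multi-gpu
# (tf.keras, TF <= 2.3; removed in TF 2.4). Toy model, not RetinaNet.
import tensorflow as tf
from tensorflow.keras.utils import multi_gpu_model

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])

# Replicates the model onto 2 GPUs and splits each incoming batch between
# them. Raises a ValueError if TensorFlow has registered fewer than 2 GPUs,
# which matches the symptom of only GPU 0 doing any work.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer="adam", loss="mse")
```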
This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
1.python keras_retinanet/bin/train.py --epochs 50 --step 4000 --batch-size 4 --gpu 2 pascal /home/lyp/disk3/zk/code/keras-retinanet/data/VOC2007
Training runs, but GPU utilization is 0%.
2.python keras_retinanet/bin/train.py --epochs 50 --step 4000 --batch-size 4 --multi-gpu 2 --multi-gpu-force pascal /home/lyp/disk3/zk/code/keras-retinanet/data/VOC2007/
Training does not run and raises this error:
ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0']. Try reducing `gpus`.
However, I have three 2080 Tis!
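That ValueError means TensorFlow itself registered no GPU devices at all, regardless of how many cards nvidia-smi shows; the usual causes are a CUDA/cuDNN mismatch (like the TF 2.3 / CUDA 10.0 one earlier in this thread) or a CUDA_VISIBLE_DEVICES setting hiding the cards. A diagnostic sketch, assuming TF 2.3 as in the logs above:

```python
# Diagnostic sketch: show which devices TensorFlow registered and which CUDA
# version it was built against. If no 'GPU' entries appear, multi_gpu_model
# will fail exactly as quoted above.
import os
import tensorflow as tf

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Registered devices:", tf.config.list_physical_devices())

# tf.sysconfig.get_build_info() is available from TF 2.3; exact keys can
# vary between versions.
info = tf.sysconfig.get_build_info()
print("Built for CUDA", info.get("cuda_version"), "/ cuDNN", info.get("cudnn_version"))
```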