Hello,

I am migrating from keras-retinanet 0.4.1 (yeah, I know) to 0.5.1. The 0.5.1 release just can't start training for me; it fails with the same problem as described in #1171. The solution there is clumsy (and, as far as I understand, a bit outdated by now anyway), so I tried the latest master branch code instead: there are no errors and I can start training successfully. Strangely, though, training on the same machine with the same data has become slower compared to 0.4.1. I ran the training command with both versions and here are the results.
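The command was the standard keras-retinanet training script, something along these lines (the dataset type and paths below are placeholders, not my actual ones); the identical invocation was used for both versions:

```
# Illustrative only -- the csv dataset type and the paths are placeholders,
# not the actual configuration used for the timings below.
python keras_retinanet/bin/train.py csv /path/to/annotations.csv /path/to/classes.csv
```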
GPU-Z screenshot during 0.4.1 training with tensorflow 1.10.0 and CUDA 9.0:
GPU-Z screenshot during master branch training with tensorflow 2.3.0 and CUDA 10.1:
OS: Windows 10
Python: 3.6.8
Graphics card: Nvidia GeForce GTX 1080 Ti
You can clearly see that the 0.4.1 training utilizes the GPU more effectively (look at the GPU load and board power draw graphs). One training epoch takes 1131 seconds (511 ms/step) with the master branch code and 991 seconds (448 ms/step) with 0.4.1, which is more than a 14% increase in training time. Is this normal behavior, or am I missing something? I imagined that with all the optimization fixes going on in the project, training would be faster, or at least take the same time. If this is a consequence of using non-release code, could you please address the problem from #1171 then?
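In case it helps rule out an environment problem, this is the kind of quick check I can run to confirm that the TF 2.3 install is CUDA-enabled and actually sees the 1080 Ti (standard TensorFlow APIs, nothing keras-retinanet specific):

```
# Sanity check of the TF 2.3 / CUDA 10.1 environment (standard TF APIs).
python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda())"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```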
This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.