fabianbormann / Tensorflow-DeconvNet-Segmentation

Tensorflow implementation of "Learning Deconvolution Network for Semantic Segmentation"

Training Speed #12

Closed. djl11 closed this issue 5 years ago.

djl11 commented 7 years ago

Hi Fabian, sorry this might not entirely constitute an "issue", but I was wondering if it is normal to have training times of 1-5 seconds per step, meaning that 1000 steps of gradient descent take ~1 hour? This seems quite slow compared to other networks I've trained (on my GTX 1080 GPU), particularly with a single image per batch, but perhaps this is entirely down to the large number of parameters in this very deep model. I just wanted a sanity check that everything is installed and running properly on my machine. Thanks!

AngusG commented 7 years ago

This is expected, as the train method is currently implemented in an online manner that fetches a single image at a time with OpenCV, and I doubt it will ever converge as-is. This was discussed out of context in issue #6. I started implementing a pre-fetching pipeline in the prefetch branch, and there is currently a script in master, im_to_tf_records.py, for writing a dataset to a TFRecord file. I expect to finish a pre-fetching mini-batch training implementation later this week.
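For reference, writing the images out to a TFRecord looks roughly like this (a minimal sketch; the feature names and the exact details of im_to_tf_records.py may differ):

```python
import cv2
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_tfrecord(image_paths, label_paths, out_path):
    # Sketch only: serialize each image/label pair as raw bytes into one TFRecord file.
    writer = tf.python_io.TFRecordWriter(out_path)
    for img_path, lbl_path in zip(image_paths, label_paths):
        image = cv2.imread(img_path)
        label = cv2.imread(lbl_path, cv2.IMREAD_GRAYSCALE)
        example = tf.train.Example(features=tf.train.Features(feature={
            'image_raw': _bytes_feature(image.tostring()),
            'label_raw': _bytes_feature(label.tostring()),
        }))
        writer.write(example.SerializeToString())
    writer.close()
```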

AngusG commented 7 years ago

Training a VGG16 for classification in the way described above runs at ~24 images per second on a GTX 680 once the queues are full; a Titan X gives up to 110 img/sec. Learning the deconvolution layers will slow things down a bit, but this gives you a ballpark estimate.

djl11 commented 7 years ago

Great, thanks, good to know! Although I checked with print statements, and all the hanging seems to happen after the cv2.imread calls. For example, using time.time(), the imread calls take ~8 ms to complete, whereas train_step.run() often takes ~1-3 s. Perhaps I have missed something, but does this not mean the slowdown has little to do with the online reading of data? All the networks I have used so far have also accepted numpy arrays as input and seem to run much faster. Either way, it will be interesting to see if your pre-fetching implementation improves things!
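For context, the timing was done roughly like this (a minimal sketch; image_path, x, y and ground_truth are placeholders for the existing training setup):

```python
import time
import cv2

start = time.time()
image = cv2.imread(image_path)  # takes ~8 ms per call on my machine
print('imread: %.3f s' % (time.time() - start))

start = time.time()
# The actual optimization step is where the time goes: often ~1-3 s per step.
train_step.run(feed_dict={x: [image], y: [ground_truth]})
print('train_step.run: %.3f s' % (time.time() - start))
```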

AngusG commented 7 years ago

I just pushed a working mini-batch implementation in this commit. It is still far too slow (1.7 img/sec on a Titan X), but I thought I would push it since it does technically work, and now we can look at optimizing.

Once you have created your TFRecord as per the README, just run python DeconvNetPipeline.py

You should see:

2016-12-07 11:08:21.338887: step 1, loss = 73533241721618432.00 (1.0 examples/sec; 9.821 sec/batch)
2016-12-07 11:08:27.556023: step 2, loss = 56378197448589312.00 (1.6 examples/sec; 6.216 sec/batch)
2016-12-07 11:08:33.398133: step 3, loss = 70547423407112192.00 (1.7 examples/sec; 5.841 sec/batch)
2016-12-07 11:08:39.244268: step 4, loss = 63756148104232960.00 (1.7 examples/sec; 5.845 sec/batch)
2016-12-07 11:08:45.077055: step 5, loss = 60037677088505856.00 (1.7 examples/sec; 5.832 sec/batch)
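For reference, the input side of the pipeline is a queue-based TFRecord reader along these lines (a rough sketch; feature names, image sizes and queue parameters are illustrative and may differ from DeconvNetPipeline.py):

```python
import tensorflow as tf

def input_pipeline(tfrecord_path, batch_size=10):
    # Read serialized examples from the TFRecord file and decode them.
    filename_queue = tf.train.string_input_producer([tfrecord_path])
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(serialized, features={
        'image_raw': tf.FixedLenFeature([], tf.string),
        'label_raw': tf.FixedLenFeature([], tf.string),
    })
    image = tf.reshape(tf.decode_raw(features['image_raw'], tf.uint8), [224, 224, 3])
    label = tf.reshape(tf.decode_raw(features['label_raw'], tf.uint8), [224, 224, 1])
    # Background threads keep the queue full so the GPU is not starved;
    # remember to call tf.train.start_queue_runners() once the session is created.
    images, labels = tf.train.shuffle_batch(
        [image, label], batch_size=batch_size,
        capacity=100, min_after_dequeue=20, num_threads=2)
    return images, labels
```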

djl11 commented 7 years ago

With regards to performance, as far as I can tell, the main bottleneck is the sparse tensor re-order and sparse-to-dense operations, which both appear to run on the CPU. Here is the resulting CUPTI GPU trace for a single training step using the max_unpool method; the gray and purple blocks are the re-order and sparse-to-dense operations respectively. chrome___tracing.pdf
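For context, max-unpooling from argmax indices is commonly implemented with sparse ops along these lines (a hypothetical sketch, not necessarily the exact max_unpool in this repo), which is where the re-order and sparse-to-dense calls come from:

```python
import tensorflow as tf

def unpool_with_argmax(pool, argmax, output_shape):
    # Scatter the pooled values back to the positions recorded by
    # tf.nn.max_pool_with_argmax, using a SparseTensor over the flattened output.
    flat_size = tf.cast(tf.reduce_prod(output_shape), tf.int64)
    indices = tf.reshape(tf.cast(argmax, tf.int64), [-1, 1])
    values = tf.reshape(pool, [-1])
    # These two ops are the ones showing up on the CPU in the trace:
    sparse = tf.sparse_reorder(tf.SparseTensor(indices, values, tf.stack([flat_size])))
    dense = tf.sparse_tensor_to_dense(sparse)
    return tf.reshape(dense, output_shape)
```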

fabianbormann commented 7 years ago

Hi Daniel, I had also suspected that the unpool method is the bottleneck, but I had no idea that the op is only registered for DEVICE_CPU! I am going to open an issue in the tensorflow repo soon. Thanks for sharing that helpful PDF!

djl11 commented 7 years ago

No problem! And in case you weren't aware (I wasn't until yesterday), GPU tracing is pretty easy to set up when optimizing training, as explained in prb12's first comment on this tensorflow issue: https://github.com/tensorflow/tensorflow/issues/1824

And yes, opening a new tensorflow issue sounds like a good idea!
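For anyone who lands here later, the tracing approach from that comment boils down to roughly this (TF 1.x-era API; sess, train_step and feed_dict are placeholders for the existing training loop):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_step, feed_dict=feed_dict,
         options=run_options, run_metadata=run_metadata)

# Dump a Chrome trace that can be opened in chrome://tracing to see
# per-op timings and which device each op actually ran on.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())
```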

yossitsadok commented 7 years ago

Hi Daniel and Fabian,

I run "python DeconvNetPipeline.py" as suggested above and it seems that i'm getting a ridiculous speed:

2017-01-19 15:55:59.888283: step 1, loss = 68723015789051904.00 (0.1 examples/sec; 73.821 sec/batch)
2017-01-19 15:57:07.624283: step 2, loss = 60252820590297088.00 (0.1 examples/sec; 67.736 sec/batch)
2017-01-19 15:58:15.219283: step 3, loss = 51427684999233536.00 (0.1 examples/sec; 67.585 sec/batch)
2017-01-19 15:59:21.545283: step 4, loss = 67058380954402816.00 (0.2 examples/sec; 66.326 sec/batch)

I was using the default input arguments (batch size = 10, etc.) and a single GTX Titan X card running on Windows 7. Other nets I have tested on this machine run as fast as they should.

Are there any ideas on how to work around this problem? Thanks

djl11 commented 7 years ago

I was using a batch size of 1 when getting my 1-5 second timings, so your results are not entirely surprising. I think your slow speed is still predominantly down to the sparse tensor re-order and sparse-to-dense operations.

If you can find a way to re-implement the unpooling method without these operations, that would be your best bet. Unfortunately, I am no longer working on this issue (strided convolutions/deconvolutions rather than pooling/unpooling worked well for my network). Let me know if you have any luck anyway!
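For what it's worth, the strided downsampling/upsampling I switched to looks roughly like this (an illustrative sketch with placeholder weights, not code from this repo):

```python
import tensorflow as tf

def down(x, w):
    # Strided convolution halves the spatial resolution, replacing max-pooling.
    return tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME'))

def up(x, w, output_shape):
    # Strided transposed convolution doubles the resolution, replacing unpooling.
    return tf.nn.relu(tf.nn.conv2d_transpose(x, w, output_shape=output_shape,
                                             strides=[1, 2, 2, 1], padding='SAME'))
```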

yossitsadok commented 7 years ago

OK, I can try to re-implement the unpooling... One more question, please: is there any reason why batch normalization is not implemented in the provided DeconvNet model?
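For clarity, I mean something like this after each convolution (a sketch only; x, w and is_training are placeholders):

```python
import tensorflow as tf

# Batch normalization inserted between the convolution and the non-linearity.
h = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
h = tf.contrib.layers.batch_norm(h, is_training=is_training)
h = tf.nn.relu(h)
```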

stridom commented 6 years ago

Hello everyone! Does anyone have a well-trained model? I would like to see the effect of the network. As you all know, the network trains very slowly... Thanks!