LinasVidziunas opened this issue 2 years ago
New from this status report:

Model.predict() returns expected results even though it produces warnings about auto-shard (EDITED)

Screenshots on Messenger show epoch times for 1 vs 3 GPUs over 20 epochs, using the cAE on the main branch. 3 GPUs were approximately 1.8 times faster than 1 GPU for the whole process from start to finish. Per-epoch times were reduced from 22 s to 9 s, i.e. 2.4 times faster. With more epochs, or simply larger networks, I would expect the total time with 3 GPUs to get closer to 2.4 times faster than with one GPU.
Calling Model.predict() in cAE.py produces a warning telling us auto-sharding should either be disabled or set to AutoShardPolicy.DATA. This is already done on the dataset! I suspect the warnings come from the internal Model.predict(), which can be overridden, but doing that looks like hell! Preferably the warnings should be disabled, but if Model.predict() works as expected it might not be worth spending time rewriting the Model.predict() function.
I'm currently running a test with 200 epochs to check whether Model.predict() returns expected results even though warnings are thrown.
After running 200 epochs, Model.predict() seems to return the expected results. I will call this a success, but keep in mind that this approach might be prone to errors later on. The checkboxes above have been edited to reflect this observation.
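A quick sanity check along these lines (a sketch with a hypothetical toy model, not the cAE itself) is to compare Model.predict() against a direct eager forward pass, which bypasses the internal tf.data pipeline that triggers the warning:

```python
import numpy as np
import tensorflow as tf

# Toy model standing in for the cAE; untrained weights are fine,
# since we only compare the two execution paths against each other
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(4),
])
x = np.ones((6, 4), dtype="float32")

preds = model.predict(x, verbose=0)  # may emit the auto-shard warning
direct = model(x).numpy()            # eager forward pass, no sharding

# If the warning were actually corrupting results, this would fail
assert np.allclose(preds, direct, atol=1e-5)
```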
Changed the priority of this issue from Priority 3 to Priority 4, as it is currently not of any significant importance.
Reserved for status report.
Multiple GPUs
Currently, even though multiple GPUs are assigned to the project, we don't see any time improvement in training. In fact, the time gets worse by ~1 second per epoch. This might become more important once we incorporate multiple views and/or more complex models, since the training time will increase.
Suggestion
Implement what's called synchronous data parallelism, where a single model gets replicated on multiple devices or multiple machines. Each replica processes different batches of data, and their results are then merged[2].

Implementing this is not as easy as it seems at first glance, as our datasets (x_train, x_test) have to be turned into distributed datasets backed by tf.data.Dataset objects, as mentioned: "Importantly, we recommend that you use tf.data.Dataset objects to load data in a multi-device or distributed workflow."[2]

Without the datasets as tf.data.Dataset objects, a lot of warnings get displayed before training starts, and the training times also seem to become much worse! (very bad) More information about creating distributed datasets can be found in the TF documentation chapter Custom training with tf.distribute.Strategy [4], under the section Setup input pipeline.
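A minimal sketch of this setup, assuming tf.distribute.MirroredStrategy (TensorFlow's built-in synchronous data parallelism for multiple GPUs on one machine); the toy autoencoder and random data below are hypothetical stand-ins for the cAE and x_train:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for x_train in cAE.py
x_train = np.random.rand(64, 16).astype("float32")

# MirroredStrategy replicates the model on all visible GPUs and
# merges gradients after each batch (synchronous data parallelism);
# with no GPUs it falls back to a single CPU replica
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Model and optimizer must be created inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(16),
    ])
    model.compile(optimizer="adam", loss="mse")

# Wrap the arrays in a tf.data.Dataset so TF can shard batches
# across replicas without the auto-shard warnings
dataset = tf.data.Dataset.from_tensor_slices((x_train, x_train)).batch(8)
model.fit(dataset, epochs=1, verbose=0)
```

Note that the global batch size is split across replicas, so with 3 GPUs a batch of 8 gives each replica fewer samples per step; scaling the batch size with the replica count is the usual way to keep each GPU busy.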
Quick Recap:
Furthermore
Resources