LinasVidziunas / Unsupervised-lesion-detection-with-multi-view-MRI-and-autoencoders


Multiple GPUs #15

Open LinasVidziunas opened 2 years ago

LinasVidziunas commented 2 years ago

Multiple GPUs

Currently, even though multiple GPUs are assigned to the project, we see no improvement in training time. In fact, each epoch takes roughly one second longer. This may become more important once we incorporate multiple views and/or more complex models, since training time will then increase.

Suggestion

Implement what's called synchronous data parallelism, where a single model gets replicated on multiple devices or multiple machines. Each replica processes a different batch of data, and their results are then merged [2]. Implementing this is not as easy as it seems at first glance: our datasets (x_train, x_test) have to be turned into distributed datasets via tf.data.Dataset objects, as mentioned in [2]: "Importantly, we recommend that you use tf.data.Dataset objects to load data in a multi-device or distributed workflow." Without tf.data.Dataset objects, a lot of warnings are displayed before training starts, and the training times also become much worse. More information about creating distributed datasets can be found in the TF documentation chapter Custom training with tf.distribute.Strategy [4], under the section "Setup input pipeline". A minimal sketch of the approach is given below.
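Below is a minimal sketch of the MirroredStrategy approach described in [1] and [2], assuming a dummy stand-in model and random data; the actual cAE and our real datasets would replace these. The key points are that the model must be built and compiled inside strategy.scope(), and the data must be wrapped in a tf.data.Dataset batched with the global batch size.

```python
import numpy as np
import tensorflow as tf

# Synchronous data parallelism: one model replica per GPU, each replica
# processing a different slice of every global batch.
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

# Scale the global batch size so each replica still sees 32 samples.
per_replica_batch_size = 32
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# Dummy data standing in for x_train; for an autoencoder the inputs
# double as the targets.
x_train = np.random.rand(1024, 64, 64, 1).astype("float32")
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, x_train))
    .shuffle(1024)
    .batch(global_batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

# Variables must be created inside the strategy scope so they get
# mirrored across all devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(64, 64, 1)),
        tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(1, 3, padding="same", activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse")

model.fit(train_ds, epochs=20)
```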

Quick Recap:

Furthermore

Resources

  1. https://www.tensorflow.org/tutorials/distribute/keras#define_the_distribution_strategy
  2. https://keras.io/guides/distributed_training/#using-callbacks-to-ensure-fault-tolerance
  3. https://www.coursera.org/lecture/custom-distributed-training-with-tensorflow/custom-training-for-multiple-gpu-mirrored-strategy-EDiRd
  4. https://www.tensorflow.org/tutorials/distribute/custom_training
LinasVidziunas commented 2 years ago

Status report as of 31/01/22

New from this status report

Quantitative data

Screenshots on Messenger show epoch times for 1 vs. 3 GPUs over 20 epochs, using the cAE on the main branch. With 3 GPUs, the whole process from start to finish was approximately 1.8 times faster than with 1 GPU. Per-epoch times dropped from 22 s to 9 s, about 2.4 times faster. With more epochs or larger networks, I would expect the end-to-end speedup with 3 GPUs to approach that 2.4x figure.

Problems

Calling Model.predict() in cAE.py produces a warning telling us that auto-sharding should either be disabled or set to AutoShardPolicy.DATA. This is already done on the dataset! I suspect the warning comes from inside Model.predict(), which could be overridden, but that looks like hell to do. Preferably the warnings should be suppressed, but if Model.predict() works as expected, it might not be worth spending time rewriting it. I'm currently running a 200-epoch test to check whether Model.predict() returns the expected results even though the warnings are thrown. The sketch below shows how the sharding policy is set on the dataset side.
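As a sketch, the policy the warning asks for can be set explicitly on the dataset via tf.data.Options; `predict_ds` here is a hypothetical stand-in for whatever dataset cAE.py passes to Model.predict():

```python
import tensorflow as tf

# Hypothetical stand-in for the dataset handed to Model.predict().
predict_ds = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform((128, 64, 64, 1))
).batch(32)

# Set the auto-shard policy the warning mentions (AutoShardPolicy.DATA),
# i.e. shard batches by data rather than by input files.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
predict_ds = predict_ds.with_options(options)

# predictions = model.predict(predict_ds)
```

If the warning persists even with this option set, as observed here, it presumably originates from a dataset created internally by Model.predict().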

EDIT 31/01/22 12:05

After running 200 epochs, Model.predict() seems to return the expected results. I will call this a success, but keep in mind that this approach might be prone to errors later on. The checkboxes above have been updated accordingly.

LinasVidziunas commented 2 years ago

Status update as of 07/02/22

Changed the priority of this issue from Priority 3 to Priority 4, as it is currently not of any significant importance.

LinasVidziunas commented 2 years ago

Reserved for status report.