IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Running DDQNBCQ in a Spark Cluster: why does the training time increase? #462

Open · felipeeeantunes opened this issue 4 years ago

felipeeeantunes commented 4 years ago

Hello everyone,

We are currently attempting to create a new method for setting up multi-node training in a Spark cluster with GPUs in order to reduce training time [1, 2]. To validate the hypothesis that the only changes needed are in the communication protocol between the TensorFlow server and the cluster (which, if validated, would make the implementation straightforward), we developed a method to distribute the workload on a single-node cluster with 96 cores.
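For concreteness, here is a minimal sketch of the kind of single-node, multi-process setup we experimented with. It assumes a standard TF_CONFIG-style cluster spec; the names `NUM_WORKERS` and `tf_config_for` are our own, purely for illustration, not part of Coach's API:

```python
import json
import os

# Illustrative single-node "cluster": one worker process per port, all on
# localhost, emulating multi-node communication on a 96-core machine.
NUM_WORKERS = 4  # hypothetical value, sized against the available cores
CLUSTER = {"worker": [f"localhost:{2222 + i}" for i in range(NUM_WORKERS)]}

def tf_config_for(worker_index):
    """Build the TF_CONFIG that each worker process is launched with."""
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": "worker", "index": worker_index},
    })

# For example, the process acting as worker 0 would be started with:
os.environ["TF_CONFIG"] = tf_config_for(0)
```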

Our experiment did not validate this hypothesis. We found that the TensorFlow server was simply replicating the same process in every thread, with each replica consuming the entire dataset rather than its own shard of the batches. As a result, training time was multiplied rather than reduced, contrary to our expectations and to the references provided above.
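To make the distinction concrete, the sketch below contrasts the sharded behavior we expected with the replicated behavior we observed. It assumes the TF 2.x `tf.data` API, and the file pattern and function name are our own illustration:

```python
import tensorflow as tf

def make_worker_dataset(file_pattern, num_workers, worker_index, batch_size=32):
    """Expected behavior: each worker reads a disjoint 1/num_workers slice of
    the data, so adding workers should shrink wall-clock training time."""
    ds = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    ds = ds.shard(num_shards=num_workers, index=worker_index)  # disjoint slice
    ds = ds.interleave(tf.data.TFRecordDataset, cycle_length=4)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Observed behavior: every worker effectively ran the equivalent of
#     tf.data.Dataset.list_files(file_pattern)   # full dataset, no shard()
# so N workers each processed 100% of the data, multiplying training time.
```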

Moreover, we observed the same increase in training time with the native built-in K8s deployment. However, we are not sure our K8s setup was done properly, so we will repeat the test on a self-managed K8s cluster in AWS, using the AWS implementation/wrapper of Coach.

We have two possible explanations for these findings: