IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Running DDQNBCQ in a Spark Cluster: why does the training time increase? #462

Open · felipeeeantunes opened this issue 4 years ago

felipeeeantunes commented 4 years ago

Hello everyone,

We are currently attempting to create a new method for setting up multi-node training in a Spark cluster with GPUs in order to reduce training time [1, 2]. To validate the hypothesis that the only changes needed are in the communication protocol between the TensorFlow server and the cluster (which, if validated, would make the implementation straightforward), we developed a method to distribute the workload on a single-node cluster with 96 cores.
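For concreteness, here is a minimal sketch of the kind of single-node, multi-process setup we experimented with. It assumes a standard TF_CONFIG-style cluster spec; the names `NUM_WORKERS` and `tf_config_for` are our own, purely for illustration, not part of Coach's API:

```python
import json
import os

# Illustrative single-node "cluster": one worker process per port, all on
# localhost, emulating multi-node communication on a 96-core machine.
NUM_WORKERS = 4  # hypothetical value, sized against the available cores
CLUSTER = {"worker": [f"localhost:{2222 + i}" for i in range(NUM_WORKERS)]}

def tf_config_for(worker_index):
    """Build the TF_CONFIG that each worker process is launched with."""
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": "worker", "index": worker_index},
    })

# For example, the process acting as worker 0 would be started with:
os.environ["TF_CONFIG"] = tf_config_for(0)
```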

Our experiment did not validate this hypothesis. We found that the TensorFlow server was simply replicating the same process in every thread, with each replica consuming the entire dataset rather than its own shard of the batches. As a result, training time was multiplied rather than reduced, contrary to our expectations and to the references provided above.
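To make the distinction concrete, the sketch below contrasts the sharded behavior we expected with the replicated behavior we observed. It assumes the TF 2.x `tf.data` API, and the file pattern and function name are our own illustration:

```python
import tensorflow as tf

def make_worker_dataset(file_pattern, num_workers, worker_index, batch_size=32):
    """Expected behavior: each worker reads a disjoint 1/num_workers slice of
    the data, so adding workers should shrink wall-clock training time."""
    ds = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    ds = ds.shard(num_shards=num_workers, index=worker_index)  # disjoint slice
    ds = ds.interleave(tf.data.TFRecordDataset, cycle_length=4)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Observed behavior: every worker effectively ran the equivalent of
#     tf.data.Dataset.list_files(file_pattern)   # full dataset, no shard()
# so N workers each processed 100% of the data, multiplying training time.
```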

Moreover, we observed the same increase in training time with the native built-in K8s deployment. However, we are not sure our K8s setup was done properly, so we will repeat the test on a self-managed K8s cluster in AWS, using the AWS implementation/wrapper of Coach.

We have two possible explanations for these findings: