IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Add RedisDataStore #295

Closed zach-nervana closed 5 years ago

zach-nervana commented 5 years ago

The good news is that the previous bottleneck, serialization, communication, and deserialization of the policies, is no longer the primary bottleneck. The bad news is that, given the inefficiencies of the emulate_* code path, we are still unable to benefit significantly from additional rollout workers in the case of PPO on the Humanoid environment. Environments that are more difficult to simulate will likely see more benefit from this setup as-is. More details are below.

To run this code, run `make distributed` from the `docker` directory. You must also have a Kubernetes cluster configured locally and Coach installed locally.

2019-04-18-20:28:30.999694 Testing - Name: main_level/agent Worker: 0 Episode: 4927 Total reward: 253.96 Steps: 287237 Training iteration: 57
agent: Finished evaluation phase. Success rate = 0.0, Avg Total Reward = 253.96
2019-04-18-20:28:31.177425 Training - Name: main_level/agent Worker: 0 Episode: 4928 Total reward: 0 Steps: 287295 Training iteration: 57
2019-04-18-20:28:31.245510 Training - Name: main_level/agent Worker: 0 Episode: 4929 Total reward: 0 Steps: 287373 Training iteration: 57
2019-04-18-20:28:31.328241 Training - Name: main_level/agent Worker: 0 Episode: 4930 Total reward: 0 Steps: 287441 Training iteration: 57
2019-04-18-20:28:31.386314 Training - Name: main_level/agent Worker: 0 Episode: 4931 Total reward: 0 Steps: 287497 Training iteration: 57
2019-04-18-20:28:31.455898 Training - Name: main_level/agent Worker: 0 Episode: 4932 Total reward: 0 Steps: 287554 Training iteration: 57
2019-04-18-20:28:31.518685 Training - Name: main_level/agent Worker: 0 Episode: 4933 Total reward: 0 Steps: 287616 Training iteration: 57
2019-04-18-20:28:31.580325 Training - Name: main_level/agent Worker: 0 Episode: 4934 Total reward: 0 Steps: 287672 Training iteration: 57
2019-04-18-20:28:31.646645 Training - Name: main_level/agent Worker: 0 Episode: 4935 Total reward: 0 Steps: 287749 Training iteration: 57
2019-04-18-20:28:31.709180 Training - Name: main_level/agent Worker: 0 Episode: 4936 Total reward: 0 Steps: 287813 Training iteration: 57
2019-04-18-20:28:31.771553 Training - Name: main_level/agent Worker: 0 Episode: 4937 Total reward: 0 Steps: 287868 Training iteration: 57
2019-04-18-20:28:31.833934 Training - Name: main_level/agent Worker: 0 Episode: 4938 Total reward: 0 Steps: 287918 Training iteration: 57
2019-04-18-20:28:31.897750 Training - Name: main_level/agent Worker: 0 Episode: 4939 Total reward: 0 Steps: 287995 Training iteration: 57
2019-04-18-20:28:31.963689 Training - Name: main_level/agent Worker: 0 Episode: 4940 Total reward: 0 Steps: 288051 Training iteration: 57
2019-04-18-20:28:32.033548 Training - Name: main_level/agent Worker: 0 Episode: 4941 Total reward: 0 Steps: 288110 Training iteration: 57
2019-04-18-20:28:32.107813 Training - Name: main_level/agent Worker: 0 Episode: 4942 Total reward: 0 Steps: 288168 Training iteration: 57
2019-04-18-20:28:32.185998 Training - Name: main_level/agent Worker: 0 Episode: 4943 Total reward: 0 Steps: 288270 Training iteration: 57

The total time between the end of the evaluation phase and the processing of the first rollout in the next training phase is less than 200 ms. This includes the time required to serialize the policy, send it from the training worker to the rollout worker, deserialize it and load it into TensorFlow, evaluate a rollout, serialize the episode transitions, send them back to the master, and finally run these episodes through the master's emulate_* training code path.
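For context, here is a minimal sketch of the kind of policy hand-off that this round trip times, written against a plain redis-py client with a made-up key name; it only illustrates the data flow and is not the RedisDataStore implementation added in this PR:

```python
# Illustrative sketch only: a plain redis-py client and a hypothetical key,
# not the RedisDataStore implementation added in this PR.
import pickle

import redis

r = redis.Redis(host="redis-service", port=6379)  # assumed Redis service address

def publish_policy(weights, key="policy/latest"):
    # Training worker: serialize the policy and push it to Redis.
    r.set(key, pickle.dumps(weights))

def fetch_policy(key="policy/latest"):
    # Rollout worker: pull the latest policy and deserialize it.
    blob = r.get(key)
    return pickle.loads(blob) if blob is not None else None
```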

Policy serialization, communication, and deserialization is no longer the primary bottleneck. The primary bottleneck now appears to be the serial processing of transitions by the training process. You can see above that the training worker requires approximately (32.185998 s - 31.177425 s) / (288270 steps - 287295 steps) * (1000 ms / 1 s) ≈ 1.03 ms/step. At 4000 `steps_between_evaluation_periods`, that comes out to around 4 s per training period (see the arithmetic check below). This does not even include computing gradients and updating the policy weights; it is only the cost of pushing the transitions through the `emulate_*` code path. Fortunately, further refactoring should greatly reduce this, since the transitions do not need to be processed sequentially as they are now; in principle they could be processed in parallel. I also suspect there is overhead coming from somewhere else: it should not take 1 ms to add a transition to a replay buffer, which is essentially all that is happening here.
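A quick sanity check of that arithmetic, using the timestamps and step counts from the log above (the 4000-step schedule value is the one quoted in the text):

```python
# Back-of-the-envelope check of the per-step transition-processing cost,
# using the timestamps and step counts from the log above.
elapsed_s = 32.185998 - 31.177425          # seconds between the two log lines
steps = 288270 - 287295                    # environment steps covered in that window
ms_per_step = elapsed_s / steps * 1000
print(f"{ms_per_step:.2f} ms/step")        # ~1.03 ms/step

steps_between_evaluation_periods = 4000    # schedule value quoted in the text
period_s = ms_per_step * steps_between_evaluation_periods / 1000
print(f"~{period_s:.1f} s spent in emulate_* per training period")  # ~4.1 s
```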

gal-leibovich commented 5 years ago

Thanks a lot @zach-nervana for the detailed analysis! Resolving the communication and serialization bottlenecks is a great improvement to the distributed runs code path.

Having a transition added to the replay buffer take 1 msec does indeed sound like too much. It might be the result of the mutexes laid out throughout the replay buffer code, which are used to allow multiple workers to read from and write to the same buffer, but we need to debug to make sure.
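For reference, a toy lock-guarded buffer along these lines (a stand-in for illustration, not Coach's actual memory code) appends in a few microseconds when uncontended, so the gap up to the observed ~1 ms per transition is what would need profiling:

```python
# Toy stand-in for a mutex-guarded shared replay buffer; not Coach's actual
# memory implementation, just a way to ballpark the cost of a locked append.
import threading
import time

class LockedReplayBuffer:
    def __init__(self):
        self._lock = threading.Lock()
        self._transitions = []

    def store(self, transition):
        with self._lock:  # contention on this lock across workers would add latency
            self._transitions.append(transition)

buffer = LockedReplayBuffer()
n = 100_000
start = time.perf_counter()
for i in range(n):
    buffer.store({"state": i, "action": 0, "reward": 0.0})
per_call_ms = (time.perf_counter() - start) / n * 1000
print(f"{per_call_ms:.4f} ms per store()")  # typically a few microseconds when uncontended
```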