Restore checkpoint on multi-node training

IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms

https://intellabs.github.io/coach/

Apache License 2.0

2.32k stars 459 forks source link

Restore checkpoint on multi-node training #195

Closed galnov closed 5 years ago

galnov commented 5 years ago

-crd flag is not working for multi-node. Need to implement a mechanism to sync the checkpoint from the orchestrator into the rollout_workers and trainer for restore.