IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Distributed coach stalls if number of workers is greater than number of available vCPUs. #303

Open geranim0 opened 5 years ago

geranim0 commented 5 years ago

To keep k8s from scheduling all the pods on the same node, I added resource requests to the pods with this change in kubernetes_orchestrator.py:

```python
resources=k8sclient.V1ResourceRequirements(
    requests={'cpu': '1'}
),
```

It works if I set num_workers < vCPUs, but stalls otherwise, since some pods remain pending and are never created. Is this by design with the worker locks? What's the recommended approach?
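The stall described above follows directly from the arithmetic of CPU requests. A minimal illustrative sketch (not Coach code; the function name and numbers are hypothetical) of why pods go pending when each worker requests 1 vCPU:

```python
def schedulable_workers(num_workers, free_vcpus, cpu_request_per_pod=1):
    """Return how many worker pods the scheduler can actually place.

    With a hard CPU request per pod, the scheduler can only bind as many
    pods as the free vCPUs allow; the remainder stay in Pending.
    """
    return min(num_workers, free_vcpus // cpu_request_per_pod)

# e.g. requesting 8 workers on a cluster with 4 free vCPUs: only 4 pods
# run, 4 stay Pending, and the trainer waits indefinitely for experience
# from workers that never start.
runnable = schedulable_workers(8, 4)
```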

gal-leibovich commented 5 years ago

@balajismaniam, @scttl, could you please take a look?

balajismaniam commented 5 years ago

Hi @geranim0, this is expected, and your changes might cause unexpected behavior depending on how much resource you have. The trainer will not receive enough experience to start a training iteration if even one of the rollout workers doesn't work as expected. This varies widely depending on how your platform is set up, which is why we have not specified any resource requirements.

If you want the pods to be spread across different nodes, please look at the following: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
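A minimal sketch of the podAntiAffinity stanza the linked docs describe, written as the plain dict that would be serialized into a pod spec (the label `app: coach-rollout-worker` is an assumption for illustration; the Python k8s client models the same fields as V1Affinity/V1PodAntiAffinity):

```python
# Hypothetical anti-affinity patch: prefer not to co-locate pods that
# carry the same (assumed) app label on one node.
anti_affinity = {
    "podAntiAffinity": {
        # "preferred..." still lets pods share a node on a small cluster;
        # use "requiredDuringSchedulingIgnoredDuringExecution" for a hard
        # constraint (at the risk of pods going Pending again).
        "preferredDuringSchedulingIgnoredDuringExecution": [{
            "weight": 100,
            "podAffinityTerm": {
                "labelSelector": {
                    "matchLabels": {"app": "coach-rollout-worker"}
                },
                # spread across nodes (one hostname per node)
                "topologyKey": "kubernetes.io/hostname",
            },
        }]
    }
}

# This dict would go under spec.affinity in the worker pod template.
pod_spec_patch = {"affinity": anti_affinity}
```

Unlike a CPU request, a preferred anti-affinity rule only biases placement, so workers still schedule when the cluster has fewer nodes than workers.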

By default, the Kubernetes scheduler tries to spread pods belonging to the same deployment: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithm/priorities/selector_spreading.go Please check whether this is disabled in your cluster.