bytedance / byteps

A high performance and generic framework for distributed DNN training

The byteps in K8S Pod doesn't have DMLC_WORKER_ID configured. #418

Open jackjinj opened 2 years ago

jackjinj commented 2 years ago

Describe the bug

The BytePS worker Pod on Kubernetes doesn't have DMLC_WORKER_ID configured, so bpslaunch complains that it can't find the DMLC_WORKER_ID variable and errors out.

To Reproduce

Steps to reproduce the behavior:

  1. Prepare Kubernetes 1.19
  2. Install Kubeflow 1.2, which includes the mxjob operator
  3. Download the yaml from https://github.com/kubeflow/mxnet-operator/blob/master/examples/train/byteps_dist_gpu_v1.yaml
  4. kubectl apply -f byteps_dist_gpu_v1.yaml
  5. kubectl get pod:

         byteps-mxnet-job-scheduler-0   1/1   Running     0   8s
         byteps-mxnet-job-server-0      1/1   Running     0   8s
         byteps-mxnet-job-server-1      1/1   Running     0   8s
         byteps-mxnet-job-worker-0      0/1   Completed   0   8s
         byteps-mxnet-job-worker-1      0/1   Completed   0   7s

$ kubectl describe pod byteps-mxnet-job-worker-0

You can see DMLC_WORKER_ID is not there:

    DMLC_PS_ROOT_PORT: 9091
    DMLC_PS_ROOT_URI: byteps-mxnet-job-scheduler-0
    DMLC_NUM_SERVER: 2
    DMLC_NUM_WORKER: 2
    DMLC_ROLE: worker
    DMLC_USE_KUBERNETES: 1
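A quick way to confirm which variables are absent from inside the container is a small check like the one below. This is only a sketch: the list of expected names is taken from the describe output above plus DMLC_WORKER_ID, and it is an assumption about what the launcher needs, not something copied from the BytePS source. The helper name `missing_vars` is hypothetical.

```python
import os

# Names seen in the `kubectl describe pod` output, plus the variable that
# bpslaunch reports as missing. Assumed list, for illustration only.
EXPECTED_WORKER_VARS = [
    "DMLC_PS_ROOT_PORT",
    "DMLC_PS_ROOT_URI",
    "DMLC_NUM_SERVER",
    "DMLC_NUM_WORKER",
    "DMLC_ROLE",
    "DMLC_WORKER_ID",
]


def missing_vars(env, expected=EXPECTED_WORKER_VARS):
    """Return the expected variable names that are absent from `env`."""
    return [name for name in expected if name not in env]


def report(env=None):
    """Print one line per missing variable; defaults to the process environment."""
    env = os.environ if env is None else env
    for name in missing_vars(env):
        print("missing:", name)
```

Calling `report()` inside the worker Pod should list DMLC_WORKER_ID if the environment matches the describe output above.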

To reproduce it inside the Pod, you can modify the yaml as below so that the Pod stays up without running bpslaunch (the original command and args are commented out):

    command: ["/bin/bash", "-c"]
    args: [ "sleep 3600" ]
    # command: ["bpslaunch"]
    # args: ["python3", "/usr/local/byteps/example/mxnet/train_imagenet_byteps.py", "--benchmark", "1", "--batch-size=32"]

Then apply the yaml to let the Pod run:

    byteps-mxnet-job-server-0   1/1   Running   0   15s
    byteps-mxnet-job-server-1   1/1   Running   0   15s
    byteps-mxnet-job-worker-0   1/1   Running   0   15s
    byteps-mxnet-job-worker-1   1/1   Running   0   14s

Then log in as below:

    $ kubectl exec -it byteps-mxnet-job-worker-0 -- bash
    root@byteps-mxnet-job-worker-0:/#
    root@byteps-mxnet-job-worker-0:/# env |grep DMLC_WORKER_ID
    root@byteps-mxnet-job-worker-0:/# bpslaunch
    BytePS launching worker
    The env DMLC_WORKER_ID is missing
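As a possible workaround (untested against the operator, and not the maintainers' fix), the worker index could be derived from the Pod hostname, which in the session above ends in the worker ordinal (byteps-mxnet-job-worker-0). The sketch below assumes that naming pattern holds; the function names `worker_id_from_hostname` and `with_worker_id` are hypothetical.

```python
import re


def worker_id_from_hostname(hostname):
    """Return the trailing ordinal of a name like 'byteps-mxnet-job-worker-0'.

    Returns None when the hostname does not end in '-worker-<number>'.
    """
    match = re.search(r"-worker-(\d+)$", hostname)
    return match.group(1) if match else None


def with_worker_id(env, hostname):
    """Return a copy of `env` with DMLC_WORKER_ID filled in when possible.

    The value is only added for DMLC_ROLE=worker and only when it is not
    already set, so an explicitly configured ID is never overwritten.
    """
    patched = dict(env)
    if patched.get("DMLC_ROLE") == "worker" and "DMLC_WORKER_ID" not in patched:
        worker_id = worker_id_from_hostname(hostname)
        if worker_id is not None:
            patched["DMLC_WORKER_ID"] = worker_id
    return patched
```

In a wrapper entrypoint, one could pass `os.environ` and `socket.gethostname()` to `with_worker_id` and then start bpslaunch with the returned environment, for example through `subprocess.run(..., env=...)`.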

Expected behavior

I expect to see the worker Pods running.

Additional context

If I need to run PyTorch DDP with BytePS on Kubernetes, do I still have to use the mxjob operator, or can I use the PyTorchJob operator?

Thanks

Jack