Describe the bug
The BytePS worker Pod in Kubernetes does not have DMLC_WORKER_ID configured, so bpslaunch complains that it cannot find the DMLC_WORKER_ID variable and exits with an error.
To Reproduce
$ kubectl describe pod byteps-mxnet-job-worker-0
DMLC_WORKER_ID is not among the environment variables:
DMLC_PS_ROOT_PORT: 9091
DMLC_PS_ROOT_URI: byteps-mxnet-job-scheduler-0
DMLC_NUM_SERVER: 2
DMLC_NUM_WORKER: 2
DMLC_ROLE: worker
DMLC_USE_KUBERNETES: 1
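The same check can be scripted. Below is a minimal sketch that treats the six variables listed above plus DMLC_WORKER_ID as the expected set for a worker. That set is an assumption taken from this report, not from the BytePS documentation, and `missing_dmlc_vars` is a hypothetical helper name, not part of BytePS.

```python
import os

# Expected variables for a worker: the six shown by `kubectl describe`
# plus DMLC_WORKER_ID. This set is an assumption based on this report.
EXPECTED_WORKER_VARS = (
    "DMLC_PS_ROOT_PORT",
    "DMLC_PS_ROOT_URI",
    "DMLC_NUM_SERVER",
    "DMLC_NUM_WORKER",
    "DMLC_ROLE",
    "DMLC_USE_KUBERNETES",
    "DMLC_WORKER_ID",
)


def missing_dmlc_vars(env=None):
    """Return the expected variable names that are absent from env.

    env defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in EXPECTED_WORKER_VARS if name not in env]
```

Called with no arguments inside the worker Pod, it inspects the Pod's own environment; for the Pod described above it would be expected to report only DMLC_WORKER_ID.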
To reproduce it inside the Pod, modify the yaml as below so that the Pod starts without running bpslaunch:
command: ["/bin/bash", "-c"]
args: [
"sleep 3600"
]
Then apply the yaml to let the Pod run:
byteps-mxnet-job-server-0 1/1 Running 0 15s
byteps-mxnet-job-server-1 1/1 Running 0 15s
byteps-mxnet-job-worker-0 1/1 Running 0 15s
byteps-mxnet-job-worker-1 1/1 Running 0 14s
Then log in to the worker Pod as below:
$ kubectl exec -it byteps-mxnet-job-worker-0 -- bash
root@byteps-mxnet-job-worker-0:/#
root@byteps-mxnet-job-worker-0:/# env |grep DMLC_WORKER_ID
root@byteps-mxnet-job-worker-0:/# bpslaunch
BytePS launching worker
The env DMLC_WORKER_ID is missing
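A possible stopgap, untested against BytePS itself, is to derive the ID from the Pod name. The sketch below assumes that worker Pods keep the trailing ordinal seen above (byteps-mxnet-job-worker-0, byteps-mxnet-job-worker-1) and that this ordinal is an acceptable worker ID; `worker_id_from_hostname` is a hypothetical helper, not part of BytePS.

```python
import re


def worker_id_from_hostname(hostname):
    """Return the trailing ordinal of a Pod name such as 'job-worker-1'.

    Raises ValueError when the name does not end in '-<digits>'.
    """
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no trailing ordinal in hostname: {hostname!r}")
    return int(match.group(1))
```

Inside the Pod, the result for `socket.gethostname()` could then be exported as DMLC_WORKER_ID before bpslaunch is started.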
Expected behavior
I expect the worker Pod to start and keep running.
Environment (please complete the following information):
OS:
GCC version:
CUDA and NCCL version:
Framework (TF, PyTorch, MXNet):
Additional context
The yaml's bpslaunch command, for reference:
command: ["bpslaunch"]
args: ["python3", "/usr/local/byteps/example/mxnet/train_imagenet_byteps.py", "--benchmark", "1", "--batch-size=32"]
If I need to run PyTorch DDP with BytePS on Kubernetes, do I still have to use the MXJob operator, or can I use the PyTorchJob operator?
Thanks
Jack