canonical / bundle-kubeflow

Charmed Kubeflow

PaddlePaddle does not work in user namespace #610

Open Barteus opened 1 year ago

Barteus commented 1 year ago

Reproduce

  1. Install CKF 1.7
  2. Run the PaddlePaddle example in the kubeflow namespace - WORKS!
  3. Run the PaddlePaddle example in a user namespace - hangs

The example used: https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/paddlepaddle/simple-cpu.yaml
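
For reference, steps 2 and 3 amount to applying the same manifest to the two namespaces (a sketch, assuming kubectl access to the cluster, that admin is the user namespace as in the output below, and that the example manifest does not pin a namespace):

$ kubectl apply -n kubeflow -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/paddlepaddle/simple-cpu.yaml
$ kubectl apply -n admin -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/paddlepaddle/simple-cpu.yaml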

Pods:

$ kubectl get po -A | grep paddle
admin                              paddle-simple-cpu-worker-0                              2/2     Running            0                 8m6s
admin                              paddle-simple-cpu-worker-1                              2/2     Running            0                 8m5s
kubeflow                           paddle-simple-cpu-worker-0                              0/1     Completed          0                 14m
kubeflow                           paddle-simple-cpu-worker-1                              0/1     Completed          0                 14m

Logs from the Pods in the kubeflow namespace:

$ kubectl logs paddle-simple-cpu-worker-0 -n kubeflow
LAUNCH INFO 2023-06-02 09:57:50,108 Paddle Distributed Test begin...
LAUNCH INFO 2023-06-02 09:57:50,115 -----------  Configuration  ----------------------
LAUNCH INFO 2023-06-02 09:57:50,115 devices: None
LAUNCH INFO 2023-06-02 09:57:50,115 elastic_level: -1
LAUNCH INFO 2023-06-02 09:57:50,115 elastic_timeout: 30
LAUNCH INFO 2023-06-02 09:57:50,115 gloo_port: 6767
LAUNCH INFO 2023-06-02 09:57:50,115 host: None
LAUNCH INFO 2023-06-02 09:57:50,115 ips: None
LAUNCH INFO 2023-06-02 09:57:50,115 job_id: paddle-simple-cpu
LAUNCH INFO 2023-06-02 09:57:50,115 legacy: False
LAUNCH INFO 2023-06-02 09:57:50,115 log_dir: log
LAUNCH INFO 2023-06-02 09:57:50,115 log_level: INFO
LAUNCH INFO 2023-06-02 09:57:50,115 master: 192.168.83.48:37777
LAUNCH INFO 2023-06-02 09:57:50,115 max_restart: 3
LAUNCH INFO 2023-06-02 09:57:50,115 nnodes: 2
LAUNCH INFO 2023-06-02 09:57:50,115 nproc_per_node: None
LAUNCH INFO 2023-06-02 09:57:50,115 rank: -1
LAUNCH INFO 2023-06-02 09:57:50,116 run_mode: collective
LAUNCH INFO 2023-06-02 09:57:50,116 server_num: None
LAUNCH INFO 2023-06-02 09:57:50,116 servers: 
LAUNCH INFO 2023-06-02 09:57:50,116 start_port: 6070
LAUNCH INFO 2023-06-02 09:57:50,116 trainer_num: None
LAUNCH INFO 2023-06-02 09:57:50,116 trainers: 
LAUNCH INFO 2023-06-02 09:57:50,116 training_script: /usr/local/lib/python3.7/dist-packages/paddle/distributed/launch/plugins/test.py
LAUNCH INFO 2023-06-02 09:57:50,116 training_script_args: []
LAUNCH INFO 2023-06-02 09:57:50,116 with_gloo: 1
LAUNCH INFO 2023-06-02 09:57:50,116 --------------------------------------------------
LAUNCH INFO 2023-06-02 09:57:50,116 Job: paddle-simple-cpu, mode collective, replicas 2[2:2], elastic False
LAUNCH INFO 2023-06-02 09:57:50,116 Waiting peer start...
LAUNCH INFO 2023-06-02 09:57:51,191 Run Pod: cfjjuo, replicas 1, status ready
LAUNCH INFO 2023-06-02 09:57:51,199 Watching Pod: cfjjuo, replicas 1, status running
LAUNCH INFO 2023-06-02 09:58:02,210 Pod completed
LAUNCH INFO 2023-06-02 09:58:02,710 Exit code 0
LAUNCH WARNNING args master is override by env 192.168.83.48:37777
LAUNCH WARNNING args nnodes is override by env 2
LAUNCH WARNNING args job_id is override by env paddle-simple-cpu
Prepare distributed training with 2 nodes 1 cards
I0602 09:57:52.579567    25 tcp_utils.cc:181] The server starts to listen on IP_ANY:60922
I0602 09:57:52.579725    25 tcp_utils.cc:130] Successfully connected to 192.168.83.48:60922
2023-06-02 09:57:58,707-INFO: [topology.py:187:__init__] HybridParallelInfo: rank_id: 0, mp_degree: 1, sharding_degree: 1, pp_degree: 1, dp_degree: 2, mp_group: [0],  sharding_group: [0], pp_group: [0], dp_group: [0, 1], check/clip group: [0]
Distributed training start...
/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/norm.py:676: UserWarning: When training, we now always track global mean and variance.
  "When training, we now always track global mean and variance.")
[Epoch 0, batch 0] loss: 5.64991, acc1: 0.00000, acc5: 0.00000
[Epoch 1, batch 0] loss: 64.50126, acc1: 0.00000, acc5: 0.00000
[Epoch 2, batch 0] loss: 64.00022, acc1: 0.00000, acc5: 0.00000
Distributed training completed
I0602 09:58:01.824779    41 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
$ kubectl logs paddle-simple-cpu-worker-1 -n kubeflow
LAUNCH INFO 2023-06-02 09:57:50,106 Paddle Distributed Test begin...
LAUNCH INFO 2023-06-02 09:57:50,113 -----------  Configuration  ----------------------
LAUNCH INFO 2023-06-02 09:57:50,113 devices: None
LAUNCH INFO 2023-06-02 09:57:50,113 elastic_level: -1
LAUNCH INFO 2023-06-02 09:57:50,113 elastic_timeout: 30
LAUNCH INFO 2023-06-02 09:57:50,113 gloo_port: 6767
LAUNCH INFO 2023-06-02 09:57:50,113 host: None
LAUNCH INFO 2023-06-02 09:57:50,113 ips: None
LAUNCH INFO 2023-06-02 09:57:50,113 job_id: paddle-simple-cpu
LAUNCH INFO 2023-06-02 09:57:50,113 legacy: False
LAUNCH INFO 2023-06-02 09:57:50,113 log_dir: log
LAUNCH INFO 2023-06-02 09:57:50,113 log_level: INFO
LAUNCH INFO 2023-06-02 09:57:50,113 master: paddle-simple-cpu-worker-0:37777
LAUNCH INFO 2023-06-02 09:57:50,113 max_restart: 3
LAUNCH INFO 2023-06-02 09:57:50,113 nnodes: 2
LAUNCH INFO 2023-06-02 09:57:50,114 nproc_per_node: None
LAUNCH INFO 2023-06-02 09:57:50,114 rank: -1
LAUNCH INFO 2023-06-02 09:57:50,114 run_mode: collective
LAUNCH INFO 2023-06-02 09:57:50,114 server_num: None
LAUNCH INFO 2023-06-02 09:57:50,114 servers: 
LAUNCH INFO 2023-06-02 09:57:50,114 start_port: 6070
LAUNCH INFO 2023-06-02 09:57:50,114 trainer_num: None
LAUNCH INFO 2023-06-02 09:57:50,114 trainers: 
LAUNCH INFO 2023-06-02 09:57:50,114 training_script: /usr/local/lib/python3.7/dist-packages/paddle/distributed/launch/plugins/test.py
LAUNCH INFO 2023-06-02 09:57:50,114 training_script_args: []
LAUNCH INFO 2023-06-02 09:57:50,114 with_gloo: 1
LAUNCH INFO 2023-06-02 09:57:50,114 --------------------------------------------------
LAUNCH INFO 2023-06-02 09:57:50,114 Job: paddle-simple-cpu, mode collective, replicas 2[2:2], elastic False
LAUNCH INFO 2023-06-02 09:57:50,114 Waiting peer start...
LAUNCH INFO 2023-06-02 09:57:51,199 Run Pod: fmdmbm, replicas 1, status ready
LAUNCH INFO 2023-06-02 09:57:51,206 Watching Pod: fmdmbm, replicas 1, status running
LAUNCH INFO 2023-06-02 09:58:02,219 Pod completed
LAUNCH INFO 2023-06-02 09:58:02,219 Exit code 0
LAUNCH WARNNING args master is override by env paddle-simple-cpu-worker-0:37777
LAUNCH WARNNING args nnodes is override by env 2
LAUNCH WARNNING args job_id is override by env paddle-simple-cpu
Prepare distributed training with 2 nodes 1 cards
I0602 09:57:52.541127    24 tcp_utils.cc:107] Retry to connect to 192.168.83.48:60922 while the server is not yet listening.
I0602 09:57:55.541379    24 tcp_utils.cc:130] Successfully connected to 192.168.83.48:60922
2023-06-02 09:57:57,391-INFO: [topology.py:187:__init__] HybridParallelInfo: rank_id: 1, mp_degree: 1, sharding_degree: 1, pp_degree: 1, dp_degree: 2, mp_group: [1],  sharding_group: [1], pp_group: [1], dp_group: [0, 1], check/clip group: [1]
Distributed training start...
/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/norm.py:676: UserWarning: When training, we now always track global mean and variance.
  "When training, we now always track global mean and variance.")
[Epoch 0, batch 0] loss: 6.07217, acc1: 0.00000, acc5: 0.00000
[Epoch 1, batch 0] loss: 64.46606, acc1: 0.00000, acc5: 0.00000
[Epoch 2, batch 0] loss: 64.00018, acc1: 0.00000, acc5: 0.00000
Distributed training completed

Logs from the Pods stuck waiting in the admin namespace:

$ kubectl logs paddle-simple-cpu-worker-0 -n admin
LAUNCH INFO 2023-06-02 10:01:50,267 Paddle Distributed Test begin...
LAUNCH INFO 2023-06-02 10:01:50,277 -----------  Configuration  ----------------------
LAUNCH INFO 2023-06-02 10:01:50,277 devices: None
LAUNCH INFO 2023-06-02 10:01:50,277 elastic_level: -1
LAUNCH INFO 2023-06-02 10:01:50,277 elastic_timeout: 30
LAUNCH INFO 2023-06-02 10:01:50,277 gloo_port: 6767
LAUNCH INFO 2023-06-02 10:01:50,277 host: None
LAUNCH INFO 2023-06-02 10:01:50,277 ips: None
LAUNCH INFO 2023-06-02 10:01:50,277 job_id: paddle-simple-cpu
LAUNCH INFO 2023-06-02 10:01:50,277 legacy: False
LAUNCH INFO 2023-06-02 10:01:50,277 log_dir: log
LAUNCH INFO 2023-06-02 10:01:50,277 log_level: INFO
LAUNCH INFO 2023-06-02 10:01:50,277 master: 192.168.83.48:37777
LAUNCH INFO 2023-06-02 10:01:50,277 max_restart: 3
LAUNCH INFO 2023-06-02 10:01:50,277 nnodes: 2
LAUNCH INFO 2023-06-02 10:01:50,277 nproc_per_node: None
LAUNCH INFO 2023-06-02 10:01:50,277 rank: -1
LAUNCH INFO 2023-06-02 10:01:50,277 run_mode: collective
LAUNCH INFO 2023-06-02 10:01:50,277 server_num: None
LAUNCH INFO 2023-06-02 10:01:50,277 servers: 
LAUNCH INFO 2023-06-02 10:01:50,277 start_port: 6070
LAUNCH INFO 2023-06-02 10:01:50,277 trainer_num: None
LAUNCH INFO 2023-06-02 10:01:50,277 trainers: 
LAUNCH INFO 2023-06-02 10:01:50,278 training_script: /usr/local/lib/python3.7/dist-packages/paddle/distributed/launch/plugins/test.py
LAUNCH INFO 2023-06-02 10:01:50,278 training_script_args: []
LAUNCH INFO 2023-06-02 10:01:50,278 with_gloo: 1
LAUNCH INFO 2023-06-02 10:01:50,278 --------------------------------------------------
LAUNCH INFO 2023-06-02 10:01:50,278 Job: paddle-simple-cpu, mode collective, replicas 2[2:2], elastic False
LAUNCH INFO 2023-06-02 10:01:50,278 Waiting peer start...
$ kubectl logs paddle-simple-cpu-worker-1 -n admin
LAUNCH INFO 2023-06-02 10:01:51,100 Paddle Distributed Test begin...
LAUNCH INFO 2023-06-02 10:01:51,107 -----------  Configuration  ----------------------
LAUNCH INFO 2023-06-02 10:01:51,107 devices: None
LAUNCH INFO 2023-06-02 10:01:51,107 elastic_level: -1
LAUNCH INFO 2023-06-02 10:01:51,107 elastic_timeout: 30
LAUNCH INFO 2023-06-02 10:01:51,107 gloo_port: 6767
LAUNCH INFO 2023-06-02 10:01:51,107 host: None
LAUNCH INFO 2023-06-02 10:01:51,107 ips: None
LAUNCH INFO 2023-06-02 10:01:51,107 job_id: paddle-simple-cpu
LAUNCH INFO 2023-06-02 10:01:51,107 legacy: False
LAUNCH INFO 2023-06-02 10:01:51,107 log_dir: log
LAUNCH INFO 2023-06-02 10:01:51,107 log_level: INFO
LAUNCH INFO 2023-06-02 10:01:51,107 master: paddle-simple-cpu-worker-0:37777
LAUNCH INFO 2023-06-02 10:01:51,107 max_restart: 3
LAUNCH INFO 2023-06-02 10:01:51,107 nnodes: 2
LAUNCH INFO 2023-06-02 10:01:51,107 nproc_per_node: None
LAUNCH INFO 2023-06-02 10:01:51,107 rank: -1
LAUNCH INFO 2023-06-02 10:01:51,107 run_mode: collective
LAUNCH INFO 2023-06-02 10:01:51,107 server_num: None
LAUNCH INFO 2023-06-02 10:01:51,107 servers: 
LAUNCH INFO 2023-06-02 10:01:51,107 start_port: 6070
LAUNCH INFO 2023-06-02 10:01:51,108 trainer_num: None
LAUNCH INFO 2023-06-02 10:01:51,108 trainers: 
LAUNCH INFO 2023-06-02 10:01:51,108 training_script: /usr/local/lib/python3.7/dist-packages/paddle/distributed/launch/plugins/test.py
LAUNCH INFO 2023-06-02 10:01:51,108 training_script_args: []
LAUNCH INFO 2023-06-02 10:01:51,108 with_gloo: 1
LAUNCH INFO 2023-06-02 10:01:51,108 --------------------------------------------------
LAUNCH INFO 2023-06-02 10:01:51,108 Job: paddle-simple-cpu, mode collective, replicas 2[2:2], elastic False
LAUNCH INFO 2023-06-02 10:01:51,108 Waiting peer start...
LAUNCH WARNING 2023-06-02 10:01:56,110 master not ready
LAUNCH WARNING 2023-06-02 10:02:01,212 master not ready
LAUNCH WARNING 2023-06-02 10:02:06,313 master not ready
LAUNCH WARNING 2023-06-02 10:02:11,418 master not ready
LAUNCH WARNING 2023-06-02 10:02:16,519 master not ready
LAUNCH WARNING 2023-06-02 10:02:21,623 master not ready
LAUNCH WARNING 2023-06-02 10:02:26,726 master not ready
..........
kimwnasptd commented 1 year ago

Thanks for the input @Barteus!

Looks like this is an Istio issue: in the kubeflow namespace there are no Istio sidecars, but in admin the Istio sidecar is injected. We'll need to fully understand the communication path between the PaddlePaddle Pods and (most probably) create corresponding AuthorizationPolicies to allow some of that communication.
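
As a quick check, the injected sidecar can be confirmed by listing the containers of a worker Pod in each namespace (a sketch, assuming the Pod names from the output above; the 2/2 READY count for the admin Pods already hints at the extra istio-proxy container):

$ kubectl get pod paddle-simple-cpu-worker-0 -n admin -o jsonpath='{.spec.containers[*].name}'
$ kubectl get pod paddle-simple-cpu-worker-0 -n kubeflow -o jsonpath='{.spec.containers[*].name}'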

Alternatively, we need to understand whether the upstream Kubeflow Training Operator project never expects these Jobs to run with Istio, which is the case with Katib Experiment CRs.
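
For illustration only, the kind of AuthorizationPolicy this investigation might lead to could look roughly like the sketch below. It is untested: the selector label is an assumption about how the Training Operator labels PaddleJob Pods, and port 37777 is taken from the master address in the logs above; the real fix may need different ports or a different approach entirely.

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-paddle-simple-cpu   # hypothetical name
  namespace: admin
spec:
  # Assumed label; verify the actual labels on the PaddleJob worker Pods.
  selector:
    matchLabels:
      training.kubeflow.org/job-name: paddle-simple-cpu
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["admin"]
      to:
        - operation:
            ports: ["37777"]   # master port from the launch logs; other ports may be needed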

kimwnasptd commented 1 year ago

Another possible issue, which I need to confirm, is that Pods in the same namespace can't talk to each other at all (again due to the Istio sidecars).

If that's the case, though, we'll need to prioritize this more, since it could be a blocker for some use cases.

Linking also to some issues that seem to have relevant context: https://github.com/canonical/seldon-core-operator/issues/110 and https://github.com/canonical/seldon-core-operator/issues/109

NohaIhab commented 1 year ago

The Istio sidecar being injected is the issue: training jobs don't work with Istio sidecar injection. From the upstream documentation:

If you are using Kubeflow with Istio, you have to disable sidecar injection.

We should update our documentation to say that, too.

You can disable Istio sidecar injection by running:

yq -i '.spec.paddleReplicaSpecs.Worker.template.metadata.annotations."sidecar.istio.io/inject" = "false"' simple-cpu.yaml

where simple-cpu.yaml is the PaddleJob definition. The example should then succeed in the user namespace.
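
For clarity, after running that yq command the Worker template in simple-cpu.yaml gains just this annotation (other fields of the example manifest are unchanged and omitted here):

spec:
  paddleReplicaSpecs:
    Worker:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"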

kimwnasptd commented 1 year ago

After discussing with @NohaIhab, we decided to submit a PR to the upstream docs to expose this detail. Noha found that this is already documented for TFJobs, so we'll need to make sure it is documented for PaddlePaddle as well: https://www.kubeflow.org/docs/components/training/tftraining/

Ideally, we'd like upstream to support running Training Jobs and Experiments alongside Istio: https://github.com/kubeflow/training-operator/issues/1681 and https://github.com/kubeflow/katib/issues/1638

kimwnasptd commented 1 year ago

Again, the issue is that upstream does not work with Istio sidecars. The example mentioned above was intended to be run in the kubeflow namespace, without a sidecar.

NohaIhab commented 1 year ago

To move forward, we will make the effort to document how to deploy PaddlePaddle and other jobs in non-kubeflow namespaces. Removing the bug label and marking this as a docs issue.

ColmBhandal commented 11 months ago

Discussed this with Noha. It is expected behaviour of upstream Kubeflow to require disabling sidecar injection for any Training Operator jobs. This is documented for TensorFlow jobs, but not for PaddlePaddle.

Therefore, this is a documentation issue with upstream Kubeflow.

We currently don't have any CKF docs on the Training Operator (or PaddlePaddle). If we did, we'd need to document this there (as we do for Katib).

So the only action to take for now is to submit a PR to the upstream docs.