PaddlePaddle / PaddleCloud

PaddlePaddle Docker images and K8s operators for PaddleOCR/Detection developers to use on public/private cloud.
Apache License 2.0

Do we need paddlectl client once we have the kubernetes custom controller? #383

Open · typhoonzero opened this issue 6 years ago

typhoonzero commented 6 years ago

Once we have a TPR/CRD-declared resource:

apiVersion: paddlepaddle.org/v1
kind: TrainingJob
metadata:
  name: job-1
spec:
  image: "paddlepaddle/paddlecloud-job"
  trainer:
    entrypoint: "python train.py"
    workspace: "/home/job-1/"
    min-instance: 3
    max-instance: 6
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
        cpu: "800m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "600Mi"
  pserver:
    min-instance: 3
    max-instance: 3
    resources:
      limits:
        cpu: "800m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "600Mi"

Running kubectl create -f job.yaml would be exactly equivalent to the current paddlectl submit -jobname xxx -gpu xxx ...

The only difference is that the paddlectl client can also upload and download training data files.
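
For illustration, a minimal Go sketch of what kubectl create -f job.yaml does for the spec above, i.e. creating the TrainingJob custom resource programmatically. The plural resource name "trainingjobs" and the use of a recent client-go dynamic client API are assumptions, not part of the original design:

// submit_job.go: create a TrainingJob custom resource from job.yaml,
// roughly what `kubectl create -f job.yaml` does. Sketch only.
package main

import (
	"context"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

func main() {
	// Load kubeconfig the same way kubectl does.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Decode job.yaml into an unstructured object.
	data, err := os.ReadFile("job.yaml")
	if err != nil {
		log.Fatal(err)
	}
	obj := &unstructured.Unstructured{}
	if err := yaml.Unmarshal(data, &obj.Object); err != nil {
		log.Fatal(err)
	}

	// TrainingJob is served at paddlepaddle.org/v1; "trainingjobs" is assumed.
	gvr := schema.GroupVersionResource{
		Group: "paddlepaddle.org", Version: "v1", Resource: "trainingjobs",
	}
	_, err = client.Resource(gvr).Namespace("default").
		Create(context.TODO(), obj, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("TrainingJob created")
}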

Yancey1989 commented 6 years ago

Cool! Maybe we can use kubectl instead of paddlectl? I have some ideas about this:

typhoonzero commented 6 years ago

One disadvantage: kubectl exposes too many details of Kubernetes that users may never need.

Yancey1989 commented 6 years ago

One more suggestion: shall we change the resource name from TrainingJob to Paddle? Maybe that makes more sense.

gongweibao commented 6 years ago

Another disadvantage: if a YAML file's format is not right, it's hard to find where the problem is, so it's not convenient for users.
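
As a small sketch of how a client could soften this problem (not part of the original thread), gopkg.in/yaml.v3 reports the offending line number in its parse errors, which paddlectl or the cloud server could surface to the user before anything reaches Kubernetes:

// validate.go: client-side YAML sanity check before submission (sketch).
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

func main() {
	data, err := os.ReadFile("job.yaml")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var obj map[string]interface{}
	if err := yaml.Unmarshal(data, &obj); err != nil {
		// Parse errors look like "yaml: line 12: mapping values are not
		// allowed in this context", which points the user at the broken line.
		fmt.Fprintln(os.Stderr, "invalid job.yaml:", err)
		os.Exit(1)
	}
	fmt.Println("job.yaml parses cleanly; kind =", obj["kind"])
}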

typhoonzero commented 6 years ago

@Yancey1989 I think TrainingJob is more general, since it covers more than just Paddle training.

putcn commented 6 years ago

This is an interesting thought. 👍 My 2 cents: can we make paddlectl a kind of proxy to kubectl, so that we can filter out the features we don't want to expose to the end user before the parameters actually hit kubectl, while still keeping the same command pattern?

helinwang commented 6 years ago

Maybe our local command line can take the YAML as input, so we don't have to map the user's input to the YAML again.

I am more inclined not to allow our users to use kubectl, since what we want to support is just a subset of kubectl (e.g., do we want to allow the user to create any Pod?). Maybe we can use @putcn's idea: "make paddlectl a kind of proxy to kubectl, so that we can do some filtering".

typhoonzero commented 6 years ago

I support @putcn's idea! Proxying and filtering is simple enough and easy!

Yancey1989 commented 6 years ago

From @helinwang

do we want to allow the user to create any Pod?

I don't think so; it's not safe and would be out of our control.

From @putcn

make paddlectl a kind of proxy to kubectl, so that we can do some filtering

It's a good idea! We can use the cloud server as a proxy: paddlectl converts the command-line parameters to YAML, and the cloud server submits the YAML to Kubernetes.
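
A minimal sketch of that proxy flow on the client side. The -jobname and -gpu flags come from the thread above; the -trainers and -server flags and the /api/v1/trainingjobs endpoint are assumptions for illustration, not the actual paddlectl implementation:

// paddlectl_submit.go: sketch of "paddlectl submit" as a thin proxy. It maps
// a small, whitelisted set of flags to a TrainingJob spec and sends the YAML
// to the cloud server, which submits it to Kubernetes on the user's behalf.
package main

import (
	"bytes"
	"flag"
	"log"
	"net/http"

	"sigs.k8s.io/yaml"
)

func main() {
	name := flag.String("jobname", "", "training job name")
	image := flag.String("image", "paddlepaddle/paddlecloud-job", "job image")
	gpu := flag.Int("gpu", 0, "GPUs per trainer")
	trainers := flag.Int("trainers", 3, "minimum trainer instances")
	server := flag.String("server", "http://cloud-server:8080", "cloud server address")
	flag.Parse()

	// Only the fields we choose to expose end up in the spec; everything else
	// about Kubernetes stays hidden behind the cloud server.
	job := map[string]interface{}{
		"apiVersion": "paddlepaddle.org/v1",
		"kind":       "TrainingJob",
		"metadata":   map[string]interface{}{"name": *name},
		"spec": map[string]interface{}{
			"image": *image,
			"trainer": map[string]interface{}{
				"min-instance": *trainers,
				"resources": map[string]interface{}{
					"limits": map[string]interface{}{
						"alpha.kubernetes.io/nvidia-gpu": *gpu,
					},
				},
			},
		},
	}

	body, err := yaml.Marshal(job)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post(*server+"/api/v1/trainingjobs", "application/yaml", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("submitted:", resp.Status)
}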

Yancey1989 commented 6 years ago

Maybe I can develop this feature. How about pushing it to the controller branch, so that we can publish a complete feature (auto-scaling) when we merge into the develop branch?

helinwang commented 6 years ago

@Yancey1989 Sure, that would be awesome!

pineking commented 6 years ago

That's a great idea. I have one more question, @Yancey1989: why do we need the cloud server to submit the YAML to Kubernetes? Couldn't paddlectl submit the YAML directly?

Yancey1989 commented 6 years ago

Hi @pineking, as described in the design in #378, PaddleCloud has its own account management, and RBAC in Kubernetes is too simple for that, so we cannot submit the YAML directly. I think this is the main reason.

pineking commented 6 years ago

@Yancey1989, thanks, I will read the design.

helinwang commented 6 years ago

Today's discussion result:

  1. We still need the server, since it knows about cloud storage. The command line will stay backward compatible (internally converting flags to YAML) and will also support submitting a YAML file directly. The client will send the YAML to the server (see the server-side sketch after this list).

  2. Eventually the controller will start / scale / kill training jobs (right now the controller only scales jobs).
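
A minimal sketch of the server-side half under those assumptions: the cloud server checks PaddleCloud's own account token (the header name is a placeholder), accepts only TrainingJob objects so users cannot create arbitrary Pods, and hands them off to Kubernetes. submitToKubernetes is a hypothetical stub for that hand-off:

// cloud_server_submit.go: sketch of the server-side filter discussed above.
package main

import (
	"io"
	"log"
	"net/http"

	"sigs.k8s.io/yaml"
)

// submitToKubernetes is a placeholder: in a real server this would create the
// TrainingJob custom resource with the cloud server's cluster credentials,
// not the user's.
func submitToKubernetes(obj map[string]interface{}) error { return nil }

func handleSubmit(w http.ResponseWriter, r *http.Request) {
	// Placeholder account check: PaddleCloud's own account management, not
	// Kubernetes RBAC, decides whether the caller may submit at all.
	if r.Header.Get("X-Paddle-Token") == "" {
		http.Error(w, "unauthenticated", http.StatusUnauthorized)
		return
	}

	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	var obj map[string]interface{}
	if err := yaml.Unmarshal(body, &obj); err != nil {
		http.Error(w, "invalid YAML: "+err.Error(), http.StatusBadRequest)
		return
	}

	// The filtering step: users may only create TrainingJob resources.
	if obj["kind"] != "TrainingJob" {
		http.Error(w, "only TrainingJob resources are accepted", http.StatusForbidden)
		return
	}
	if err := submitToKubernetes(obj); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusCreated)
}

func main() {
	http.HandleFunc("/api/v1/trainingjobs", handleSubmit)
	log.Fatal(http.ListenAndServe(":8080", nil))
}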