dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Operator CRD spec too limited for production use #447

Closed: samdyzon closed this issue 2 years ago

samdyzon commented 2 years ago

This issue is a follow-up to #446 - after finishing the work on that PR, I realised that my team also needs to inject a serviceAccountName property into the worker pods that get launched by the operator. So naturally, I started putting together a PR for this patch too - but it occurred to me that this is a problem that is just going to keep expanding as other developers need more k8s configurations injected into their pods. For example:

(* denotes configurations my team already requires)

It occurred to me that we would ultimately just be recreating the Pod core/v1 spec and embedding it inside the CRD files we create, leading to a significant maintenance burden to keep the CRD up-to-date with the latest Pod API, not to mention possible bugs introduced across multiple versions of the Pod spec. Plus it's not DRY, KISS, or any of that stuff - personally, I'd hate having to maintain this.

Proposed Solution

Any solution to this problem should aim to eliminate the overhead and problems associated with re-writing k8s standard configurations (like Pods/Deployments) while enabling customisation of a CRD which works with the current operator structure (i.e. I'm not proposing rewriting the operator). Since the operator does not currently have a build/release pipeline and CRDs aren't currently published in any standardised way, we could use a build process to convert OpenAPI v3 references ($ref) into a "compiled" yaml template. This isn't a new idea; the developers at Agones [1] have some CRD definitions that are generated directly from their Go source code.

An example in pseudocode:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: daskworkergroups.kubernetes.dask.org
spec:
  scope: Namespaced
  group: kubernetes.dask.org
  names:
    kind: DaskWorkerGroup
    ...
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            template:
              # TODO: Point this to something that can be resolved somehow
              $ref: 'io.k8s.api.core.v1.PodTemplateSpec'
            status:
              type: object
...

After running a build process over this template using something like k8s-crd-resolver, you would get something like:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: daskworkergroups.kubernetes.dask.org
spec:
  scope: Namespaced
  group: kubernetes.dask.org
  names:
    kind: DaskWorkerGroup
    ...
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            template:
              description: PodTemplateSpec describes the data a pod should have when created from a template
              properties:
                metadata: ...
                spec:
                  description: PodSpec is a description of a pod.
                  ...
            status:
              type: object
...

Note that the output of the second yaml snippet is the real configuration - I put the actual output of k8s-crd-resolver in this gist: https://gist.github.com/samdyzon/2082109b588fef3031b250e3def01feb

Theoretically this methodology could also be used to build a pseudo-CRD for each Dask resource (scheduler and worker) individually, and the build process could create one or more CRD specifications for DaskCluster and DaskWorkerGroup without duplicating source code.
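
As a rough illustration of that composition step, a small build script could splice one shared, already-resolved PodTemplateSpec schema into several CRD templates so that DaskCluster and DaskWorkerGroup never duplicate it. This is only a sketch; the file names and the way the placeholder is detected are assumptions, and in practice k8s-crd-resolver (or similar) would perform the resolution itself:

import yaml


def splice_pod_template(crd_path, pod_schema_path, out_path):
    # Load the shared, fully-resolved PodTemplateSpec schema.
    with open(pod_schema_path) as f:
        pod_schema = yaml.safe_load(f)

    with open(crd_path) as f:
        crd = yaml.safe_load(f)

    # Replace any `$ref: io.k8s.api.core.v1.PodTemplateSpec` placeholder
    # with the shared schema so each built CRD is self-contained.
    for version in crd["spec"]["versions"]:
        props = version["schema"]["openAPIV3Schema"]["properties"]
        if props.get("template", {}).get("$ref", "").endswith("PodTemplateSpec"):
            props["template"] = pod_schema

    with open(out_path, "w") as f:
        yaml.safe_dump(crd, f)


# Build both CRDs from their templates without duplicating the pod schema.
for resource in ("daskcluster", "daskworkergroup"):
    splice_pod_template(
        f"{resource}.template.yaml", "podtemplatespec.yaml", f"dist/{resource}.yaml"
    )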

After installing the built CRD into a cluster, a developer could then deploy their resources with as much or as little configuration as needed, but with the same API as deploying a standard pod:

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: simple-cluster
  namespace: dask
spec:
  template:
    metadata: ...
    spec:
      serviceAccountName: dask-service-account
      containers:
      - name: worker
        image: dask/dask:latest
        args: ...
        ports:
          - ...
        env:
          - name: TEST_ENVIRONMENT
            value: hello-world

Finally, once the output CRD spec has everything that is needed to spin up a pod, it is trivial to include the pod spec in build_worker_pod_spec and build_scheduler_service_spec:

def build_worker_pod_spec(name, namespace, image, n, scheduler_name, template):
    # `template` refers to the CRD property we generated during build (see the gist for more).
    # `image` is kept from the existing signature, but the container image now comes
    # from the template's own container spec.
    worker_name = f"{name}-worker-{n}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # The generated schema also exposes template["metadata"], but here we only
        # consume template["spec"] and set our own metadata.
        "metadata": {
            "name": worker_name,
            "labels": {
                "dask.org/cluster-name": scheduler_name,
                "dask.org/workergroup-name": name,
                "dask.org/component": "worker",
            },
        },
        "spec": template["spec"],
    }
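
To sketch how this could be wired into the operator, here is a minimal, hypothetical create handler for DaskWorkerGroup that passes the embedded template through build_worker_pod_spec and on to the Kubernetes API. The handler name, the replicas field, the scheduler naming convention, and the use of the synchronous kubernetes client are illustrative assumptions, not the operator's actual code:

import kopf
from kubernetes import client, config


@kopf.on.create("kubernetes.dask.org", "v1", "daskworkergroups")
def daskworkergroup_created(spec, name, namespace, **kwargs):
    # Assumes the operator runs in-cluster; use config.load_kube_config() locally.
    config.load_incluster_config()
    api = client.CoreV1Api()

    template = spec["template"]  # the embedded PodTemplateSpec from the CRD
    scheduler_name = f"{name}-scheduler"  # assumed naming convention

    for n in range(spec.get("replicas", 1)):
        pod = build_worker_pod_spec(
            name=name,
            namespace=namespace,
            # image is kept only to satisfy the existing signature
            image=template["spec"]["containers"][0]["image"],
            n=n,
            scheduler_name=scheduler_name,
            template=template,
        )
        api.create_namespaced_pod(namespace=namespace, body=pod)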

Outcomes:

Dependencies

As I mentioned in my PR, my team and I are extremely keen to get this operator into production - our current methodology for managing our Dask workloads is a massive source of technical debt and reliability issues which I want to remove ASAP. With that in mind, I'd be willing to contribute to the development of this process and help design the full workflow. I'm not keen on starting this work without some feedback from the maintainers though, since I'd be working on this out-of-office hours.

What do you guys think? If there is interest, I can put together some more detailed design documentation.

References:

[1] - They use a build script, export-openapi.sh, that produces a YAML file with the embedded spec inside it (and parses some comments to add documentation, etc.) - Generated CRD from build tools

jacobtomlinson commented 2 years ago

This sounds great! We knew we would need to think about how to handle this next, and the proposal laid out here sounds like a great solution.

Having a pod spec embedded within the worker group spec makes total sense, we would probably also want this optionally for the scheduler too. I wasn't aware of k8s-crd-resolver but that seems pretty neat. I worry a little about depending on a project that appears to have no user community (no issues, only 1 star, etc), but as you say it looks like it is maintained and released regularly.

To your comments on build pipelines: we use GitHub Actions and have a release workflow that pushes Python packages and Docker images. As you say, we are just leaving the CRDs in the repo and in the docs giving instructions to install straight from there. We could definitely do this better and I would be keen to hear if you have advice on how other projects handle this?

samdyzon commented 2 years ago

Thanks for your feedback, I'm glad to hear that my idea isn't too crazy! I'm keen to help move this along if you are interested, so I will start putting together a formal design for you guys to have a look at. In regards to your comments:

Having a pod spec embedded within the worker group spec makes total sense, we would probably also want this optionally for the scheduler too.

I was planning to suggest that the pod spec be required for both scheduler and worker resources. The CRD should be explicit in what it expects, so you always know what resources are being scheduled. My opinion is that if you want the pod spec to be "optional", it should be optional in the KubeCluster class, not in the CRD - this way you can implement the logic for a default scheduler spec in code that is decoupled from the operator.

I wasn't aware of k8s-crd-resolver but that seems pretty neat. I worry a little about depending on a project that appears to have no user community (no issues, only 1 star, etc), but as you say it looks like it is maintained and released regularly.

Yeah, I totally understand the hesitance here, but it is such a simple project that it would likely never get a user community, and it would be easily forked if the owner were to go AWOL (it's a 130 LOC CLI tool).

To your comments on build pipelines: we use GitHub Actions and have a release workflow that pushes Python packages and Docker images. As you say, we are just leaving the CRDs in the repo and in the docs giving instructions to install straight from there. We could definitely do this better and I would be keen to hear if you have advice on how other projects handle this?

This is the interesting problem where I would need more input from you guys. My first instinct would be to build a helm chart for the CRD and operator. This provides a standard mechanism for releasing the built CRD template. I had a look through the dask/helm-chart project and see that you have an automated pipeline for updating the charts when new versions of Dask are released. Since there is already an established methodology, perhaps the same could be applied here - when a new version of the dask-kubernetes project is released, a pipeline in helm-charts is kicked off, which would:

Some drawbacks to this idea:

I don't have a good solution in mind for this; my default position is to find a standard methodology to use (hence Helm), but the limitations of that standard may be too much maintenance/hassle to be worth it. Not to mention that users may wish to update the CRD in the future, or build the CRD against specific Kubernetes API versions, which a helm chart wouldn't support anyway. What do you think?

jacobtomlinson commented 2 years ago

My opinion is that if you want the pod spec to be "optional", it should be optional in the KubeCluster class, not in the CRD - this way you can implement the logic for a default scheduler spec in code that is decoupled from the operator.

The goal here is for folks to be able to create clusters either via the k8s API directly or via the Python API. If the pod spec is only optional in the Python API isn't that going to add pain for the k8s users? I also think in the majority of cases the scheduler pod spec will match the worker spec so it would be nice to avoid duplication where we can.

Yeah, I totally understand the hesitance here, but it is such a simple project that it would likely never get a user community, and it would be easily forked if the owner were to go AWOL (it's a 130 LOC CLI tool).

Ok you've sold me on this :). Given it's a small and obscure dependency we will probably need to get them set up on conda-forge, but we can do that without their input if necessary. I already had to do this for kopf and pykube when we added the operator.

My first instinct would be to build a helm chart for the CRD and operator. This provides a standard mechanism for releasing the built CRD template.

We've definitely thought about packaging the operator as a helm chart. I think one of the reasons we haven't so far is that one of our primary targeted use cases is Kubeflow which does not use helm at all. They seem to favour CRDs in source on GitHub and kustomize for making user tweaks. I'm definitely not against a helm chart, but I worry about it being the only option, and maintaining two options is extra overhead.

I wonder if one solution would be to add pre-commit support to k8s-crd-resolver and store the built CRDs somewhere alongside the template ones. That way, every time we change the CRD template and commit, the hook would update the built CRDs (or pre-commit CI would do this on the PR if the user doesn't have the hooks installed properly). Then in the docs we would link to the built ones (and to specific tags rather than main like we do currently). We already do this with tools like frigate in the Helm Chart repo for keeping autogenerated READMEs up to date.

If we do want to go down the Helm Chart road it's worth noting that the dask/helm-chart repo has a main branch with some chart source on it and a gh-pages branch with the actual helm repo in it. Other projects like dask-gateway build their charts in their own repo workflows and just push the artifacts to the gh-pages branch on the helm repo. That way all our charts can be found at https://helm.dask.org but it doesn't require us to keep all the chart source in the chart repo.

samdyzon commented 2 years ago

The goal here is for folks to be able to create clusters either via the k8s API directly or via the Python API. If the pod spec is only optional in the Python API isn't that going to add pain for the k8s users? I also think in the majority of cases the scheduler pod spec will match the worker spec so it would be nice to avoid duplication where we can.

I think my main goal in this case is to be more explicit about resources that are being deployed through k8s manifests. Kubernetes manifests are declarative and explicit - what you see is what you get - so requiring PodSpec configuration at this level is more idiomatic. I'd expect that if a developer is writing k8s yaml files, they are not going to be fazed by writing two pod specs. There is already enough difference between the scheduler and worker pod specs that the amount of repeated config is negligible.

For a Python developer, or someone who is not interested in writing yaml manifests, you can provide some sane defaults and inject them into the creation of the DaskCluster resource. So if a developer were to call the experimental KubeCluster but didn't pass in any parameters for the scheduler pod, we can provide a default for them easily. This is perfectly reasonable from this perspective because it's still idiomatic - you've written a custom API for deploying your custom resources, so you control the logic for managing the sane defaults.
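
For example, a minimal sketch of that defaulting, assuming a hypothetical helper used by KubeCluster - the function names, the spec layout with separate scheduler and worker templates, and the default values are illustrative assumptions rather than the actual dask_kubernetes API:

def default_scheduler_pod_spec(image="dask/dask:latest"):
    # A sane default scheduler pod for users who never touch yaml.
    return {
        "containers": [
            {
                "name": "scheduler",
                "image": image,
                "args": ["dask-scheduler"],
                "ports": [
                    {"name": "tcp-comm", "containerPort": 8786},
                    {"name": "http-dashboard", "containerPort": 8787},
                ],
            }
        ]
    }


def build_cluster_spec(name, worker_template, scheduler_template=None):
    # The CRD always requires an explicit scheduler template; the Python API
    # fills one in whenever the caller doesn't care about it.
    if scheduler_template is None:
        scheduler_template = {"spec": default_scheduler_pod_spec()}
    return {
        "apiVersion": "kubernetes.dask.org/v1",
        "kind": "DaskCluster",
        "metadata": {"name": name},
        "spec": {
            "scheduler": {"template": scheduler_template},
            "worker": {"template": worker_template},
        },
    }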

Ok you've sold me on this :). Given it's a small and obscure dependency we will probably need to get them set up on conda-forge, but we can do that without their input if necessary. I already had to do this for kopf and pykube when we added the operator.

Cool! Too easy. For now, while I'm doing some local dev, I'll add git+http://github.com/elemental-lf/k8s-crd-resolver to requirements.txt, which works fine. I'm not familiar with conda/conda-forge so maybe I'll leave that up to you.

I wonder if one solution would be to add pre-commit support to k8s-crd-resolver and store the built CRDs somewhere alongside the template ones. That way, every time we change the CRD template and commit, the hook would update the built CRDs (or pre-commit CI would do this on the PR if the user doesn't have the hooks installed properly). Then in the docs we would link to the built ones (and to specific tags rather than main like we do currently). We already do this with tools like frigate in the Helm Chart repo for keeping autogenerated READMEs up to date.

That's an interesting methodology and totally makes sense to me - kinda like a "dist" folder inside the repo with the built CRD templates committed. The rendered CRD files will be pretty large but if you don't mind having some dist files committed to the repo, I'm on board with this kind of technique.

If we do want to go down the Helm Chart road it's worth noting that the dask/helm-chart repo has a main branch with some chart source on it and a gh-pages branch with the actual helm repo in it. Other projects like dask-gateway build their charts in their own repo workflows and just push the artifacts to the gh-pages branch on the helm repo. That way all our charts can be found at https://helm.dask.org/ but it doesn't require us to keep all the chart source in the chart repo.

I'm intrigued by this idea too - at a personal level, I anticipate deploying this service to our cluster using a Helm chart, even if that means maintaining my own inside our project. I'm sure there are others that would be interested in a helm deployment too. I totally understand your desire to not maintain multiple methods of deployment, so I'll leave that decision up to you. I'm looking through the dask-gateway repo to see how they do it, and looking at the doco for chartpress for more details too.

jacobtomlinson commented 2 years ago

I think my main goal in this case is to be more explicit about resources that are being deployed through k8s manifests.

Your reasoning makes total sense here. I'm happy with that.

I'm not familiar with conda/conda-forge so maybe I'll leave that up to you.

No problem.

That's an interesting methodology and totally makes sense to me - kinda like a "dist" folder inside the repo with the built CRD templates committed. The rendered CRD files will be pretty large but if you don't mind having some dist files committed to the repo, I'm on board with this kind of technique.

Yeah that's pretty much what I was thinking. That is fine with me.

... I anticipate deploying this service to our cluster using a Helm chart, even if that means maintaining my own inside our project.

If you are happy to contribute a chart then I would be happy to merge it. If keeping it in sync with the dist CRDs proves to be challenging we can rethink it later and always drop it if it becomes a burden.

jacobtomlinson commented 2 years ago

I think everything here was covered in #452. I also updated the deploy workflow in #478 to release the new helm chart.

I've tagged 2022.5.0 so this should be good to close. If there is anything outstanding, @samdyzon, would you mind raising separate issues to track it?