flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[BUG] Namespace creation fails with default pod template #5376

Closed · nuhaj closed this issue 3 months ago

nuhaj commented 5 months ago

Describe the bug

When configuring a basic default pod template, the namespaces are not created. Projects are created without error, but there is no trace of the namespaces for the project. When deploying a workflow, propeller fails with: failed to create workflow in propeller namespaces "test-development" not found…

After removing the default pod template, the namespaces are created.

Expected behavior

I expect the namespaces to be created, or an error when creation fails. I expect the default pod template to be applied (as either the primary or default container) or ignored.

Additional context to reproduce

clusterResourceTemplates = {
  inline = {
    "test_template.yaml" = file("${path}/test_template.yaml")
  }
}
...
plugins = {
  k8s = {
    default-pod-template-name = "test_template.yaml" # also tried with "test_template"
  }
}

I trimmed the default template down to the most basic skeleton for testing. test_template.yaml:

apiVersion: v1
kind: PodTemplate
metadata:
  name: test_template
  namespace: flyte # default for all namespaces
template:
  metadata:
    labels:
      foo: from-pod-template
  spec:
    containers:
      - name: noob # should be ignored; also tried with default and primary
        image: docker.io/rwgrim/docker-noop

Screenshots

No response


welcome[bot] commented 5 months ago

Thank you for opening your first issue here! 🛠

runllm[bot] commented 5 months ago

The issue you're experiencing seems related to the configuration and namespace handling of PodTemplates in Flyte. According to this discussion, PodTemplates need to be present in the namespace where the execution is triggered. Additionally, another discussion mentions that PodTemplates are namespace-specific and should be configured accordingly. Lastly, this issue highlights that the default-pod-template-name should be correctly set in the FlytePropeller configuration.

Would you like more details on how to configure the namespaces and PodTemplates correctly?


davidmirror-ops commented 5 months ago

@nuhaj what version of Flyte are you using? @hamersaw is this the PR that freed the watch from having to have the default Pod template set? https://github.com/flyteorg/flytepropeller/pull/614

nuhaj commented 5 months ago

flyte-binary-release:v1.12.0

hamersaw commented 5 months ago

> @nuhaj what version of Flyte are you using? @hamersaw is this the PR that freed the watch from having to have the default Pod template set? https://github.com/flyteorg/flytepropeller/pull/614

Yes, this looks like it.

nuhaj commented 4 months ago

Any updates here? In a similar issue, we cannot use clusterResourceTemplate to define the namespace for flyte-binary.

davidmirror-ops commented 4 months ago

@nuhaj I'll work on reproducing this behavior, but I'm curious, what's the content of your test_template.yaml file?

nuhaj commented 4 months ago

I posted a trimmed version above that I tested just to see if the error would resolve. I expect it to ignore the container, since its name is noob and Flyte expects the name to be default or primary.

Below is the default pod template for all namespaces that I initially planned to set. The intent was a default pod with a persistent volume claim. The definition below works when defined in the task decorator as a compile-time pod template, but not as a default pod template.

apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-workflow-base
  namespace: flyte
template:
  metadata:
    name: flyte-workflow-base
  spec:
    initContainers:
      - name: init
        image: alpine
        volumeMounts:
        - name: shared-data
          mountPath: /data
    containers:
      - name: primary       
        volumeMounts:
          - name: shared-data
            mountPath: /data
    volumes:  
      - name: shared-data
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 5Gi

davidmirror-ops commented 4 months ago

Thanks for sharing. My confusion comes from what Flyte expects in a template for the ClusterResources controller (see example). A PodTemplate is a different resource, and I don't see how any PodTemplate definition would be able to create the project-domain namespaces. In that sense, what the clusterResourceTemplates section should contain is a namespace spec; the PodTemplate can be part of that section too, I guess, but a PodTemplate alone won't create the namespaces, unless I'm missing something. Does that make sense?
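
For reference, a minimal namespace template of the kind that controller consumes looks roughly like this (a sketch following the example in the Flyte docs; the {{ namespace }} variable is rendered by the controller for each project-domain combination):

# Minimal cluster resource template: rendered and applied once per
# project-domain combination managed by the ClusterResources controller.
apiVersion: v1
kind: Namespace
metadata:
  name: '{{ namespace }}'
spec:
  finalizers:
    - kubernetes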

nuhaj commented 4 months ago

Yes, they are separate issues. I was attempting to create a default PodTemplate with a PVC, and separately also to override the default namespace of {{project}}-{{domain}}. For this open report, we can focus on the default PodTemplate resource.

davidmirror-ops commented 4 months ago

@nuhaj regarding the PodTemplate behavior: I just had success using flyte-binary v1.12.0 with the following config:

In the Helm values:

inline:
  plugins:
    k8s:
      inject-finalizer: true
      default-pod-template-name: "flyte-workflow-base"

The PodTemplate:

apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-workflow-base
  namespace: flyte
template:
  metadata:
    name: flyte-workflow-base
  spec:
    initContainers:
      - name: init
        image: alpine
        volumeMounts:
        - name: shared-data
          mountPath: /data
    containers:
      - name: default 
        image: rwgrim/docker-noop     
        volumeMounts:
          - name: shared-data
            mountPath: /data
        terminationMessagePath: "/dev/foo"
    hostNetwork: false
    volumes:  
      - name: shared-data
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 5Gi

And running a simple workflow:

import typing

from flytekit import task, workflow


@task
def say_hello(name: str) -> str:
    return f"hello {name}!"


@task
def greeting_length(greeting: str) -> int:
    return len(greeting)


@workflow
def wf(name: str = "union") -> typing.Tuple[str, int]:
    greeting = say_hello(name=name)
    greeting_len = greeting_length(greeting=greeting)
    return greeting, greeting_len


if __name__ == "__main__":
    print(f"Running wf() { wf(name='passengers') }")

I get the following Pod spec:

Resulting Pod spec:

```
k describe po fa0428345b5ae4f778cd-n0-0 -n flytesnacks-development
Name:             fa0428345b5ae4f778cd-n0-0
Namespace:        flytesnacks-development
Priority:         0
Service Account:  default
Node:             flytebinary/192.168.67.2
Start Time:       Tue, 18 Jun 2024 15:39:53 -0500
Labels:           domain=development
                  execution-id=fa0428345b5ae4f778cd
                  interruptible=false
                  node-id=n0
                  project=flytesnacks
                  shard-key=2
                  task-name=hello-with-podtemplate-say-hello
                  workflow-name=hello-with-podtemplate-wf
Annotations:      cluster-autoscaler.kubernetes.io/safe-to-evict: false
                  primary_container_name: fa0428345b5ae4f778cd-n0-0
Status:           Running
IP:               10.244.0.35
IPs:
  IP:  10.244.0.35
Controlled By:  flyteworkflow/fa0428345b5ae4f778cd
Init Containers:
  init:
    Container ID:  docker://e0cd9f92a5a8ed64e7d8c7eb7af600ffae930eb6901a146a7df076c5058b5e5b
    Image:         alpine
    Image ID:      docker-pullable://alpine@sha256:77726ef6b57ddf65bb551896826ec38bc3e53f75cdde31354fbffb4f25238ebd
    Port:          <none>
    Host Port:     <none>
    State:         Terminated
      Reason:      Completed
      Exit Code:   0
      Started:     Tue, 18 Jun 2024 15:39:55 -0500
      Finished:    Tue, 18 Jun 2024 15:39:55 -0500
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s6sdb (ro)
Containers:
  fa0428345b5ae4f778cd-n0-0:
    Container ID:  docker://2e76cca15e500fdb6d29ad53d5bb11a768ffde81ca4c5a0b533da55104112e15
    Image:         cr.flyte.org/flyteorg/flytekit:py3.11-1.11.0
    Image ID:      docker-pullable://cr.flyte.org/flyteorg/flytekit@sha256:426e7ba39b07f7b9bbc8df5b3166db1e5ac24a1502251820be2b19f8d92b105c
    Port:          <none>
    Host Port:     <none>
    Args:
      pyflyte-fast-execute
      --additional-distribution
      s3://flyte/flytesnacks/development/C3UVRAYHHEEJLHP57SHO6HCKRQ======/script_mode.tar.gz
      --dest-dir
      .
      --
      pyflyte-execute
      --inputs
      s3://flyte/metadata/propeller/flytesnacks-development-fa0428345b5ae4f778cd/n0/data/inputs.pb
      --output-prefix
      s3://flyte/metadata/propeller/flytesnacks-development-fa0428345b5ae4f778cd/n0/data/0
      --raw-output-data-prefix
      s3://flyte/data/k4/fa0428345b5ae4f778cd-n0-0
      --checkpoint-path
      s3://flyte/data/k4/fa0428345b5ae4f778cd-n0-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      hello-with-podtemplate
      task-name
      say_hello
    State:          Running
      Started:      Tue, 18 Jun 2024 15:39:57 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  500Mi
    Requests:
      cpu:     100m
      memory:  500Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:hello-with-podtemplate.wf
      FLYTE_INTERNAL_EXECUTION_ID:        fa0428345b5ae4f778cd
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           hello-with-podtemplate.say_hello
      FLYTE_INTERNAL_TASK_VERSION:        7g1m6UNX8h7taWI8mE39hg
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                hello-with-podtemplate.say_hello
      FLYTE_INTERNAL_VERSION:             7g1m6UNX8h7taWI8mE39hg
      FLYTE_AWS_ENDPOINT:                 http://minio.flyte.svc.cluster.local:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
    Mounts:
      /data from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s6sdb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  shared-data:
    Type:          EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
    StorageClass:
    Volume:
    Labels:        <none>
    Annotations:   <none>
    Capacity:
    Access Modes:
    VolumeMode:    Filesystem
  kube-api-access-s6sdb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age  From               Message
  ----     ------            ---- ----               -------
  Warning  FailedScheduling  21s  default-scheduler  0/1 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "fa0428345b5ae4f778cd-n0-0-shared-data". preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Normal   Scheduled         20s  default-scheduler  Successfully assigned flytesnacks-development/fa0428345b5ae4f778cd-n0-0 to flytebinary
  Normal   Pulling           19s  kubelet            Pulling image "alpine"
  Normal   Pulled            18s  kubelet            Successfully pulled image "alpine" in 946.808333ms (946.812167ms including waiting)
  Normal   Created           18s  kubelet            Created container init
  Normal   Started           18s  kubelet            Started container init
  Normal   Pulling           18s  kubelet            Pulling image "cr.flyte.org/flyteorg/flytekit:py3.11-1.11.0"
  Normal   Pulled            16s  kubelet            Successfully pulled image "cr.flyte.org/flyteorg/flytekit:py3.11-1.11.0" in 2.262136209s (2.262146584s including waiting)
  Normal   Created           16s  kubelet            Created container fa0428345b5ae4f778cd-n0-0
  Normal   Started           16s  kubelet            Started container fa0428345b5ae4f778cd-n0-0
```

Notice here that I used default as the container name, which should instruct flytepropeller to use this spec as a base for all the containers in the Pod, not only the primary one (in which case you'd use primary as the container name).


Regarding the namespace issue, which I wasn't able to reproduce: even without using anything in the clusterResourceTemplates section, creating a new project:

flytectl create project --name projectx --id projectx

The cluster resources controller creates the namespaces:

k get ns
NAME                      STATUS   AGE
default                   Active   63d
flyte                     Active   62d
flytesnacks-development   Active   61d
flytesnacks-production    Active   61d
flytesnacks-staging       Active   61d
kube-node-lease           Active   63d
kube-public               Active   63d
kube-system               Active   63d
projectx-development      Active   6h47m
projectx-production       Active   6h47m
projectx-staging          Active   6h47m

Let me know if you have additional questions.

nuhaj commented 4 months ago

@davidmirror-ops for the namespace, we are overriding the default {{ project }}-{{ domain }} namespace in the cluster resource template. Instead of projectx-development we would get, say, flyte-projectx-development, but pyflyte run still tries to register the workflow to projectx-development. Would a default pod template be used here to override the namespace again?

davidmirror-ops commented 4 months ago

> Would a default pod template be used here to override the namespace again?

Maybe that would work but I don't think it's a very maintainable workaround.

So, is flyte-projectx a project in your environment? Otherwise flyte will still run your workflows on the project-domain namespace.

If this is an existing namespace, you can instruct Flyte to run your executions on a particular namespace:

configuration:
  inline:
    namespace_mapping:
      template: "my_namespace"
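
The template field also accepts project and domain variables; the default Flyte convention would look roughly like this (a sketch, assuming the same flyte-binary values layout as above):

configuration:
  inline:
    namespace_mapping:
      # {{ project }} and {{ domain }} are rendered per execution
      template: "{{ project }}-{{ domain }}"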

You can create new projects using flytectl create project --name <PROJECT_NAME> --id <PROJECT_NAME>

davidmirror-ops commented 3 months ago

@nuhaj is this still an issue in your environment?

nuhaj commented 3 months ago

@davidmirror-ops I was away. Trying this now. Did you also have a section inside clusterResourceTemplates.inline?

clusterResourceTemplates = {

  "flyte-workflow-base.yaml" = ...
davidmirror-ops commented 3 months ago

@nuhaj No, I wasn't setting anything under that section

nuhaj commented 3 months ago

@davidmirror-ops The init container section of the pod describe output does not appear for me, even with a new project and workflow deployment. I do see the pod-template name in the config, but I don't see the YAML contents of the pod template reflected anywhere:

100-inline-config.yaml: |
    k8s:
      default-pod-template-name: pod-template
      inject-finalizer: true

How does the Helm chart know where "flyte-workflow-base" is? Defined the way you have it, there is no context for its path:

inline:
  plugins:
    k8s:
      inject-finalizer: true
      default-pod-template-name: "flyte-workflow-base"

davidmirror-ops commented 3 months ago

> The init container section of the pod describe output does not appear for me even with a new project and workflow deployment.

You mean an init container as part of the flyte-binary pod or the execution one?

> How does the Helm chart know where "flyte-workflow-base" is?

The logic is not applied by the Helm chart. This field ends up in a ConfigMap that propeller then picks up.

When you define this global Pod template as part of the K8s plugin config, propeller starts a watch that looks for the template first in the namespace where the task is being executed; otherwise, it falls back to the flyte namespace (see docs).
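
Given that lookup order, a per-namespace override is possible: a PodTemplate with the same name applied in a specific project-domain namespace takes precedence over the global one in flyte. A sketch, reusing the names and image from the examples above:

apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-workflow-base           # same name as the global template
  namespace: flytesnacks-development  # propeller checks this namespace first
template:
  metadata:
    labels:
      scope: namespace-override
  spec:
    containers:
      - name: default
        image: rwgrim/docker-noop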

nuhaj commented 3 months ago

@davidmirror-ops we managed to get the default pod template working by:

  1. Adding the pod template to the clusterResourceTemplates inline definition (roughly as sketched below)
  2. Removing namespace from metadata (for a default template that applies to all pods):

     metadata:
       name: flyte-workflow-base
       namespace: flyte   # <- this line removed
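
Roughly, the template file we now ship through clusterResourceTemplates looks like this (a sketch; the container spec is trimmed down from the earlier example):

apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-workflow-base  # no namespace field; the cluster resources
                             # controller applies the template in every
                             # project-domain namespace it manages
template:
  metadata:
    name: flyte-workflow-base
  spec:
    containers:
      - name: default
        image: rwgrim/docker-noop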

Thank you for your help; the pod description and template set us in the right direction for debugging.

davidmirror-ops commented 3 months ago

@nuhaj great. Hopefully we'll improve the PodTemplates docs soon to cover some of the gaps. If you have any other questions, please let us know. Thanks!