flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.73k stars 648 forks

[BUG] Tasks get stuck in Queued status #5927

Open sorushsaghari opened 2 days ago

sorushsaghari commented 2 days ago

Describe the bug

Tasks in the Flyte deployment are not executing and remain in either an unknown or queued state indefinitely. No task progresses to the running or completed state, effectively halting workflow execution.

Expected behavior

Tasks should transition from the queued state to running, followed by completion, provided that no errors or resource constraints are encountered.

Additional context to reproduce

1- Set up Flyte using the provided Helm configuration.
2- Trigger a workflow that contains at least one task.
3- Observe that the tasks remain in queued status without progressing.

Helm configuration:

flyte-core-components:
  admin:
    disabled: false
    disableScheduler: false
    disableClusterResourceManager: false
    seedProjects:
      - <project-name>

  propeller:
    disabled: false
    disableWebhook: false
  dataCatalog:
    disabled: false

deployment:
  image:
    repository: <docker-registry-url>/flyte-binary-release
    tag: v1.13.3
  resources:
    limits:
      memory: 4Gi
      cpu: 3
    requests:
      memory: 4Gi
      cpu: 2
  waitForDB:
    image:
      repository: <docker-registry-url>/postgres

configuration:
  database:
    username: <db-username>
    password: <db-password>
    host: <db-host>
    port: 5432
    dbname: <db-name>
    options: sslmode=disable

  storage:
    metadataContainer: <meta-container>
    userDataContainer: <user-container>
    provider: s3
    providerConfig:
      s3:
        disableSSL: true
        v2Signing: true
        authType: accesskey
        accessKey: <s3-access-key>
        secretKey: <s3-secret-key>
        endpoint: "<s3-endpoint>"

  logging:
    show-source: true
    level: 15

  auth:
    enabled: false
  co-pilot:
    image:
      repository: <docker-registry-url>/flytecopilot
      tag: v1.13.3

service:
  type: ClusterIP

ingress:
  create: true
  host: <flyte-host-url>
  separateGrpcIngress: true

rbac:
  create: true
  extraRules:
    - apiGroups:
        - "ray.io"
      resources:
        - rayclusters
        - rayjobs
        - rayservices
      verbs:
        - "*"

serviceAccount:
  create: true

Screenshots

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

davidmirror-ops commented 1 day ago

@sorushsaghari there should be a Pod in the corresponding namespace (maybe flytesnacks-development). Could you share the output of kubectl describe for that Pod?
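A minimal sketch of that check, assuming the task Pods land in a namespace named after the project and domain (flytesnacks-development here is only the guess from the comment above). The wrapper function names and the NAMESPACE variable are hypothetical helpers, not part of Flyte or kubectl:

```shell
# Assumed execution namespace; override NAMESPACE if your project or domain differs.
NAMESPACE="${NAMESPACE:-flytesnacks-development}"

# List the Pods that exist in the execution namespace.
list_task_pods() {
  kubectl get pods -n "$NAMESPACE"
}

# Describe one Pod by name; the Events section at the end of the output
# usually explains why a Pod is Pending or failing to schedule.
describe_task_pod() {
  kubectl describe pod "$1" -n "$NAMESPACE"
}
```

Run list_task_pods first, then pass one of the returned Pod names to describe_task_pod.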

sorushsaghari commented 1 day ago

@davidmirror-ops there are no Pods in the -development, -staging, or -production namespaces of the project.

davidmirror-ops commented 1 day ago

Got it, what about logs from the flyte-binary Pod?
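One way to collect those logs, sketched under the assumption that the Helm release created a Deployment called flyte-binary in a namespace called flyte. Both names depend on how the chart was installed, and the function and variable names below are hypothetical:

```shell
# Assumed release namespace and Deployment name; adjust to your install.
FLYTE_NAMESPACE="${FLYTE_NAMESPACE:-flyte}"
FLYTE_DEPLOYMENT="${FLYTE_DEPLOYMENT:-flyte-binary}"

# Print the most recent log lines from the flyte-binary Deployment.
# The optional first argument is the number of lines (defaults to 500).
flyte_binary_logs() {
  kubectl logs "deployment/$FLYTE_DEPLOYMENT" -n "$FLYTE_NAMESPACE" --tail="${1:-500}"
}

# Keep only the lines that mention an error or failure, case-insensitively.
flyte_binary_errors() {
  flyte_binary_logs "$@" | grep -iE 'error|fail'
}
```

Redirecting flyte_binary_logs to a file gives something that can be attached to the issue, while flyte_binary_errors is a quick first pass over the same output.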

sorushsaghari commented 1 day ago

Here's the log file: binary.log

davidmirror-ops commented 10 hours ago

OK, it looks like you're using namespaces other than the default (totally fine). Could you find a Pod in the corresponding namespace? (Maybe run kubectl get pods -A to start with.)
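A rough sketch of narrowing that cluster-wide listing down to one project's namespaces, assuming they are named with the project as a prefix followed by a hyphen (as in the -development, -staging, and -production namespaces mentioned earlier). The function name and the project argument are hypothetical:

```shell
# List Pods across all namespaces and keep only those whose namespace
# starts with "<project>-". With -A, the first column is the namespace.
find_project_pods() {
  kubectl get pods -A --no-headers |
    awk -v p="$1" 'index($1, (p "-")) == 1'
}
```

Calling find_project_pods with the project name prints nothing when no Pod exists in any of that project's namespaces, which would match what was reported above.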

sorushsaghari commented 10 hours ago

@davidmirror-ops I have already done this and checked every possible namespace, but I can't find any Pod there. My main problem is the logs: they are not descriptive, and I cannot find the problem.