flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[FlyteCTL Feature] Have flytekit return status of a backfill #4621

Open guy4261 opened 9 months ago

guy4261 commented 9 months ago

Describe the feature/command for FlyteCTL

When I launch a backfill from flytekit, the web UI lets me see its state: which of the nodes (n0, n1, ...) are running, which succeeded, and which failed. I would like flytekit to return the same node-level status.

Provide a possible output or UX example

import flytekit
import flytekit.remote
from flytekit.remote.executions import FlyteWorkflowExecution

config = flytekit.configuration.Config.auto("config.yml")
r = flytekit.remote.FlyteRemote(config)

wf: FlyteWorkflowExecution = r.launch_backfill(
    project=project,
    domain=domain,
    from_date=part.from_date,
    to_date=part.to_date,
    launchplan=launchplan,
    launchplan_version=launchplan_version,
    execution_name=execution_name,
)

"""
# Internal structure of the `wf` object:

{
  "id": {
    "project": "marketplace--pa--dags--bpo-ltv-model",
    "domain": "dprod",
    "name": "f34c31c37272c492f928"  # execution id
  },
  "spec": {
    "launchPlan": {
      "resourceType": "LAUNCH_PLAN",
      "project": "marketplace--pa--dags--bpo-ltv-model",
      "domain": "dprod",
      "name": "backfill-bpo_ltv_90d_horizon_1d_observation",
      "version": "bbGJcfQA6wmPZNIqAmHHSg=="  # of the launchplan - these dates
    },
    "metadata": {
      "principal": "ChUxMDgwMzg0MjQ4NDE3Nzc0Mzg5MzgSBmdvb2dsZQ",
      "systemMetadata": {}
    },
    "notifications": {},
    "labels": {},
    "annotations": {},
    "securityContext": {
      "runAs": {}
    },
    "authRole": {}
  },
  "closure": {
    "startedAt": "1970-01-01T00:00:00Z",
    "duration": "0s",
    "createdAt": "2023-12-19T09:26:27.169745Z",
    "updatedAt": "2023-12-19T09:26:27.169745Z"
  }
}
"""

r.fetch_workflow_execution_status(project, domain, name)
# Proposed method (it does not exist yet): it should provide the
# node-level information, which `r.fetch_execution` is missing.


dosubot[bot] commented 9 months ago

🤖

It's great to see your interest in improving the flytekit! Adding the functionality to return the status of a backfill from flytekit would be a valuable addition. Your proposed solution and code example provide a clear understanding of the feature you are looking to add. We encourage you to go ahead and create a pull request with your changes. Thank you for your contribution to flyte!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

guy4261 commented 9 months ago

My bad!!!

r.sync_execution(wf, sync_nodes=True)  # return value is `wf`, updated
wf.node_executions  # will have all the data

Thanks!
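
A minimal sketch of turning that result into a per-phase summary. It assumes `wf.node_executions` maps node IDs to objects exposing `closure.phase`, which is an assumption about flytekit's object model rather than something this thread confirms. `group_nodes_by_phase` is a hypothetical helper, not a flytekit function, and the `SimpleNamespace` stand-ins only mimic that shape so the snippet runs without a cluster.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_nodes_by_phase(node_executions):
    """Group node IDs by the raw phase value stored on each node's closure."""
    grouped = defaultdict(list)
    for node_id, node_exec in node_executions.items():
        grouped[node_exec.closure.phase].append(node_id)
    return dict(grouped)


# Stand-ins for the assumed shape of `wf.node_executions`.
# flytekit is assumed to store an integer enum value in `phase`;
# strings are used here only for readability.
fake_nodes = {
    "n0": SimpleNamespace(closure=SimpleNamespace(phase="SUCCEEDED")),
    "n1": SimpleNamespace(closure=SimpleNamespace(phase="SUCCEEDED")),
    "n2": SimpleNamespace(closure=SimpleNamespace(phase="FAILED")),
}

summary = group_nodes_by_phase(fake_nodes)
```

With a synced execution, the same helper would be called as `group_nodes_by_phase(wf.node_executions)`.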

guy4261 commented 9 months ago

Alright - this works if none of the executions failed; however, if one did, then:

>>> r.sync_execution(wf, sync_nodes=True)
FlyteAssertion: Outputs could not be found because the execution ended in failure.

Why is that? I want to know which node failed - how do I get this information?

Note that in flytectl:

> flytectl \
  --config path/to/config \
  get execution \
  --project my_project \
  --domain my_domain \
  fc877cde175bf4c1787e \
  --details \
  -o yaml

I would simply get the info for all the nodes that executed - all the successful ones + the last, failed one:

- node_exec:
    closure:
      createdAt: "2023-12-19T13:52:06.661984975Z"
      outputUri: s3://my-s3-bucket/metadata/propeller/my-awesome-project-fc877cde175bf4c1787e/start-node/data/0/outputs.pb
      phase: SUCCEEDED
      updatedAt: "2023-12-19T13:52:06.662023413Z"
    id:
      executionId:
        domain: my-domain
        name: fc877cde175bf4c1787e
        project: my-awesome-project
      nodeId: start-node
    inputUri: s3://my-s3-bucket/metadata/propeller/my-awesome-project-fc877cde175bf4c1787e/start-node/data/inputs.pb
    metadata:
      specNodeId: start-node
- inputs:
    inference_date: "2023-01-09T06:15:00Z"
  node_exec:
    closure:
      createdAt: "2023-12-19T13:52:06.757275080Z"
      duration: 909.431643650s
      outputUri: s3://my-s3-bucket/metadata/propeller/my-awesome-project-fc877cde175bf4c1787e/n0/data/0/outputs.pb
      phase: SUCCEEDED
      startedAt: "2023-12-19T13:52:06.986261529Z"
      updatedAt: "2023-12-19T14:07:16.417949966Z"
      workflowNodeMetadata:
        executionId:
          domain: my-domain
          name: fayq1lhuqnm3ka
          project: my-awesome-project
    id:
      executionId:
        domain: my-domain
        name: fc877cde175bf4c1787e
        project: my-awesome-project
      nodeId: n0
    inputUri: s3://my-s3-bucket/metadata/propeller/my-awesome-project-fc877cde175bf4c1787e/n0/data/inputs.pb
    metadata:
      specNodeId: n0
  outputs:
    o0: true
# ... and so on, all the way to the last node, which failed ...
- inputs:
    inference_date: "2023-02-07T06:15:00Z"
  node_exec:
    closure:
      createdAt: "2023-12-19T20:33:08.100551184Z"
      duration: 488.966212162s
      error:
        code: USER:Unknown
        kind: USER
        message: |-
          Traceback (most recent call last):

                File "/home/build/app/venvs/flyte-app-ZGNJbPM0-py3.10/li
      phase: FAILED
      startedAt: "2023-12-19T20:33:08.402672298Z"
      updatedAt: "2023-12-19T20:41:17.368987691Z"
      workflowNodeMetadata:
        executionId:
          domain: my-domain
          name: fr2rt3xmtrpuu5
          project: my-awesome-project
    id:
      executionId:
        domain: my-domain
        name: fc877cde175bf4c1787e
        project: my-awesome-project
      nodeId: n29
    inputUri: s3://my-s3-bucket/metadata/propeller/my-awesome-project-fc877cde175bf4c1787e/n29/data/inputs.pb
    metadata:
      specNodeId: n29
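
To make the structure of this output concrete, here is a small sketch that reduces it to a phase per node and a list of failures. It assumes the YAML has already been parsed into a list of dictionaries (for example with PyYAML's `yaml.safe_load`, which is an extra dependency and not shown). `summarize_node_phases` and `find_failed_nodes` are hypothetical helper names, and `sample_entries` is a trimmed copy of the entries above.

```python
def summarize_node_phases(node_entries):
    """Return {node_id: phase} for entries shaped like the output above."""
    summary = {}
    for entry in node_entries:
        node_exec = entry.get("node_exec", {})
        node_id = node_exec.get("id", {}).get("nodeId", "<unknown>")
        summary[node_id] = node_exec.get("closure", {}).get("phase", "UNDEFINED")
    return summary


def find_failed_nodes(node_entries):
    """Return (node_id, error_code) pairs for nodes whose phase is FAILED."""
    failed = []
    for entry in node_entries:
        node_exec = entry.get("node_exec", {})
        closure = node_exec.get("closure", {})
        if closure.get("phase") == "FAILED":
            node_id = node_exec.get("id", {}).get("nodeId", "<unknown>")
            failed.append((node_id, closure.get("error", {}).get("code")))
    return failed


# Trimmed copy of the entries shown above (only the fields used here).
sample_entries = [
    {"node_exec": {"closure": {"phase": "SUCCEEDED"}, "id": {"nodeId": "start-node"}}},
    {"node_exec": {"closure": {"phase": "SUCCEEDED"}, "id": {"nodeId": "n0"}}},
    {
        "node_exec": {
            "closure": {
                "phase": "FAILED",
                "error": {"code": "USER:Unknown", "kind": "USER"},
            },
            "id": {"nodeId": "n29"},
        }
    },
]

phases = summarize_node_phases(sample_entries)
failures = find_failed_nodes(sample_entries)
```

Missing keys fall back to placeholders rather than raising, since entries such as `start-node` carry fewer fields than the others.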

guy4261 commented 8 months ago

I ended up implementing this myself 🤷 Still, it would have been nice for this to be part of the framework. It wasn't as easy as I'd expected :((
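
The implementation itself is not shown in this thread. One possible approach, as a sketch only, is to skip `sync_execution` and page through node executions with the lower-level client. This assumes the client offers `list_node_executions(execution_id, limit=..., token=...)` returning an `(items, next_token)` pair, with an empty token on the last page. `fetch_all_node_executions` and `FakeClient` are hypothetical names; the fake client exists only so the snippet runs without a cluster.

```python
def fetch_all_node_executions(client, execution_id, page_size=100):
    """Collect every node execution by following pagination tokens."""
    nodes = []
    token = None
    while True:
        page, token = client.list_node_executions(
            execution_id, limit=page_size, token=token
        )
        nodes.extend(page)
        if not token:  # an empty or missing token is assumed to mean "last page"
            return nodes


class FakeClient:
    """Hypothetical stand-in that serves pre-built pages of node IDs."""

    def __init__(self, pages):
        self._pages = pages

    def list_node_executions(self, execution_id, limit=100, token=None):
        index = int(token) if token else 0
        next_token = str(index + 1) if index + 1 < len(self._pages) else ""
        return self._pages[index], next_token


all_nodes = fetch_all_node_executions(
    FakeClient([["start-node", "n0"], ["n29"]]), "fc877cde175bf4c1787e"
)
```

With a real connection this might be called as `fetch_all_node_executions(r.client, wf.id)`, assuming `FlyteRemote.client` exposes that method with the return shape used here. Each returned item would then be inspected for its phase and error.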