flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.22k stars 551 forks source link

[UI Feature] Propagate dynamic workflow error messages from flytepropeller to UI #4466

Open fg91 opened 7 months ago

fg91 commented 7 months ago

Motivation: Why do you think this is important?

Our platform users regularly ping us with executions where a dynamic workflow failed to correctly register.

Flyte's UI doesn't provide helpful information to understand that the registration of the dynamic sub workflow has failed and how to fix it.

In this example, the dynamic task pod itself succeeded but the node failed:

image

We in the MLOps team then typically look in the flytepropeller logs to try to understand how the dynamic subworkflow needs to be fixed.

In this case, the propeller logs showed:

issue

Goal: What should the final outcome look like, ideally?

It would be very helpful if FlyteConsole would show the dynamic workflow related errors occurring in flytepropeller so that users can fix them themselves.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

hamersaw commented 7 months ago

I suspect this is related to https://github.com/flyteorg/flyte/issues/4483. These use a flyteidl error that is wrapped to support the DeepCopy function (here and here), My concern is that the latest update to k8s dependencies changed how the k8s client unmarshals CRDs into json structs.

hamersaw commented 7 months ago

@fg91 do you have a minimal repro? Think I found the issues and fixed similarly for branch node error reporting, but want to double check here.

fg91 commented 7 months ago

@fg91 do you have a minimal repro?

Sorry for the late reply @hamersaw . This is a minimal repro:

from flytekit import task, dynamic, workflow

@task
def foo(a = 1):  # Notice the missing type hint
    print(a)

@dynamic
def sub_wf():
    foo(a=1)

@workflow
def wf():
    sub_wf()

The pod succeeds but the dynamic task fails without any error shown in the UI:

Screenshot 2023-12-19 at 16 54 16

Propeller logs show:

E1219 15:53:40.603043       1 workers.go:103] error syncing '.../f96f28b7a6aa64183869': 0: [User] malformed dynamic workflow, caused by: Collected Errors: 1
    Error 0: Code: MismatchingTypes, Node Id: dn0, Description: Variable [a] (type [simple:INTEGER ]) doesn't match expected type [simple:NONE ].

1: 0: [User] malformed dynamic workflow, caused by: Collected Errors: 1
    Error 0: Code: MismatchingTypes, Node Id: dn0, Description: Variable [a] (type [simple:INTEGER ]) doesn't match expected type [simple:NONE ].
hamersaw commented 7 months ago

@fg91 thanks, this helps a lot. So I'm seeing this is a UI issue. Everything is stored correctly in FlyteAdmin (error included) but just not displayed in the UI. Running your example here, and then querying admin directly with grpcurl shows:

hamersaw@ragnarok:~/Downloads$ ./grpcurl -plaintext -d '{"id": {"execution_id":{"project":"flytesnacks","domain":"development","name":"f4e6b756dcb254e41851"}, "node_id":"n0"}}'  127.0.0.1:30081 flyteidl.service.AdminService.GetNodeExecution
{
  "id": {
    "nodeId": "n0",
    "executionId": {
      "project": "flytesnacks",
      "domain": "development",
      "name": "f4e6b756dcb254e41851"
    }
  },
  "inputUri": "s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f4e6b756dcb254e41851/n0/data/inputs.pb",
  "closure": {
    "error": {
      "code": "DynamicWorkflowBuildFailed",
      "message": "[User] malformed dynamic workflow, caused by: Collected Errors: 1\n\tError 0: Code: MismatchingTypes, Node Id: dn0, Description: Variable [a] (type [simple:INTEGER ]) doesn't match expected type [simple:NONE ].\n",
      "kind": "USER"
    },
    "phase": "FAILED",
    "startedAt": "2023-12-19T20:48:19.671343654Z",
    "duration": "4.689258075s",
    "createdAt": "2023-12-19T20:48:19.643655673Z",
    "updatedAt": "2023-12-19T20:48:24.360736789Z"
  },
  "metadata": {
    "specNodeId": "n0",
    "isDynamic": true
  }
}

So propeller is correctly reporting, just should be displayed. Will re-direct accordingly.

fg91 commented 7 months ago

Thanks for testing @hamersaw 🙏

I understand that frontend is probably pretty busy already but it would help the UX of dynamic workflows a lot if they could implement this. (Unfortunately I have no clue about frontend things otherwise I'd be happy to tackle it myself 🙈)

Our ML engineers make heavy use of dynamic workflows and we in the ops team get a question like "Why is my Flyte workflow in a weird succeeded yet failed state again?" every few days.

davidmirror-ops commented 6 months ago

01/18/2024 Contributor's meetup notes: Members agreed the UI should surface detailed error messages from propeller to aid users. John will investigate propagating errors appropriately in the UI.

dosubot[bot] commented 6 months ago

🤖

Hey @fg91, great to see you again! Trust you've been coding your way to glory since our last interaction. Let's dive into this new challenge you've got for us.

I understand the issue you're facing and I agree that improving the FlyteConsole to display dynamic workflow related errors would greatly enhance the user experience. Here's a high-level approach to achieve this:

  1. Expose Compile Errors through an API Endpoint: Create an API endpoint in the flytepropeller service that returns the list of compile errors. This endpoint should call the Errors() method of the CompileErrors interface to get the list of errors and return it in the response. You can refer to the flytepropeller/pkg/compiler/errors/error.go for more details on how errors are handled in flytepropeller.

  2. Fetch Compile Errors in FlyteConsole: In the FlyteConsole, make a request to the new API endpoint to fetch the list of compile errors when a dynamic workflow fails to register.

  3. Display Compile Errors in FlyteConsole: Display the error details in the FlyteConsole. The error details can be obtained by calling the Error() method of the CompileError and CompileErrorWithCause types.

Please note that this is a high-level approach and the actual implementation may vary based on the existing architecture and design of the flytepropeller and FlyteConsole services.

Regarding the error handling mechanism in the flytepropeller service, it primarily uses the errors package to create, manage, and check for compile errors. You can refer to the flytepropeller/pkg/compiler/workflow_compiler_test.go for more details on how errors are handled, especially for dynamic workflows.

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

fgerzer commented 5 months ago

It would've been very helpful for me too to have this exposed; I just spent an hour trying to understand how my dynamic both failed and succeeded at the same time.

fg91 commented 1 month ago

01/18/2024 Contributor's meetup notes: Members agreed the UI should surface detailed error messages from propeller to aid users. John will investigate propagating errors appropriately in the UI.

@davidmirror-ops please excuse bringing this up again but is there any chance the frontend team could tackle this in one of the upcomming sprints? These types of errors in dynamic workflows are almost impossible to debug for users without the help of a platform engineer because you need to know that you have to search for the execution id in the propeller logs and our users struggle with this again and again 🙈

jdanbrown commented 3 weeks ago

+1 I'm staring at a similarly baffling error msg, and I'm so far trying and failing to guess my way through my team's code to see which of our many @dynamic's (and @task's?) might have a signature that looks like this. I haven't found it yet, so I'm currently pretty confused and trying to google instead, which is how I ended up here.

0: [User] malformed dynamic workflow, caused by: Collected Errors: 5
    Error 0: Code: MismatchingTypes, Node Id: dn10, Description: Variable [train_outputs] (type [simple:STRUCT]) doesn't match expected type [simple:FLOAT].
    Error 1: Code: MismatchingTypes, Node Id: dn7, Description: Variable [train_outputs] (type [simple:STRUCT]) doesn't match expected type [simple:STRING].
    Error 2: Code: ParameterNotBound, Node Id: dn10, Description: Parameter not bound [model_loss].
    Error 3: Code: ParameterNotBound, Node Id: dn10, Description: Parameter not bound [train_time_sec].
    Error 4: Code: ParameterNotBound, Node Id: dn7, Description: Parameter not bound [model_path].