flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.81k stars 660 forks source link

[Core feature] Enhanced promise attribute support for Dataclasses in Python Flytekit #5667

Open JackUrb opened 3 months ago

JackUrb commented 3 months ago

Motivation: Why do you think this is important?

We pass around most of our configuration in flyte via python @dataclass as this makes it very easy for us to manage configuration and such. Unfortunately, this also means that we have to write a ton of wrappers for the inputs + outputs of our flyte @task and @workflow's, as we are often dereferencing attributes to pass along to other inputs, but need to wrap them (in lists and other dataclasses). Something like:

@dataclass
class FunctionAInput:
    x: int
    y: int

@dataclass
class FunctionAOutput:
    a: int
    b: int

@workflow
def my_sub_wf(my_in: FunctionAInput) -> FunctionAOutput:
    return FunctionAOutput(a=my_in.x, b=my_in.y)

@dataclass
class WorkflowInput:
    in_1: int
    in_2: int

@workflow
def run_my_fn(workflow_in: WorkflowInput) -> FunctionAOutput:
    return my_sub_wf(FunctionAInput(x=workflow_in.in_1, y=workflow_in.in_2))

doesn't end up working, as we can't use the attributes of a promise dataclass nested in containers as inputs.

Instead, we're forced to do things like:

@dataclass
class FunctionAInput:
    x: int
    y: int

@task 
def wrap_fn_a_inputs(x: int, y:int) -> FunctionAInput:
    return FunctionAInput(x=x, y=y)

@dataclass
class FunctionAOutput:
    a: int
    b: int

@workflow 
def wrap_fn_a_outputs(a: int, b: int) -> FunctionAOutput:
    return FunctionAOutput(a=a, b=b)

@task
def my_sub_wf(my_in: FunctionAInput) -> FunctionAOutput:
    return wrap_fn_a_outputs(a=my_in.x, b=my_in.y)

@dataclass
class WorkflowInput:
    in_1: int
    in_2: int

@workflow
def run_my_fn(workflow_in: WorkflowInput) -> FunctionAOutput:
    return my_sub_wf(wrap_fn_a_inputs(x=workflow_in.in_1, y=workflow_in.in_2))

While this doesn't add much to the above toy example, when the dataclasses have many more fields and depth, this gets messy fast. Making changes to a dataclass requires us to update all wrappers for it, the actual visualized graph becomes bloated with wrappers, and ease-of-use goes down.

It's also an issue when trying to wrap a dataclass promise's attribute in a container, such as:

@task
def get_int() -> int:
    return 3

@dataclass
class IntWrapper(DataClassJSONMixin):
    x: int

@task
def get_wrapped_int() -> IntWrapper:
    return IntWrapper(x=3)

@task 
def sum_list(input_list: list[int]) -> int:
    return sum(input_list)

@workflow
def convert_list_workflow1() -> int:
    # This workflow is fine
    promised_int = get_int()
    joined_list = [4, promised_int]
    return sum_list(input_list=joined_list)

@workflow
def convert_list_workflow2() -> int:
    # But this one is not
    wrapped_int = get_wrapped_int()
    joined_list = [4, wrapped_int.x]
    return sum_list(input_list=joined_list)

Goal: What should the final outcome look like, ideally?

Overall I imagine the approach is to increase the level of support for Dataclass attributes across flytekit, doing more to:

  1. Allow wrapping a dataclass promise attribute in a standard collection, such as some_task([task_output.x]) a. Importantly, retain the typing as defined by the dataclass, even when only receiving an attribute (issue for complex types and for ints)
  2. Create tooling that allows flyte to use promises in the construction of a dataclass, treating dataclasses like other collection types which can be passed around as input to workflows and functions.

I think this would also resolve issues like https://github.com/flyteorg/flyte/issues/5427

Describe alternatives you've considered

We could use try @eager execution mode for this, but that seems to change a lot of the semantics of flyte as well as not have widespread support just yet. Plus, it would require significant refactors throughout the codebase.

Propose: Link/Inline OR Additional context

I've taken first stabs at pieces of this:

Overall I think this is a huge quality-of-life win for using Flyte with dataclasses, and I'm happy to actually work out the implementation, but I feel like I'm missing some pieces and context on the approach.

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

runllm[bot] commented 3 months ago

The issue you're encountering is related to Flyte's handling of complex types, particularly nested dataclasses. According to this GitHub issue, nested dataclasses are not fully supported in Flyte's type system, which can lead to type mismatches during workflow compilation. This limitation forces users to create numerous wrappers for inputs and outputs, complicating the workflow.

Would you like more detail on potential workarounds or further explanation of the issue?

Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @runllm to ask follow-up questions. See runllm.com for more!

kumare3 commented 2 months ago

@JackUrb you are talking about coalescing inputs and reconstructing a json (dataclass internally) from multiple dataclasses. Today flyte does support exploding a dataclass into individual attributes in the engine (hence you can do so), but not combining them. This is more tricky as Flytepropeller does not know the structure of the json to create. What we need is a json template that can be assembled.

We are discussing this internally, would love to see community support. would you be open to contributing?

JackUrb commented 2 months ago

I'm 100% interested in contributing here. I've been following along on the JSON IDL discussions, but that seems to be just the first step towards something like this being that it creates a fixed scalar rather than a collection type.

JackUrb commented 2 weeks ago

Hi @kumare3 - revisiting this now that the upcoming 1.14 release includes the message pack binary from https://github.com/flyteorg/flyte/pull/5742, and is thus a step closer to this functionality.

I imagine the next step forwards from here would be something that includes both the required typing for promises that are included in a dataclass-with-promises as well as an indicator for the FlytePropeller backend that can list the required promises that need to be resolved before the dataclass can be resolved.

On the flytekit local side, it ultimately ends up as simple as extending the recursive resolver (here: https://github.com/flyteorg/flytekit/pull/2828) to support dataclass as something to expand (as I attempted on an older fork here), but I don't necessarily have a deep enough understanding on where to go poking to replicate the same behavior on the propeller side.