flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.73k stars 649 forks source link

[BUG] [Flytekit] Mismatching Type while serializing Union Types in `_make_dataclass_serializable` #5910

Open mao3267 opened 1 week ago

mao3267 commented 1 week ago

Describe the bug

Mentioned in the discussion under flyteorg/flytekit#2823. When handling Union types in _make_dataset_serializable, we intend to select the first argument in get_args(python_type) because Optional is equivalent to Union[ExpectedType, None]. However, this approach may fail if the type order differs, for example, in Union[None, ExpectedType].

An error reproduction example:

from dataclasses import dataclass
from flytekit import task
from flytekit.types.file import FlyteFile
from typing import Union

@dataclass
class InnerDC:
    ff: Union[None, FlyteFile]

@dataclass
class DC:
    inner_dc: InnerDC

@task
def t_dc() -> DC:
    return DC(inner_dc=InnerDC(ff="s3://path"))

Expected behavior

This code snippet should work, meaning that the underlying Union types in dataclasses should be serialized correctly.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

mao3267 commented 1 week ago

Here is another example I have come up with:

from dataclasses import dataclass
from flytekit import task
from typing import Union

@dataclass
class InnerDC:
    ff: Union[None, int, str]

@dataclass
class DC:
    inner_dc: InnerDC

@task
def t_dc() -> DC:
    return DC(inner_dc=InnerDC(ff="string"))

Since Union can contain multiple type candidates, I’m wondering if we should consider supporting this as well? cc @Future-Outlier @wild-endeavor

Future-Outlier commented 1 week ago

Here is another example I have come up with:

from dataclasses import dataclass
from flytekit import task
from typing import Union

@dataclass
class InnerDC:
    ff: Union[None, int, str]

@dataclass
class DC:
    inner_dc: InnerDC

@task
def t_dc() -> DC:
    return DC(inner_dc=InnerDC(ff="string"))

Since Union can contain multiple type candidates, I’m wondering if we should consider supporting this as well? cc @Future-Outlier @wild-endeavor

YES DO IT