flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.45k stars 583 forks source link

[Core Feature] Support for enum types used in dataclasses #1514

Closed palchicz closed 2 years ago

palchicz commented 3 years ago

Motivation: Why do you think this is important? Since dataclasses are a supported type and Enums are a supported type, one would expect that dataclasses that reference a enum should also be supported. Instead, the protobuf fails to form properly and I get the following warning from the command

pyflyte --pkgs <PACKAGE_NAME> serialize --image <IMAGE_NAME> workflows

WARNING:flytekit:failed to extract schema for object <class 'xxx.yyy.zzz'>, (will run schemaless) error: Currently do not support JSON schema for enums loaded by value

Goal: What should the final outcome look like, ideally? I should be able to define a dataclass and enum like the following:

class Animal(Enum):
    DOG = "dog"
    CAT = "cat"

@dataclass_json
@dataclass
class Person(object):
    pet: Animal = Animal.DOG

Describe alternatives you've considered I've considered implementing a new custom type.

[Optional] Propose: Link/Inline OR Additional context This might have its root cause in this dependency https://github.com/fuhrysteve/marshmallow-jsonschema/blob/master/marshmallow_jsonschema/base.py#L228

welcome[bot] commented 3 years ago

Thank you for opening your first issue here! đź› 

kumare3 commented 3 years ago

@palchicz this is because of the underlying dataclass-json library. We are actually working on supporting #1362 Pickle types. But, pickle is not native to other languages and wont show up in the UI. Whats the usecase?

palchicz commented 3 years ago

Thanks for the quick response! We have a collection of ~10 workflow inputs that are shared among several workflows. These inputs are a collection of closely related parameters, and they always show up together across different workflows, so it makes sense to group them together. Consolidating them into a single dataclass made sense, however, one of these shared inputs is an enum and thus we run into the problem stated above.

For the time being, we can still use the dataclass to group 9/10 of the parameters together so we don't have to repeatedly and explicitly lay them out for each workflow. It's just awkward to have the one odd parameter that we do need to add to the workflow definition explicitly just because it's an enum.

Here are some code example to help clarify what I said above:

class EnumParam(Enum):
    OPTION_1 = "option 1"
    OPTION_2 = "option 2"

# Naive approach
@workflow
def workflow_1(param_1: int, param_2: int, ...., param_n: EnumParam) -> int:
   ...

@workflow
def workflow_2(param_1: int, param_2: int, ...., param_n: EnumParam) -> str:
   ...

# Envisioned Approach
@dataclass_json
@dataclass
class WorkflowParameters(object):
    param_1: int = 0
    param_2: int = 0
    ...
    param_n: EnumParam = EnumParam.OPTION_1

@workflow
def workflow_1(workflow_params: WorkflowParameters) -> int:
   ...

@workflow
def workflow_2(workflow_params: WorkflowParameters) -> str:
   ...

# Workaround 
@dataclass_json
@dataclass
class WorkflowParameters(object):
    param_1: int = 0
    param_2: int = 0
    ...

@workflow
def workflow_1(workflow_params: WorkflowParameters, param_n: EnumParam) -> int:
   ...

@workflow
def workflow_2(workflow_params: WorkflowParameters, param_n: EnumParam) -> str:
   ...
kumare3 commented 3 years ago

@palchicz appreciate giving more clarification. your request absolutely makes sense. One follow up question, do you use the Enum Form in the UI - as shown below?

Screen Shot 2021-09-23 at 4 32 48 PM

This UI feature will not work with dataclasses with nested enums, that is because dataclass_json does not support enum schema extraction AFAICT. But JSON actually supports that. Let us try that.

In anycase enums should be supported by dataclasses, and I think we just need to enable one more library for that? marshmallow_enum - https://github.com/fuhrysteve/marshmallow-jsonschema/blob/master/marshmallow_jsonschema/base.py#L22-L27.

Can you try installing the library and help us debug, and please jump into our slack channel - here and we can work on it with you.

palchicz commented 3 years ago

One follow up question, do you use the Enum Form in the UI - as shown below?

That would be ideal, but not a deal breaker. At the very least there isn't any urgency, so maybe the dataclass_json people will add that feature before too long :smiley: :crossed_fingers:

I can certainly jump in and help. I should have some time next week. I'll join the slack channel like you suggested.

kumare3 commented 3 years ago

One follow up question, do you use the Enum Form in the UI - as shown below?

That would be ideal, but not a deal breaker. At the very least there isn't any urgency, so maybe the dataclass_json people will add that feature before too long :smiley: :crossed_fingers:

I can certainly jump in and help. I should have some time next week. I'll join the slack channel like you suggested.

Awesome, see you soon. Also, if the form is not required- enums should be supported. I think the enum marshmallow library might work

palchicz commented 2 years ago

In anycase enums should be supported by dataclasses, and I think we just need to enable one more library for that? marshmallow_enum - https://github.com/fuhrysteve/marshmallow-jsonschema/blob/master/marshmallow_jsonschema/base.py#L22-L27.

Can you try installing the library and help us debug

I took a stab at debugging this problem and noticed that marshmallow_enum is already installed. The issue is this line here in the mashmallow_jsonschema library which raises an exception when trying to convert the marshmallow enum field into a json schema enum. If you trace the code, it wouldn't get to this point unless marshmallow_enum was installed.

The exact reason it raises an exception is because the marshmallow Enum field has the property field.load_by == LoadDumpOptions.value (which gets set here). The matshmallow_json libraby is probably a little too restrictive in that it just errors out whenever it encounters this case. It should probably allow strings even when the field was loaded by value.

Anyways, that's as far as I got. No solution to propose, but that at least seems to be the issue.

kumare3 commented 2 years ago

We are working on adding more type support into Dataclass, we will take a look at this as part of the work - cc @pingsutw