flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[BUG] pd.DataFrame type doesn't work with dataclasses #3010

Open cosmicBboy opened 2 years ago

cosmicBboy commented 2 years ago

Describe the bug

Using pd.DataFrame as a field type in a dataclass raises an error:

  File "/Users/nielsbantilan/miniconda3/envs/flyte-vscode-demo/lib/python3.9/site-packages/dataclasses_json/core.py", line 201, in _decode_dataclass
    init_kwargs[field.name] = _decode_generic(field_type,
  File "/Users/nielsbantilan/miniconda3/envs/flyte-vscode-demo/lib/python3.9/site-packages/dataclasses_json/core.py", line 258, in _decode_generic
    xs = _decode_items(type_.__args__[0], value, infer_missing)
AttributeError: type object 'DataFrame' has no attribute '__args__'
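
A note on the traceback, as a sketch rather than a confirmed diagnosis: `__args__` exists only on parameterized generics such as `List[int]`. If the decoder decides a field "is a collection" through an abstract-base-class check (an assumption about dataclasses_json internals, not verified here), a plain class that merely looks like a collection would pass that check and then fail when `__args__` is read. The `TableLike` class below is a hypothetical stand-in for pd.DataFrame so the snippet runs without pandas.

```python
from collections.abc import Collection
from typing import List


class TableLike:
    """Hypothetical stand-in for pd.DataFrame: sized, iterable, supports `in`."""

    def __len__(self):
        return 0

    def __iter__(self):
        return iter(())

    def __contains__(self, item):
        return False


# A parameterized generic carries its element type in __args__.
list_args = List[int].__args__

# The stand-in passes the ABC check purely by duck typing...
looks_like_collection = issubclass(TableLike, Collection)

# ...but, being a plain class, it has no __args__ to read an element type from.
has_args = hasattr(TableLike, "__args__")

print(list_args, looks_like_collection, has_args)
```

Under that assumption, any non-generic, collection-like field type would hit the same AttributeError, not only pd.DataFrame.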

Expected behavior

This should work the same way StructuredDataset does.

Additional context to reproduce

Using this type in a task

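from dataclasses import dataclass
from typing import List

import pandas as pd
from dataclasses_json import dataclass_json
from flytekit import task

@dataclass_json
@dataclass
class TrainArgs:
    hyperparameters: dict
    data: pd.DataFrame  # this annotation triggers the AttributeError

@task
def prepare_train_args(hp_grid: List[dict], data: pd.DataFrame) -> List[TrainArgs]:
    return [TrainArgs(hp, data) for hp in hp_grid]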

will lead to the error above

pingsutw commented 2 years ago

cc @cosmicBboy we can't use non-dataclass type in dataclass. However, you could change the type hint to data: StructuredDataset, and it should work.

cosmicBboy commented 2 years ago

we can't use non-dataclass type in dataclass

@pingsutw is this a technical issue or a philosophical one? surely we can update the dataclass transformer to be able to handle pd.DataFrame annotations as StructuredDatasets under the hood?

cosmicBboy commented 1 year ago

An alternate solution here would be to introspect on the dataclass definition and raise an informative error pointing the user to StructuredDataset.
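
A minimal sketch of what such an introspection check could look like. The function name `check_dataclass_fields`, its `unsupported` mapping, and the `FakeDataFrame` stand-in are all hypothetical and are not part of flytekit; the stand-in replaces pd.DataFrame only so the example runs without pandas.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Dict, get_type_hints


def check_dataclass_fields(cls: type, unsupported: Dict[type, str]) -> None:
    """Raise TypeError if a field of `cls` is annotated with an unsupported type.

    `unsupported` maps each rejected type to the name of a suggested replacement.
    """
    if not is_dataclass(cls):
        raise TypeError(f"{cls.__name__} is not a dataclass")
    hints = get_type_hints(cls)  # also resolves string annotations
    for field in fields(cls):
        annotation = hints[field.name]
        for bad_type, suggestion in unsupported.items():
            if annotation is bad_type:
                raise TypeError(
                    f"Field '{field.name}' of {cls.__name__} is annotated as "
                    f"{bad_type.__name__}, which cannot be used inside a "
                    f"dataclass; use {suggestion} instead."
                )


class FakeDataFrame:
    """Hypothetical stand-in for pd.DataFrame."""


@dataclass
class TrainArgs:
    hyperparameters: dict
    data: FakeDataFrame


message = ""
try:
    check_dataclass_fields(TrainArgs, {FakeDataFrame: "StructuredDataset"})
except TypeError as err:
    message = str(err)
    print(message)
```

This sketch only inspects top-level annotations; a real check would probably also need to look inside containers such as `List[...]` and nested dataclasses.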

wild-endeavor commented 1 year ago

this is a technical issue and a limitation of python dataclass/dataclass_json, not on the flyte side. but yeah, flytekit should introspect and raise a friendly error.

zychen5186 commented 10 months ago

self-assign

zychen5186 commented 8 months ago

Hi, it seems this issue has already been fixed upstream in dataclasses-json; please refer to https://github.com/lidatong/dataclasses-json/pull/389#issue-1446332429

pingsutw commented 8 months ago

@zychen5186, you can use StructuredDataset inside the dataclass.

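from dataclasses import dataclass
from dataclasses_json import dataclass_json
from flytekit.types.structured import StructuredDataset

@dataclass_json
@dataclass
class TrainArgs:
    hyperparameters: dict
    data: StructuredDataset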