flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.43k stars, 581 forks

[Feature][Flytekit Schema type extension] Vaex Dataframe plugin #701

Closed: kumare3 closed this issue 1 year ago

kumare3 commented 3 years ago

Motivation: Why do you think this is important? Flytekit should support Vaex as a pandas alternative for the FlyteSchema object. https://github.com/vaexio/vaex

Vaex performs very well on a single machine, which is usually all that most datasets need. Spark and Dask are overkill, with a lot of added complexity, for datasets of a few gigabytes. Adding Vaex, with automatic serialization and deserialization between consecutive tasks using Arrow/HDF5, would allow great Pandas, Spark, and Vaex interoperability.

Goal: What should the final outcome look like, ideally? Users should be able to retrieve Vaex DataFrames from a FlyteSchema:

def foo(f: FlyteSchema):
    df = f.open(type=vaex.DataFrame)
    ...

Vaex DataFrames should also be supported directly as a type:

def foo(f: vaex.DataFrame) -> vaex.DataFrame:
    pass

The plugin should mostly look like the default Pandas DataFrame Transformer and Reader that ships with Flytekit https://github.com/flyteorg/flytekit/blob/master/flytekit/types/schema/types_pandas.py#L88-L144

Or like the Spark plugin's support for Spark DataFrames: https://github.com/flyteorg/flytekit/blob/f0b0a7ed854950a3341df710d1f378ef3ed838ab/plugins/flytekit-spark/flytekitplugins/spark/schema.py#L13-L81
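
To make the reader/writer pattern concrete, here is a minimal, dependency-free sketch of the general idea: a registry maps a dataframe type to a paired reader and writer, and a schema type looks up the pair at runtime. Everything in it is hypothetical and for illustration only. `SchemaHandler`, `SchemaRegistry`, and `ToyFrame` below are stand-ins defined in the sketch itself, not Flytekit's or Vaex's actual classes, and JSON stands in for Arrow/HDF5. The linked Pandas and Spark files remain the reference for the real interfaces.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class SchemaHandler:
    """Pairs a reader and a writer for one dataframe type (hypothetical class)."""
    name: str
    reader: Callable[[str], Any]         # directory path -> dataframe
    writer: Callable[[Any, str], None]   # (dataframe, directory path) -> None


class SchemaRegistry:
    """Maps dataframe types to their handlers (hypothetical class)."""

    def __init__(self) -> None:
        self._handlers: Dict[type, SchemaHandler] = {}

    def register(self, df_type: type, handler: SchemaHandler) -> None:
        if df_type in self._handlers:
            raise ValueError(f"handler already registered for {df_type.__name__}")
        self._handlers[df_type] = handler

    def get(self, df_type: type) -> SchemaHandler:
        try:
            return self._handlers[df_type]
        except KeyError:
            raise TypeError(f"no handler registered for {df_type.__name__}") from None


class ToyFrame:
    """Stand-in for vaex.DataFrame so the sketch runs without Vaex installed."""

    def __init__(self, columns: Dict[str, list]) -> None:
        self.columns = columns


def write_toy(df: ToyFrame, path: str) -> None:
    # A real plugin would write Arrow/HDF5 here; JSON keeps the sketch stdlib-only.
    with open(os.path.join(path, "00000.json"), "w") as fh:
        json.dump(df.columns, fh)


def read_toy(path: str) -> ToyFrame:
    with open(os.path.join(path, "00000.json")) as fh:
        return ToyFrame(json.load(fh))


REGISTRY = SchemaRegistry()
REGISTRY.register(ToyFrame, SchemaHandler("toy-json", read_toy, write_toy))


def round_trip(df: Any) -> Any:
    """Serialize with the registered writer, then load back with the reader."""
    handler = REGISTRY.get(type(df))
    with tempfile.TemporaryDirectory() as tmp:
        handler.writer(df, tmp)
        return handler.reader(tmp)
```

In an actual Vaex plugin, the reader and writer would presumably delegate to Vaex's own I/O (for example `vaex.open` for reading and one of the DataFrame export methods for writing), and the registration would go through whatever mechanism the linked Pandas transformer uses.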

Describe alternatives you've considered NA

Flyte component

GitHub repo(s) flytekit

ryankarlos commented 1 year ago

@samhita-alla I've added PR https://github.com/flyteorg/flytekit/pull/1230 for this issue. Could this be assigned to me, please? Also, could you please add the Hacktoberfest label to my PR as well? Thanks!