Motivation: Why do you think this is important?
Flytekit should support Vaex as a pandas alternative for FlyteSchema objects.
https://github.com/vaexio/vaex
Vaex performs very well on a single machine, which is usually all that most datasets need. Spark and Dask are overkill, and add a lot of complexity, for datasets of only a few gigabytes. Adding Vaex, with automatic serialization and deserialization between consecutive tasks using Arrow/HDF5, would allow good interoperability between Pandas, Spark, and Vaex.
Goal: What should the final outcome look like, ideally?
Users should be able to retrieve Vaex DataFrames from a FlyteSchema.
@samhita-alla I've opened PR https://github.com/flyteorg/flytekit/pull/1230 for this issue. Could this be assigned to me, please?
Also, could you please add the Hacktoberfest label to my PR as well? Thanks!
Vaex DataFrames should also be supported as a type.
The plugin should look much like the default Pandas DataFrame transformer and reader that ship with Flytekit: https://github.com/flyteorg/flytekit/blob/master/flytekit/types/schema/types_pandas.py#L88-L144
Alternatively, it could follow the Spark plugin's support for Spark DataFrames: https://github.com/flyteorg/flytekit/blob/f0b0a7ed854950a3341df710d1f378ef3ed838ab/plugins/flytekit-spark/flytekitplugins/spark/schema.py#L13-L81
Describe alternatives you've considered: NA
Flyte component
GitHub repo(s) flytekit