flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[Core feature] [Flytekit] Implement Multipart upload and download for flyteplugins-vaex #3036

Open ryankarlos opened 1 year ago

ryankarlos commented 1 year ago

Motivation: Why do you think this is important?

The vaex plugin implemented in https://github.com/flyteorg/flytekit/pull/1230 currently only supports writing chunks to a single file using df.export(...), as explained in the docs. Ideally, it would also handle large dataframes by exporting the chunks in parallel to multiple parts using df.export_many(...), as demonstrated in the vaex API docs: https://vaex.readthedocs.io/en/docs/api.html#vaex.dataframe.DataFrameLocal.export_many

Goal: What should the final outcome look like, ideally?

If the dataframe size is greater than some threshold (1M rows?), then it should be serialised to blob using df.export_many(...); otherwise, default to the current implementation using df.export(...).

https://github.com/flyteorg/flytekit/blob/8ae879eb379acf2e0b4923f1b0c855d01a1f14e5/plugins/flytekit-vaex/flytekitplugins/vaex/sd_transformers.py#L29-L38
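A minimal sketch of the threshold logic described above, not the plugin's actual encoder. The function name `export_dataframe`, the `ROW_THRESHOLD` value, the file names, and the `.parquet` extension are all assumptions for illustration. It relies only on `len(df)`, `df.export(path)`, and `df.export_many(path, chunk_size=...)`, where the `export_many` path is a template formatted with the chunk index `i`, as described in the vaex API docs.

```python
import os

# Assumed threshold: roughly the default chunk size mentioned in the vaex docs.
ROW_THRESHOLD = 1_000_000


def export_dataframe(df, directory, row_threshold=ROW_THRESHOLD):
    """Write df into directory, as one file or as several parts.

    df is expected to behave like a vaex dataframe: it must support len(),
    export(path) and export_many(path_template, chunk_size=...).
    Returns "multipart" or "single" so the caller knows which path was taken.
    """
    if len(df) > row_threshold:
        # The "{i:05}" placeholder is left unformatted on purpose:
        # export_many fills it in with the chunk index of each part.
        template = os.path.join(directory, "part-{i:05}.parquet")
        df.export_many(template, chunk_size=row_threshold)
        return "multipart"

    # Small dataframe: keep the single-file behaviour.
    df.export(os.path.join(directory, "part-00000.parquet"))
    return "single"
```

Using the same `part-*.parquet` naming for both branches would let the decoder treat the single-file case as a directory with exactly one part.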

When decoding, use vaex.open_many(...) if the directory contains multiple files/parts (matched with a glob pattern), or vaex.open(...) (as currently implemented) if there is just a single part.

https://github.com/flyteorg/flytekit/blob/8ae879eb379acf2e0b4923f1b0c855d01a1f14e5/plugins/flytekit-vaex/flytekitplugins/vaex/sd_transformers.py#L51-L57
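The decoding side could be sketched roughly as follows. This is an illustration under assumptions, not the plugin's decoder: `load_dataframe`, `find_parts`, and the `*.parquet` glob pattern are hypothetical names and choices. The two opener callables are passed in so the sketch stays independent of vaex; in practice they would be `vaex.open` and `vaex.open_many`.

```python
import glob
import os


def find_parts(directory, pattern="*.parquet"):
    """Return the part files in directory, sorted so chunk order is stable."""
    return sorted(glob.glob(os.path.join(directory, pattern)))


def load_dataframe(directory, open_one, open_many, pattern="*.parquet"):
    """Open a single part with open_one, or several parts with open_many.

    open_one takes one path; open_many takes a list of paths.
    """
    parts = find_parts(directory, pattern)
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory}")
    if len(parts) == 1:
        # Single part: same behaviour as the current implementation.
        return open_one(parts[0])
    # Multiple parts: let the opener concatenate them into one dataframe.
    return open_many(parts)
```

With vaex installed, a call would look like `load_dataframe(local_dir, vaex.open, vaex.open_many)`, where `local_dir` is whatever directory the parts were downloaded to.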

Describe alternatives you've considered

N/A

Propose: Link/Inline OR Additional context

See discussion in this PR https://github.com/flyteorg/flytekit/pull/1230#discussion_r1007492873

welcome[bot] commented 1 year ago

Thank you for opening your first issue here! 🛠

ryankarlos commented 1 year ago

@samhita-alla @pingsutw As requested in https://github.com/flyteorg/flytekit/pull/1230#discussion_r1007602826, I have added this issue for implementation.

It may be worth discussing what dataframe-size threshold should trigger multipart upload (in the vaex docs, the default is to export chunks of 1M rows).

I'm happy to work on this.

ryankarlos commented 1 year ago

take

pingsutw commented 1 year ago

fixed by https://github.com/flyteorg/flytekit/pull/1230#discussion_r1007602826

ryankarlos commented 1 year ago

@pingsutw do we not want to implement this anymore? That PR does not implement multipart export and load.

github-actions[bot] commented 10 months ago

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] commented 10 months ago

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏