Open ryankarlos opened 1 year ago
@samhita-alla @pingsutw As requested in https://github.com/flyteorg/flytekit/pull/1230#discussion_r1007602826, I have opened this issue to track the implementation.
It may be worth discussing what dataframe size threshold should trigger multipart upload (in the vaex docs, the default is to export chunks of 1M rows).
I'm happy to work on this.
@pingsutw do we no longer want to implement this? That PR does not implement multipart export and load.
Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏
Motivation: Why do you think this is important?
The vaex plugin currently implemented in https://github.com/flyteorg/flytekit/pull/1230 only supports writing chunks to a single file using df.export(...), as explained in the docs. Ideally, we would also support the case where the dataframe is large and the chunks can be exported in parallel to multiple parts using df.export_many(...), as demonstrated in the vaex API docs: https://vaex.readthedocs.io/en/docs/api.html#vaex.dataframe.DataFrameLocal.export_many
Goal: What should the final outcome look like, ideally?
If the dataframe is larger than some threshold (1M rows?), then we should serialise to blob using df.export_many(...); otherwise, default to the current implementation using df.export:
https://github.com/flyteorg/flytekit/blob/8ae879eb379acf2e0b4923f1b0c855d01a1f14e5/plugins/flytekit-vaex/flytekitplugins/vaex/sd_transformers.py#L29-L38
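A rough sketch of the encoding branch, to make the proposal concrete. This is not the plugin's actual code: the names ROW_THRESHOLD, should_export_many, and encode are hypothetical, as are the file naming scheme and the choice of parquet. It assumes the threshold is measured in rows and that df.export_many accepts a path containing an {i} placeholder plus a chunk_size argument, as described in the vaex API docs.

```python
import os

# Hypothetical threshold: the issue leaves the exact value open (1M rows is floated).
ROW_THRESHOLD = 1_000_000


def should_export_many(n_rows: int, threshold: int = ROW_THRESHOLD) -> bool:
    """Return True when the dataframe is large enough to be split into parts."""
    return n_rows > threshold


def encode(df, local_dir: str, threshold: int = ROW_THRESHOLD) -> str:
    """Write df under local_dir, either as a single file or as several parts.

    df is expected to be a vaex DataFrame (anything with __len__, export,
    and export_many works for this sketch).
    """
    os.makedirs(local_dir, exist_ok=True)
    if should_export_many(len(df), threshold):
        # One file per chunk; "{i}" is filled in with the chunk index.
        # The "part-" prefix is an assumption made for this sketch.
        path = os.path.join(local_dir, "part-{i:05}.parquet")
        df.export_many(path, chunk_size=threshold)
    else:
        # Small dataframe: keep the single-file behaviour.
        df.export(os.path.join(local_dir, "00000.parquet"))
    return local_dir
```

Using the same value for the threshold and for chunk_size is only a simplification; the two could be configured separately.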
When decoding, use vaex.open_many(...) if the directory has multiple files/parts (found via a glob pattern), or vaex.open(...) (as currently implemented) if there is just a single part:
https://github.com/flyteorg/flytekit/blob/8ae879eb379acf2e0b4923f1b0c855d01a1f14e5/plugins/flytekit-vaex/flytekitplugins/vaex/sd_transformers.py#L51-L57
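The decoding side could look roughly like the sketch below. Again, this is illustrative only: find_parts and decode are hypothetical helpers, and the "*.parquet" glob pattern assumes the parts were written with that extension. It relies on the module-level functions vaex.open and vaex.open_many.

```python
import glob
import os


def find_parts(local_dir: str, pattern: str = "*.parquet") -> list:
    """Return the sorted files under local_dir that match the glob pattern."""
    return sorted(glob.glob(os.path.join(local_dir, pattern)))


def decode(local_dir: str, pattern: str = "*.parquet"):
    """Open a downloaded directory as a single vaex DataFrame."""
    import vaex  # imported lazily so find_parts works without vaex installed

    parts = find_parts(local_dir, pattern)
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {local_dir}")
    if len(parts) > 1:
        # Several parts: open them together as one concatenated dataframe.
        return vaex.open_many(parts)
    # Single part: same as the existing single-file path.
    return vaex.open(parts[0])
```

Sorting the matched paths keeps the parts in chunk order, provided the chunk index in the file name is zero-padded.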
Describe alternatives you've considered
N/A
Propose: Link/Inline OR Additional context
See discussion in this PR https://github.com/flyteorg/flytekit/pull/1230#discussion_r1007492873