delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.97k stars 365 forks source link

fix(python): constrain multipart upload size to fixed length #2606

Closed abhiaagarwal closed 1 week ago

abhiaagarwal commented 1 week ago

Description

Object stores expected fixed lengths for all multipart upload parts right up until the last part. The original logic just flushed when it exceeded the threshold. Now, it flushes when the threshold is met exclusively with the same fixed buffer, unless we're completing the transaction, in which case the last piece is allowed to be smaller.

Bumps the constant to reflect that the minimum expected size by most object stores is 5MiB. Also adds a UserWarning if a constant is specified to be less.

Also releases the GIL in more places by moving the flushing logic to a free function.

Related Issue(s)

Closes #2605

Documentation

See: MultipartUpload docs

github-actions[bot] commented 1 week ago

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

ion-elgreco commented 1 week ago

@abhiaagarwal looks good, just one remark on the buffer size warning

abhiaagarwal commented 1 week ago

@ion-elgreco don't merge yet — I found about an edge case in my logic that won't work (what happens when remaining = max_buffer_length = len(chunk)?) and I want to fix up the test to actually parameterize it + check for the warning message. I also opened an upstream issue in object_store to pull out the "fixed buffer" logic into its own struct so it can live upstream, for the future :)

rtyler commented 1 week ago

marking this as a draft because that also helps me of void hitting that big tempting green button