delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

0.18.1 reintroduces S3 multipart upload bug #2605

Closed: Zan-L closed this issue 1 week ago

Zan-L commented 1 week ago

Environment

Delta-rs version: 0.18.1

Binding: Python

Environment:


Bug

What happened: Same error as https://github.com/delta-io/delta-rs/issues/890, but this time when writing directly to S3 rather than to a non-S3 backend

What you expected to happen:

How to reproduce it: Write regular-sized data to S3 with write_deltalake()

More details: 0.18.0 works fine

abhiaagarwal commented 1 week ago

Can you give a reproducible example / any details about the size of the table / amount of writes? A log report? I wrote the modified code for 0.18.1 and it seems to have fixed other people's problems — I'll take a look at it :)

Zan-L commented 1 week ago

Hi,

Thank you for the prompt response. Unfortunately, that happened in our enterprise dev environment so I can't provide the proprietary data. I can provide two more observations though:

abhiaagarwal commented 1 week ago

Understandable if you can't share enterprise data, but even a censored error code/log would be great!

But yeah, the behavior in 0.18.1 changed so that all writes go into an in-memory buffer, which is flushed once it exceeds the configured threshold. The default threshold per the config is const DEFAULT_MAX_BUFFER_SIZE: usize = 4 * 1024 * 1024, i.e. 4 MiB. Can you try raising it by setting the max_buffer_size key in the storage_options dict when loading a table?

object_store documentation says it should be at least 5 MiB, so this value should probably be tuned on our side anyways.
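A minimal sketch of the workaround described above, in Python since that is the binding in use. It assumes the max_buffer_size key is passed as a string like other storage options (the thread does not say which type is expected). The helper names with_buffer_size and write_with_larger_buffer are hypothetical and not part of deltalake.

```python
# S3 rejects non-final multipart parts smaller than 5 MiB.
MIN_S3_PART_SIZE = 5 * 1024 * 1024


def with_buffer_size(storage_options=None, size=MIN_S3_PART_SIZE):
    """Return a copy of storage_options with max_buffer_size set.

    Assumption: the value is passed as a string, like other storage options.
    """
    if size < MIN_S3_PART_SIZE:
        raise ValueError(
            f"max_buffer_size {size} is below the S3 minimum of {MIN_S3_PART_SIZE}"
        )
    options = dict(storage_options or {})  # copy so the caller's dict is untouched
    options["max_buffer_size"] = str(size)
    return options


def write_with_larger_buffer(table_uri, data, storage_options=None):
    """Append data to a Delta table using the larger buffer threshold."""
    # Imported lazily so with_buffer_size() works without deltalake installed.
    from deltalake import write_deltalake

    write_deltalake(
        table_uri,
        data,
        mode="append",
        storage_options=with_buffer_size(storage_options),
    )
```

Any existing credentials or region settings in storage_options are kept as they are; only the buffer threshold is added or overridden.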

Zan-L commented 1 week ago

That is the root cause. I did a test write with 5*2**20 and it worked this time. Can you push another release with this fix?

Btw, the error message from before:

OSError: Generic S3 error: Error performing complete multipart request: Client error with status 400 Bad Request: <Error><Code>EntityTooSmall</Code><Message>Your proposed upload is smaller than the minimum allowed size</Message><ProposedSize>4328982</ProposedSize><MinSizeAllowed>5242880</MinSizeAllowed><PartNumber>1</PartNumber>
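For reference, the sizes in that message can be pulled out and compared with a few lines of Python. This is only a sketch: the function name part_size_shortfall is hypothetical, and a regular expression is used instead of an XML parser because the message as pasted is cut off before the closing tag.

```python
import re

ERROR_TEXT = (
    "OSError: Generic S3 error: Error performing complete multipart request: "
    "Client error with status 400 Bad Request: <Error><Code>EntityTooSmall</Code>"
    "<Message>Your proposed upload is smaller than the minimum allowed size</Message>"
    "<ProposedSize>4328982</ProposedSize><MinSizeAllowed>5242880</MinSizeAllowed>"
    "<PartNumber>1</PartNumber>"
)


def part_size_shortfall(message):
    """Return how many bytes the rejected part is below the minimum, or None."""
    proposed = re.search(r"<ProposedSize>(\d+)</ProposedSize>", message)
    minimum = re.search(r"<MinSizeAllowed>(\d+)</MinSizeAllowed>", message)
    if proposed is None or minimum is None:
        return None
    return int(minimum.group(1)) - int(proposed.group(1))


shortfall = part_size_shortfall(ERROR_TEXT)
print(shortfall)  # 913898
```

The rejected part (4328982 bytes) is just over the 4 MiB default of 4194304 bytes and below the 5242880-byte minimum, which equals 5*2**20. That is consistent with the buffer being flushed as a multipart part right after crossing the 4 MiB threshold.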

abhiaagarwal commented 1 week ago

I don't control the release cycles, but you can compile from source with that fix! The current version will also work if you set that option.