delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

documentation: concurrent writes for non-S3 backends #2556

Closed inigohidalgo closed 1 month ago

inigohidalgo commented 1 month ago

Currently the documentation has a section explaining that a locking mechanism is needed for S3 to enable concurrent writes.

https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/

There are various mentions of concurrency throughout the docs, but when I read through them a while ago I came away with the impression that "concurrent writing is only supported on S3, and you need a DynamoDB locking mechanism". In reality, concurrent writing is supported by default on at least one backend (#2069) without needing that locking provider.

This is probably just my own lack of understanding of the Delta protocol, but I think it would be good to make it clearer in the documentation that concurrency is supported by default and that only S3 needs the locking mechanism.
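For context on why most backends need no extra lock: a Delta commit is the creation of the next numbered file in `_delta_log`, and a store that can atomically refuse to overwrite an existing file already provides the mutual exclusion. The following is a minimal sketch of that idea using only the local filesystem; `try_commit` and `commit_with_retry` are hypothetical helper names, not delta-rs APIs, and a real writer would also check for logical conflicts before retrying.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to create the commit file for `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if the file exists: a local stand-in
        # for the "put if absent" primitive an object store would provide.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


def commit_with_retry(log_dir: str, actions: list, max_attempts: int = 5) -> int:
    """Optimistic loop: pick the next version, try to commit, retry on conflict."""
    for _ in range(max_attempts):
        existing = [name for name in os.listdir(log_dir) if name.endswith(".json")]
        version = len(existing)  # assumes versions are contiguous from 0
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many conflicting commits")


# Two writers race for version 0: exactly one creation succeeds.
log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "part-1.parquet"}}])
# The losing writer retries and lands on the next version.
retried = commit_with_retry(log_dir, [{"add": {"path": "part-1.parquet"}}])
```

S3 is the odd one out in this thread because, as discussed here, it needs an external locking provider to get the same guarantee.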

ion-elgreco commented 1 month ago

@inigohidalgo feel free to open a PR to make a change to the docs, contributions are always welcome! :)

inigohidalgo commented 1 month ago

Sounds good. Is it safe to assume that all the backends listed here other than AWS support this by default?

I will probably just reword the write_deltalake docstring as that is what initially tripped me up, and add a small note at the start of this page https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/

wjones127 commented 1 month ago

> Is it safe to assume that all the backends listed here other than AWS support this by default?

Yup. Though IIRC, Minio, Cloudflare, and other S3-compatible stores will have the same issue, even though we actually could enable it for some of them.

ion-elgreco commented 1 month ago

@wjones127 Cloudflare R2 actually supports copy-if-not-exists via custom headers, which we are able to pass through. But that's the only exception among S3 implementations afaik
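As a rough illustration of the two setups being contrasted, the storage options might look like the sketch below. The key names and the R2 header value are recalled from the delta-rs and `object_store` documentation and should be treated as assumptions to verify against the installed version; the DynamoDB table name is a placeholder.

```python
# AWS S3: concurrent writers coordinate through a DynamoDB lock table.
# "delta_log" is a placeholder table name.
s3_storage_options = {
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DELTA_DYNAMO_TABLE_NAME": "delta_log",
}

# Cloudflare R2: no lock table; the store is asked to fail the copy if the
# destination already exists, using a custom request header.
r2_storage_options = {
    "copy_if_not_exists": "header: cf-copy-destination-if-none-match: *",
}

# Either dict would then be passed to the writer, for example:
#   from deltalake import write_deltalake
#   write_deltalake(table_uri, data, storage_options=r2_storage_options)
```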

wjones127 commented 1 month ago

> Cloudflare R2 actually supports copy if not exists with custom headers, which we are able to pass through

Ah cool. I know R2 and Minio support using custom headers, but didn't know we had already implemented the proper pass-through for R2. Do we have support for Minio as well then?

ion-elgreco commented 1 month ago

> > Cloudflare R2 actually supports copy if not exists with custom headers, which we are able to pass through
>
> Ah cool. I know R2 and Minio support using custom headers, but didn't know we had already implemented the proper pass through for R2. Do we have support for Minio as well then?

Hmm, I didn't know Minio supported custom headers. Then I guess it should work as well (but I can't confirm)