Closed mpetri closed 7 months ago
Yes, we recently added an option in the S3 backend called AWS_S3_ALLOW_UNSAFE_RENAME that allows using the S3 storage backend without any lock configured. I haven't tested yet whether it works without compiling the dynamodb dependencies though; I'll need to check on that.
@mpetri - just had a quick scan of our code, and you should be able to pass in a custom object store using the DeltaTableBuilder option with_object_store; you could then pull in object_store as a separate crate with the AWS feature. Unfortunately you would have to write a thin wrapper, since we are calling the *_if_not_exists methods, which raise "not implemented" in the object_store crate.
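A minimal, std-only sketch of what that thin wrapper could look like. This is not the real object_store::ObjectStore trait (which has many more methods to forward); the Store trait, UnsafeRenameStore, and MemStore names here are hypothetical stand-ins to illustrate the delegation idea for a known single-writer setup:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical, trimmed-down stand-in for object_store::ObjectStore.
trait Store {
    fn put(&self, path: &str, data: Vec<u8>);
    fn get(&self, path: &str) -> Option<Vec<u8>>;
    fn rename(&self, from: &str, to: &str) -> Result<(), String>;
    // The S3 backend raises "not implemented" for this.
    fn rename_if_not_exists(&self, from: &str, to: &str) -> Result<(), String>;
}

// Thin wrapper: with exactly one writer, downgrade the atomic
// rename to a plain (unsafe) rename and forward everything else.
struct UnsafeRenameStore<S: Store>(S);

impl<S: Store> Store for UnsafeRenameStore<S> {
    fn put(&self, path: &str, data: Vec<u8>) { self.0.put(path, data) }
    fn get(&self, path: &str) -> Option<Vec<u8>> { self.0.get(path) }
    fn rename(&self, from: &str, to: &str) -> Result<(), String> {
        self.0.rename(from, to)
    }
    fn rename_if_not_exists(&self, from: &str, to: &str) -> Result<(), String> {
        // Only safe because we promised there is exactly one writer.
        self.0.rename(from, to)
    }
}

// In-memory backend standing in for the S3 store.
#[derive(Default)]
struct MemStore(Mutex<HashMap<String, Vec<u8>>>);

impl Store for MemStore {
    fn put(&self, path: &str, data: Vec<u8>) {
        self.0.lock().unwrap().insert(path.to_string(), data);
    }
    fn get(&self, path: &str) -> Option<Vec<u8>> {
        self.0.lock().unwrap().get(path).cloned()
    }
    fn rename(&self, from: &str, to: &str) -> Result<(), String> {
        let mut m = self.0.lock().unwrap();
        let v = m.remove(from).ok_or("source missing")?;
        m.insert(to.to_string(), v);
        Ok(())
    }
    fn rename_if_not_exists(&self, _: &str, _: &str) -> Result<(), String> {
        Err("not implemented".to_string()) // mirrors the S3 backend
    }
}
```

The wrapped store would then be handed to DeltaTableBuilder via with_object_store in place of the default S3 store.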
That said, we should probably look into providing a feature that allows compiling for single-writer scenarios.
I'm currently blocked on the other bug I reported (can't compile the crate with s3 support), so I might give this a try, thanks.
Should I keep this issue open? It seems like a valid request.
Yes, please keep it open.
Giving a bump to this FR as I am using an S3-compatible object store (Cloudflare R2) and would like some way to support concurrent writes across processes - currently this is managed via a single process and a Mutex.
Perhaps we could replace the locking implementation with a trait. Similar to tokio::sync::Mutex, it would likely need to be async, since the lock would span across .await points when using an external service for locking - for example etcd, Cloudflare Durable Objects, ZooKeeper and such.
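A rough sketch of what such a pluggable trait could look like. All names here are hypothetical, and the methods are shown synchronous for brevity; a real version would return futures (hence the tokio::sync::Mutex comparison, since the lease has to survive .await points while talking to an external lock service):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Hypothetical pluggable lock-client trait. A real version would use
// async fns so implementations could call out to DynamoDB, etcd,
// ZooKeeper, Durable Objects, etc., holding the lease across .awaits.
trait LockClient {
    /// Try to acquire a lease on `key`; returns false if already held.
    fn try_acquire(&self, key: &str) -> bool;
    /// Release a previously acquired lease.
    fn release(&self, key: &str);
}

// In-process implementation for single-process deployments --
// essentially the "one process plus a Mutex" setup described above.
#[derive(Default)]
struct InProcessLock {
    held: Mutex<HashSet<String>>,
}

impl LockClient for InProcessLock {
    fn try_acquire(&self, key: &str) -> bool {
        // HashSet::insert returns true only if the key was not present.
        self.held.lock().unwrap().insert(key.to_string())
    }
    fn release(&self, key: &str) {
        self.held.lock().unwrap().remove(key);
    }
}
```

The existing DynamoDB lock would then be just one implementation of the trait, and users on other S3-compatible providers could plug in whatever coordination service handles their provider's quirks.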
Do you care about something that works across S3-compatible APIs? Or just about R2?
If you care specifically about R2, I think the more optimal solution is to support it through the object store rather than have some separate locking mechanism. Unlike S3, R2 has support for conditional PutObject (docs). I think that could be used to implement a workable rename_if_not_exist operation (or maybe the same headers are supported in Copy / Replace operations?).
(Though also note that R2 doesn't work well right now because their multi-part upload doesn't seem to be compatible with S3.)
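To spell out why a conditional PutObject is enough: committing Delta log version N only has to succeed for the one writer that creates the destination key first, which is exactly put-if-absent semantics. A std-only, in-memory sketch of that semantics (CasStore and put_if_absent are hypothetical names; the condition models an `If-None-Match: *`-style precondition, not a real R2 client):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::Mutex;

// In-memory stand-in for an object store with conditional puts.
#[derive(Default)]
struct CasStore(Mutex<HashMap<String, Vec<u8>>>);

impl CasStore {
    // Models PutObject with an "only if the key does not exist yet"
    // precondition. When two writers race to commit the same Delta log
    // version, exactly one sees Ok(()); the loser retries at N+1.
    fn put_if_absent(&self, key: &str, data: Vec<u8>) -> Result<(), String> {
        match self.0.lock().unwrap().entry(key.to_string()) {
            Entry::Vacant(e) => {
                e.insert(data);
                Ok(())
            }
            Entry::Occupied(_) => Err("412 Precondition Failed".to_string()),
        }
    }
}
```

With this primitive available server-side, no external lock client is needed at all, which is the same reason GCS and Azure don't need one.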
If S3 ever comes out with support for atomic rename_if_not_exist or copy_if_not_exist, then the whole lock-client thing will be moot. GCS and Azure Blob storage don't need any locking client because they support these operations out of the box.
I am mostly just interested in R2 - let me check with the R2 team to see if CopyObject supports those conditional headers.
I figured switching to a trait would "plug in" better to the existing locking that uses DynamoDB, but I am fine with either approach. S3-compatible providers all have their own quirks, so that was the most straightforward approach that lets the user deal with those.
(Though also note that R2 doesn't work well right now because their multi-part upload doesn't seem to be compatible with S3.)
Kind of an aside, but can you send me details on the issue you are referencing there? I am on the Slack and would be interested in hearing it to provide the feedback to the R2 team.
@cmackenzie1 I need confirmation from the R2 team, but the implementation in object-store-rs is based on the one in Arrow C++, and I think there's an issue where they don't support non-equal part sizes: https://github.com/apache/arrow/issues/34363#issuecomment-1500972227
I followed up with the R2 team, and they confirmed it is still the case that their S3-compatible multipart uploads require all parts to be the same size (except the last).
For the CopyObject operation, they do support the x-amz-copy-source-* headers listed here.
Description
In the Rust crate, is it possible to support S3-based Delta Lakes without needing to pull in and use the DynamoDB lock client? I understand the need for the lock client (after reading the paper), but if I know I will only ever have one writer for the Delta Lake, I don't really need the locking mechanism.
Could I achieve this by manually creating an object store (with the S3 backend) and passing it to deltalake?