Open raghav-vk opened 2 years ago
Thanks, @raghav-vk - I appreciate you adding this issue.
MinIO now implements optimistic concurrency via If-Match support in PutObject API calls.
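For anyone unfamiliar with conditional writes, here is a minimal sketch of the semantics using a toy in-memory store. It is not a MinIO or S3 client; `InMemoryBucket`, `PreconditionFailed`, and the object keys are hypothetical names made up for illustration. `If-Match` and `If-None-Match: *` are standard HTTP preconditions; whether a particular object store honors each of them on PUT is an assumption to verify against that store's documentation.

```python
import hashlib


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 Precondition Failed response."""


class InMemoryBucket:
    """Toy object store used only to illustrate conditional PUT semantics."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def etag(self, key):
        entry = self._objects.get(key)
        return entry[0] if entry else None

    def put(self, key, body, if_match=None, if_none_match=None):
        current = self.etag(key)
        # If-Match: write only if the stored ETag equals the one we last saw.
        if if_match is not None and current != if_match:
            raise PreconditionFailed(key)
        # If-None-Match: * means "write only if the key does not exist yet".
        if if_none_match == "*" and current is not None:
            raise PreconditionFailed(key)
        new_etag = hashlib.md5(body).hexdigest()
        self._objects[key] = (new_etag, body)
        return new_etag


bucket = InMemoryBucket()

# Two writers race to create the same commit file; only the first succeeds.
first_etag = bucket.put("_delta_log/0.json", b"commit-a", if_none_match="*")
try:
    bucket.put("_delta_log/0.json", b"commit-b", if_none_match="*")
    second_writer_won = True
except PreconditionFailed:
    second_writer_won = False

# Compare-and-swap update: succeeds with the current ETag, fails with a stale one.
new_etag = bucket.put("pointer", b"v1")
newer_etag = bucket.put("pointer", b"v2", if_match=new_etag)
try:
    bucket.put("pointer", b"v3", if_match=new_etag)  # stale ETag
    stale_update_won = True
except PreconditionFailed:
    stale_update_won = False
```

A real store has to perform the check and the write atomically on the server side; the toy class above is single-threaded and only shows which requests should be rejected.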
Have you seen the S3DynamoDBLogStore?
It extends the BaseExternalLogStore which acts as an abstract parent that allows children to implement the mutual exclusion using some external service.
Does this seem like the direction you wish to go? If not, could you provide a more concrete example/ask?
I'm also running into this issue. We provide both cloud and on-premise deployments, using S3 for cloud and MinIO for on-premise. Obviously we can't use DynamoDB on-premise.
@scottsand-db that type of solution would satisfy my requirements. I see in the original PRs (https://github.com/delta-io/delta/pull/339, https://github.com/delta-io/delta/pull/1044) there's reference to a Zookeeper implementation? That or any other open source option would be ideal.
@orionmoana would you be interested in helping to contribute a Zookeeper implementation of the BaseExternalLogStore?
Potentially, though I can't commit to it right now.
As I mentioned, the DynamoDB log store PRs both mention a Zookeeper implementation already existing, even if not complete. Is that available as a starting point?
ScyllaDB's Alternator was mentioned as a DynamoDB-compatible alternative in https://github.com/delta-io/delta/issues/1336#issuecomment-1446963918. From a very cursory glance at the code, it looks like all that is required may be adding a setEndpoint and corresponding config to https://github.com/delta-io/delta/blob/master/storage-s3-dynamodb/src/main/java/io/delta/storage/S3DynamoDBLogStore.java#L327.
Feature request
Create a LogStore that uses S3 APIs but leverages a cloud-agnostic, containerized solution such as Postgres.
Overview
While Delta Lake has supported concurrent reads from multiple clusters since its inception, multi-cluster writes have been limited specifically on S3. S3 lacks "put-if-absent" consistency guarantees, and MinIO has standardized on S3 as the API for multi-cloud object storage deployments. Thus, to guarantee ACID transactions on S3 for multi-cloud object storage, all concurrent writes must originate from the same Apache Spark™ driver. MinIO has examined this use case to see whether it can support "reject-if-exists" in the PUT API. However, since MinIO supports active-active replication, doing so would introduce cross-cluster checks, which would heavily impact performance and deviate from being a drop-in replacement for AWS S3. This request is specifically to create a LogStore that uses S3 APIs but leverages a generic, cloud-agnostic, containerized solution.
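To make the ask a bit more concrete, here is a rough sketch of how a relational database could supply the missing "put-if-absent" guarantee: each commit version is claimed by inserting a row whose primary key is (table path, version), so a second writer attempting the same version is rejected by the database. This is only an illustration under stated assumptions, not a proposed implementation. It uses Python's built-in `sqlite3` as a stand-in for Postgres, and the table name `delta_commits`, the function names, and the example paths are all hypothetical.

```python
import sqlite3


def open_commit_table(conn):
    # Hypothetical schema: one row per (table, version) that has been claimed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS delta_commits ("
        " table_path TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " writer TEXT NOT NULL,"
        " PRIMARY KEY (table_path, version))"
    )


def try_claim_version(conn, table_path, version, writer):
    """Return True if this writer claimed the version, False if it was taken."""
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO delta_commits (table_path, version, writer)"
                " VALUES (?, ?, ?)",
                (table_path, version, writer),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key violation: another writer already owns this version.
        return False


conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
open_commit_table(conn)

results = [
    try_claim_version(conn, "s3a://bucket/events", 7, "driver-a"),
    try_claim_version(conn, "s3a://bucket/events", 7, "driver-b"),  # loses race
    try_claim_version(conn, "s3a://bucket/events", 8, "driver-b"),  # retries
]
winner = conn.execute(
    "SELECT writer FROM delta_commits WHERE table_path = ? AND version = ?",
    ("s3a://bucket/events", 7),
).fetchone()[0]
```

In this sketch only the writer that wins the insert would go on to write the corresponding commit file through the S3 API; the loser would re-read the log and retry with the next version. Recovery when a writer dies between claiming a version and writing the file is left out here and would need to be handled by a real LogStore.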
Motivation
This feature request would enable Delta Lake usage with multi-cloud, S3-compatible object storage, irrespective of where it is deployed. It would remove the restriction that limits ACID transaction capability to AWS S3 and make it available to anyone in any location, whether in public cloud, private on-premise, or edge deployments.
Further details
The current implementation of the S3 LogStore limits applications that need ACID transactions with Delta to AWS S3 only. Completing this feature would make Delta usable with MinIO in multi-cloud deployments.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute to the implementation of this feature?