delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request] Create a LogStore that uses S3 APIs but relies on a cloud-agnostic, containerized solution like Postgres #1441

Open raghav-vk opened 2 years ago

raghav-vk commented 2 years ago

Feature request

Create a LogStore that uses S3 APIs but relies on a cloud-agnostic, containerized solution like Postgres.

Overview

While Delta Lake has supported concurrent reads from multiple clusters since its inception, multi-cluster writes have been limited, specifically on S3. S3 lacks "put-if-absent" consistency guarantees, and MinIO has standardized on S3 as the API for multi-cloud object storage deployments. Thus, to guarantee ACID transactions on S3 for multi-cloud object storage, all concurrent writes must originate from the same Apache Spark™ driver. MinIO has examined this use case to see whether we can support "reject-if-exists" semantics in the PUT API. However, because MinIO supports active-active replication, this would introduce cross-cluster checks, which would heavily impact performance and deviate from being a drop-in replacement for AWS S3. This request is specifically to create a LogStore that uses S3 APIs but relies on a generic, cloud-agnostic, containerized solution.

Motivation

This feature would enable Delta Lake usage with S3-compatible multi-cloud object storage, irrespective of where it is deployed. It would remove the restriction that limits ACID transaction capability to AWS S3 and make that capability available to anyone, whether in public cloud, private on-premises, or edge deployments.

Further details

The current S3 LogStore implementation limits applications that need ACID transactions on Delta to AWS S3. Completing this feature would make Delta usable with MinIO in multi-cloud deployments.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute to the implementation of this feature?

dennyglee commented 2 years ago

Thanks, @raghav-vk - I appreciate you adding this issue.

harshavardhana commented 1 year ago

MinIO now implements optimistic concurrency via If-Match support in PutObject API calls.
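To make the If-Match idea concrete, here is a minimal in-memory sketch of ETag-based optimistic concurrency. It models only the general semantics of a conditional write (the write succeeds only if the object's current ETag equals the one the writer last read). The class `InMemoryObjectStore`, the exception `PreconditionFailed`, and the key `pointer` are hypothetical names for illustration. This is not MinIO's implementation or client API, and this thread does not establish whether If-Match alone is sufficient for a Delta LogStore.

```python
import hashlib


class PreconditionFailed(Exception):
    """Raised when a conditional write's precondition does not hold."""


class InMemoryObjectStore:
    """Toy object store (hypothetical) that models ETag-based conditional writes."""

    def __init__(self):
        self._objects = {}  # key -> (data, etag)

    def get(self, key):
        """Return (data, etag) for an existing key."""
        return self._objects[key]

    def put(self, key, data, if_match=None):
        """Write data; if if_match is given, require it to equal the current ETag."""
        current = self._objects.get(key)
        if if_match is not None:
            if current is None or current[1] != if_match:
                raise PreconditionFailed(key)
        etag = hashlib.sha256(data).hexdigest()
        self._objects[key] = (data, etag)
        return etag


store = InMemoryObjectStore()
etag_v0 = store.put("pointer", b"version=0")

# Writer A read etag_v0 and updates the object conditionally: this succeeds.
etag_v1 = store.put("pointer", b"version=1", if_match=etag_v0)

# Writer B also read etag_v0, but the object has changed since, so B is rejected.
try:
    store.put("pointer", b"version=1-from-B", if_match=etag_v0)
    writer_b_succeeded = True
except PreconditionFailed:
    writer_b_succeeded = False
```

In this sketch the losing writer gets an error instead of silently overwriting the winner, which is the property a multi-writer commit protocol needs.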

scottsand-db commented 1 year ago

Have you seen the S3DynamoDBLogStore?

It extends the BaseExternalLogStore which acts as an abstract parent that allows children to implement the mutual exclusion using some external service.
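As a rough illustration of mutual exclusion through an external service, the sketch below uses a relational table with a primary key as a put-if-absent registry. It uses Python's built-in `sqlite3` as a stand-in for Postgres. The table name `delta_commits`, the class `SqlCommitRegistry`, and the table path are hypothetical. This is a simplified sketch of the idea only, not the actual `BaseExternalLogStore` contract or its commit algorithm.

```python
import sqlite3


class CommitAlreadyExists(Exception):
    """Raised when another writer has already claimed this commit file."""


class SqlCommitRegistry:
    """Hypothetical put-if-absent registry backed by a SQL primary key."""

    def __init__(self, conn):
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS delta_commits ("
            " table_path TEXT NOT NULL,"
            " file_name TEXT NOT NULL,"
            " PRIMARY KEY (table_path, file_name))"
        )

    def put_if_absent(self, table_path, file_name):
        """Claim a commit file name; fail if it has already been claimed."""
        try:
            with self._conn:  # commits on success, rolls back on error
                self._conn.execute(
                    "INSERT INTO delta_commits (table_path, file_name)"
                    " VALUES (?, ?)",
                    (table_path, file_name),
                )
        except sqlite3.IntegrityError as exc:
            raise CommitAlreadyExists(f"{table_path}/{file_name}") from exc


registry = SqlCommitRegistry(sqlite3.connect(":memory:"))
table = "s3a://bucket/events"  # hypothetical table path

# The first writer claims version 1 of the log.
registry.put_if_absent(table, "00000000000000000001.json")

# A second writer racing for the same version is rejected by the primary key.
try:
    registry.put_if_absent(table, "00000000000000000001.json")
    second_writer_won = True
except CommitAlreadyExists:
    second_writer_won = False
```

The database's uniqueness constraint does the arbitration here, so the object store itself does not need put-if-absent semantics. A real implementation would also have to handle writers that crash between claiming an entry and writing the log file, which this sketch ignores.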

Does this seem like the direction you wish to go? If not, could you perhaps provide a more concrete example or ask?

orionmoana commented 1 year ago

I'm also running into this issue. We provide both cloud and on-premise deployments, using S3 for cloud and MinIO for on-premise. Obviously, we can't use DynamoDB on-premise.

@scottsand-db that type of solution would satisfy my requirements. I see in the original PRs (https://github.com/delta-io/delta/pull/339, https://github.com/delta-io/delta/pull/1044) there's a reference to a ZooKeeper implementation? That or any other open source option would be ideal.

scottsand-db commented 1 year ago

@orionmoana would you be interested in helping to contribute a ZooKeeper implementation of the BaseExternalLogStore?

orionmoana commented 1 year ago

Potentially, though I can't commit to it right now.

As I mentioned, the DynamoDB log store PRs both mention a ZooKeeper implementation already existing, even if not complete. Is that available as a starting point?

chgl commented 11 months ago

ScyllaDB's Alternator was mentioned as a DynamoDB-compatible alternative in https://github.com/delta-io/delta/issues/1336#issuecomment-1446963918. From a very cursory glance at the code, it looks like all that is required may be adding a setEndpoint and corresponding config to https://github.com/delta-io/delta/blob/master/storage-s3-dynamodb/src/main/java/io/delta/storage/S3DynamoDBLogStore.java#L327.