apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

any plan for Iceberg Table on S3? #1468

Closed Lindayangyy closed 3 months ago

Lindayangyy commented 3 years ago

I'm new to Apache Iceberg. We are looking for an Iceberg table or warehouse (catalog) implementation on top of S3. Is that possible without any reference to Hive and HDFS (Hadoop)? The current implementation seems tightly coupled with Hive and Hadoop.

RussellSpitzer commented 3 years ago

You can use it with S3 using only the Hadoop client libraries; you don't actually need a Hadoop cluster or HDFS.

HeartSaVioR commented 3 years ago

Supporting S3 requires Hive, because of one of S3's characteristics: eventual consistency. I see the OSS version of Delta Lake solved it in a different way, but that approach is pretty limited. (It assumes concurrent writes to S3 only happen in "a" single Spark driver. https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/storage/S3SingleDriverLogStore.scala)

aokolnychyi commented 3 years ago

Iceberg works reliably with S3 even if the same table is accessed via multiple clusters and query engines. Using Iceberg requires a catalog that can atomically swap a pointer to the metadata file. This can be done using a compare-and-swap or a lock/unlock API. Iceberg contains a built-in implementation that uses the Hive Metastore to work with S3 reliably (lock/unlock). Anyone could easily build an integration for any catalog. For example, one may have a Cassandra-based catalog and use compare-and-swap to commit new table versions. That will be enough to work with S3 reliably.
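
A minimal sketch of the compare-and-swap commit described above, assuming an in-memory stand-in for the catalog. `InMemoryCatalog`, `CommitFailed`, `commit`, and the `s3://bucket/...` paths are hypothetical names for illustration, not part of the Iceberg API.

```python
import threading


class CommitFailed(Exception):
    """Raised when the metadata pointer changed since the writer read it."""


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog service (not an Iceberg class)."""

    def __init__(self):
        self._pointers = {}            # table name -> metadata file location
        self._lock = threading.Lock()  # makes check-and-set a single atomic step

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Set the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit(catalog, table, expected, new):
    """Publish a new table version, or fail if another writer won the race."""
    if not catalog.compare_and_swap(table, expected, new):
        raise CommitFailed(f"{table}: pointer is no longer {expected!r}")
```

The metadata files themselves can be written to S3 at any time under unique names; only the pointer swap has to be atomic. A writer that gets `CommitFailed` would re-read the current pointer, rebase its changes, and try again.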

jacques-n commented 3 years ago

We've been working on a non-Hive way to provide this functionality and plan on contributing it to the project within the next two weeks.

Lindayangyy commented 3 years ago

That will be awesome, can't wait to see it. Thank you - jacques-n!

Lindayangyy commented 3 years ago

Thanks for all the responses as alternatives. All answers are great!

HeartSaVioR commented 3 years ago

That sounds great! Assuming it still needs to do CAS against external storage (I'd be really curious if it doesn't rely on external storage), which storage is that? Is it one of the AWS services? If so, even better, as there would be no external dependency outside of AWS, and since we assume S3 we are already locked in.

jacques-n commented 3 years ago

We're doing something pluggable but the default implementation is on top of DynamoDB.

ismailsimsek commented 3 years ago

Is it possible to write a JDBC-based catalog? That could unlock many catalog options.
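
A rough sketch of how a relational database could provide the atomic pointer swap, using Python's built-in `sqlite3` in place of a JDBC connection. The `table_pointers` schema and the function names are assumptions made for illustration; this is not an existing Iceberg catalog.

```python
import sqlite3


def create_catalog(conn):
    """Create the (hypothetical) pointer table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_pointers ("
        "table_name TEXT PRIMARY KEY, "
        "metadata_location TEXT NOT NULL)"
    )
    conn.commit()


def register_table(conn, table_name, metadata_location):
    """Insert the first pointer; the primary key rejects duplicate tables."""
    conn.execute(
        "INSERT INTO table_pointers (table_name, metadata_location) VALUES (?, ?)",
        (table_name, metadata_location),
    )
    conn.commit()


def swap_pointer(conn, table_name, expected, new):
    """Conditional UPDATE: succeeds only if the pointer still equals `expected`."""
    cur = conn.execute(
        "UPDATE table_pointers SET metadata_location = ? "
        "WHERE table_name = ? AND metadata_location = ?",
        (new, table_name, expected),
    )
    conn.commit()
    # Exactly one row changes when the swap wins; zero rows means a lost race.
    return cur.rowcount == 1
```

The conditional `UPDATE ... WHERE metadata_location = ?` plays the role of compare-and-swap: the database applies the check and the write as one statement, so two concurrent writers cannot both succeed from the same starting version.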

kbendick commented 3 years ago

> We're doing something pluggable but the default implementation is on top of DynamoDB.

That's a good idea. I know that AWS Glue is backed by DynamoDB, so if you can make a catalog using Dynamo, then possibly the AWS team can implement the atomic swap in Glue. If I'm not mistaken, you'd need to use either read / write consistency or possibly a DynamoDB versioned object.

Looking forward to seeing the DynamoDB catalog as I assume many companies looking to write to S3 are also likely using DynamoDB. I know that my company uses DynamoDB a ton so this would be a great work around until there is Glue Catalog support (which I've been giving some thought to myself).
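
One possible way to get an atomic swap on DynamoDB is a conditional write. The sketch below only builds the request for the DynamoDB `UpdateItem` operation; the item layout (`table_name` as key, `metadata_location` as attribute) and the function name are assumptions for illustration, and nothing here is confirmed to match how the implementations discussed in this thread actually work.

```python
def build_swap_request(dynamo_table, iceberg_table, expected_location, new_location):
    """Build UpdateItem parameters that swap the pointer only if unchanged.

    The key and attribute names are hypothetical; adapt them to the real schema.
    """
    return {
        "TableName": dynamo_table,
        "Key": {"table_name": {"S": iceberg_table}},
        # Write the new metadata location...
        "UpdateExpression": "SET metadata_location = :new",
        # ...but only if the stored location is still the one this writer read.
        "ConditionExpression": "metadata_location = :expected",
        "ExpressionAttributeValues": {
            ":new": {"S": new_location},
            ":expected": {"S": expected_location},
        },
    }
```

With boto3 installed, the result could be passed as `boto3.client("dynamodb").update_item(**request)`. When the condition does not hold, DynamoDB rejects the write with a `ConditionalCheckFailedException`, which a writer would treat as a lost commit race.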

jackye1995 commented 3 years ago

Hi @jacques-n, this is Jack from AWS. We are planning to introduce a new iceberg-aws module, and we do plan to offer a Glue + DynamoDB implementation for Catalog and TableOperations. Since you say you already have something working, let's sync after you have a PR up and see what the best way is to ship this all together 😃

jacques-n commented 3 years ago

Hey guys, we just posted more information on the new stuff we've been building for Iceberg + DynamoDB. You can check it out here: https://projectnessie.org/

We'll have a PR up against Iceberg shortly to contribute the Iceberg integrations: https://github.com/projectnessie/nessie/tree/main/clients/iceberg

RussellSpitzer commented 3 years ago

Very cool!


jackye1995 commented 3 years ago

I just sent out a PR for AWS Glue support. With this update you can use HiveCatalog without the need to set up any Hive infrastructure and build your data lake on top of S3. #1608

jackye1995 commented 3 years ago

For anyone new to this issue, I think we have summarized all the information in https://iceberg.apache.org/aws/, and we can close this issue. @Lindayangyy

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 3 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'