cuebook / cuelake

Use SQL to build ELT pipelines on a data lakehouse.
https://cuelake.cuebook.ai
Apache License 2.0
283 stars 28 forks

Can we use MinIO as S3-compatible storage for Apache Iceberg? #6

Open zainal-abidin-assegaf opened 3 years ago

zainal-abidin-assegaf commented 3 years ago

Is your feature request related to a problem? Please describe.
Can we use MinIO as S3-compatible storage for Apache Iceberg?

Describe the solution you'd like
Can we use MinIO as S3-compatible storage for Apache Iceberg?

Describe alternatives you've considered
If we can use MinIO, we need the steps to configure MinIO with Cuelake.

Additional context
Can we use MinIO as S3-compatible storage for Apache Iceberg?

vikrantcue commented 3 years ago

We have not used MinIO yet, but its documentation says it is compatible with the S3 APIs and can be configured for Spark applications, so it should work with Cuelake as well.

Steps for custom configurations will be added soon; we are still working out the best way to support them.

We will keep this issue open until the documentation covers custom configurations like this one.

zainal-abidin-assegaf commented 3 years ago

If we can use MinIO with Cuelake, what about AWS Glue? Can we use Hive Metastore?

zainal-abidin-assegaf commented 3 years ago

We are still confused about how Zeppelin connects to the Spark cluster. Is deploying Spark in the cuelake namespace enough? Or can we predefine the following in the configmap?

  • Spark master endpoint
  • Redis endpoint
  • MinIO endpoint

vikrantcue commented 3 years ago

If we can use MinIO with Cuelake, what about AWS Glue? Can we use Hive Metastore?

Yes, you can use both AWS Glue and Hive as metastore for Iceberg.

Cuelake's default configuration is a Hive metastore with Postgres as the backend database.
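To make the metastore choice concrete, here is a minimal sketch of the Spark properties that register an Iceberg catalog backed by either a Hive metastore or AWS Glue. The property keys are standard Iceberg Spark catalog settings, but the helper `iceberg_catalog_conf`, the catalog name `lake`, the metastore URI, and the warehouse path are all hypothetical, and none of this has been checked against Cuelake's own deployment files.

```python
def iceberg_catalog_conf(name, kind, warehouse, metastore_uri=None):
    """Build Spark properties that register an Iceberg catalog called `name`.

    `kind` is "hive" (Hive metastore) or "glue" (AWS Glue Data Catalog).
    All arguments are illustrative; adjust them to your own cluster.
    """
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        # Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Root location where table data and metadata are written.
        f"{prefix}.warehouse": warehouse,
    }
    if kind == "hive":
        conf[f"{prefix}.type"] = "hive"
        if metastore_uri:
            # Thrift endpoint of the Hive metastore service.
            conf[f"{prefix}.uri"] = metastore_uri
    elif kind == "glue":
        conf[f"{prefix}.catalog-impl"] = "org.apache.iceberg.aws.glue.GlueCatalog"
        conf[f"{prefix}.io-impl"] = "org.apache.iceberg.aws.s3.S3FileIO"
    else:
        raise ValueError(f"unsupported metastore kind: {kind!r}")
    return conf


# Hypothetical values, for illustration only.
hive_conf = iceberg_catalog_conf(
    "lake", "hive", "s3a://warehouse/", "thrift://hive-metastore:9083"
)
glue_conf = iceberg_catalog_conf("lake", "glue", "s3://my-bucket/warehouse/")

for key, value in hive_conf.items():
    print(key, value)
```

Only the catalog properties differ between the two metastores; the SQL in the notebooks would address tables as `lake.<database>.<table>` in both cases.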

We are still confused about how Zeppelin connects to the Spark cluster. Is deploying Spark in the cuelake namespace enough? Or can we predefine the following in the configmap?

  • Spark master endpoint
  • Redis endpoint
  • MinIO endpoint

  1. The Spark driver and executors are created by Zeppelin whenever a notebook is run, so the Spark master endpoint is set to k8s://https://kubernetes.default.svc
  2. Redis is used in Cuelake to maintain the Celery job queue, and its endpoint is set by default to http://redis:6379
  3. The MinIO endpoint can be passed as Spark config, which can be set either in the Zeppelin interpreter settings or in a notebook before the interpreter starts. We are not sure about this, as we haven't tested MinIO yet.
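As a rough illustration of point 3, the sketch below builds the Hadoop S3A properties usually needed to point Spark at an S3-compatible endpoint and renders them as a Zeppelin `%spark.conf` paragraph, which has to run before the Spark interpreter starts. The S3A keys are standard Hadoop settings; the helper names, the endpoint `http://minio:9000`, and the credentials are hypothetical, and this remains untested with Cuelake.

```python
def minio_s3a_conf(endpoint, access_key, secret_key, use_ssl=False):
    """Return Spark properties that direct the S3A connector at MinIO."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is normally addressed as host/bucket, not bucket.host.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


def as_zeppelin_conf_paragraph(conf):
    """Render properties as a Zeppelin %spark.conf paragraph (one 'key value' per line)."""
    lines = ["%spark.conf"]
    lines += [f"{key} {value}" for key, value in conf.items()]
    return "\n".join(lines)


# Hypothetical endpoint and credentials, for illustration only.
paragraph = as_zeppelin_conf_paragraph(
    minio_s3a_conf("http://minio:9000", "minio-access-key", "minio-secret-key")
)
print(paragraph)
```

The same key/value pairs could instead be entered as properties in the Zeppelin Spark interpreter settings, which is the other option mentioned above.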

zainal-abidin-assegaf commented 3 years ago

@vikrantcue, thank you for the confirmation.

We look forward to hearing about a successful MinIO integration, tested with Cuelake.

If possible, please make the MinIO endpoint configurable in the configmap for a better user experience.

Cuelake could be one of the fastest ETL/ELT tools, thanks to the Spark cluster and Iceberg on object storage.

We look forward to the MinIO update. Thank you.