delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

Publish a docker image for Delta Lake #919

Open zsxwing opened 2 years ago

zsxwing commented 2 years ago

Currently it's not convenient to try out Delta Lake. People need to install Spark first and then follow the multiple steps in https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake

It would be great if we could publish a Docker image for Delta Lake so that people can try it out with a single docker command.

For such a Docker image, we can maintain a Dockerfile in the GitHub repo and publish a new image to https://hub.docker.com/u/deltaio with each release.
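As a rough illustration of the per-release publishing step, the sketch below only assembles the `docker build` and `docker push` commands for a given release version. The image name `deltaio/delta-lake`, the build context `.`, and the helper `release_commands` are hypothetical; only the `deltaio` Docker Hub organization comes from this thread.

```python
import shlex


def release_commands(version, repo="deltaio/delta-lake", context="."):
    """Return the docker commands for publishing one release.

    `repo` is a hypothetical image name under the deltaio organization,
    and `context` is assumed to be the directory holding the Dockerfile.
    """
    if not version:
        raise ValueError("a release version is required")
    versioned = f"{repo}:{version}"
    latest = f"{repo}:latest"
    return [
        # Build once and tag the result with both the version and "latest".
        ["docker", "build", "-t", versioned, "-t", latest, context],
        # Push both tags to the registry.
        ["docker", "push", versioned],
        ["docker", "push", latest],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in release_commands("X.Y.Z"):
        print(shlex.join(cmd))
```

Printing rather than executing keeps the step manual, which matches a release process where a maintainer runs each command by hand.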

RishiKumarRay commented 2 years ago

Do we have to write a dockerfile?

zsxwing commented 2 years ago

@RishiKumarRay I think it's better to maintain a Dockerfile in GitHub and build and publish a new image when making a new release. Do you have any other suggestions?

RishiKumarRay commented 2 years ago

@zsxwing Yeah, I think it's great to maintain a Dockerfile. I can see a Dockerfile here

RishiKumarRay commented 2 years ago

So we need a GitHub Actions workflow to publish images

stikkireddy commented 2 years ago

@zsxwing What is the intended entrypoint for the Docker image? The various shell options?

zsxwing commented 2 years ago

> So we need a GitHub Actions workflow to publish images

No, we don't need GitHub Actions. Our release process is currently manual, so we will build the image from the Dockerfile by hand when making a new release.

> What is the intended entrypoint for the Docker image? The various shell options?

Yep. I think the default can be pyspark. We can also provide options for spark-shell, or even a Jupyter notebook.
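As a rough illustration of that entrypoint idea, the sketch below maps a mode name to the command a container entrypoint might launch, defaulting to pyspark. The function `build_command`, the mode names, and the `delta_package` parameter (a Maven coordinate supplied by whoever builds the image) are assumptions, not anything decided in this thread. The two `--conf` settings are the ones the Delta quick start uses to enable Delta in a Spark session, and the Jupyter mode relies on Spark's `PYSPARK_DRIVER_PYTHON` environment variables.

```python
# Spark settings that enable Delta Lake, as given in the Delta quick start.
DELTA_CONF = [
    "--conf", "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension",
    "--conf", "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog",
]


def build_command(mode="pyspark", delta_package=None):
    """Return (argv, extra_env) for the requested shell.

    `delta_package` is a hypothetical parameter: the Maven coordinate of the
    Delta artifact to load. It is omitted when the jars are already baked
    into the image.
    """
    packages = ["--packages", delta_package] if delta_package else []
    if mode == "pyspark":
        return ["pyspark", *packages, *DELTA_CONF], {}
    if mode == "spark-shell":
        return ["spark-shell", *packages, *DELTA_CONF], {}
    if mode == "jupyter":
        # Ask pyspark to start its driver inside a Jupyter notebook server.
        env = {
            "PYSPARK_DRIVER_PYTHON": "jupyter",
            "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
        }
        return ["pyspark", *packages, *DELTA_CONF], env
    raise ValueError(f"unknown mode: {mode!r}")
```

A real entrypoint script would read the mode from its first argument, merge `extra_env` into the environment, and then exec the returned command; that part is left out here so the mapping itself stays easy to check.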

mdrakiburrahman commented 2 years ago

@zsxwing @RishiKumarRay - any updates on this issue?

I'm looking for a simple Java (non-Spark) Docker container that lets anyone reproduce the "Zappy" fake example in the Delta docs and in this Databricks blog - in a simple "Hello World" manner without installing anything locally.

It's not easy for someone coming into the Delta ecosystem to get up to speed - having a reproducible example managed as a Dockerfile in this repo would be a phenomenal value add.

I see the PR by @stikkireddy - I don't think this meets the use case, because he's using Spark in his Dockerfile - and it's just a wrapper around Jupyter notebooks (i.e. a simplified Databricks).

Rather, there should be a vanilla Java Dockerfile that has nothing to do with Apache Spark, to uphold the theme of the Standalone writer - let non-Spark clients become first-class Delta citizens. It's difficult for devs to do this if the barrier to entry is high; a Dockerfile would reduce this to 5 mins.

It should be trivial for the maintainers of this repo to create such a Dockerfile - your help is appreciated!

allenhaozi commented 2 years ago

We will run Spark jobs based on Delta Lake on top of K8s and also need such a Dockerfile.

avantgardnerio commented 2 years ago

@zsxwing is this what you were looking for? https://github.com/spaceandtimelabs/docker-spark-deltalake

MrPowers commented 2 years ago

@zsxwing - looks like there is already a PR open for this issue. What are your recommended next steps? If a new contributor would like to work on this issue, should they just open another PR?

zsxwing commented 2 years ago

Feel free to leave a comment in the open PR and pick up whatever is left.

dennyglee commented 2 years ago

I'll comment directly in PR #922, but for a Spark-based Docker image, I'm wondering if it would be helpful to leverage the proposed Spark Docker image per SPIP: Support Docker Official Image for Spark.

alberttwong commented 1 year ago

https://github.com/delta-io/delta-docs/tree/main/static/quickstart_docker

dennyglee commented 1 year ago

Thanks @alberttwong - completely spaced out on this!

Yes, we have the Docker code you noted, and we have also pushed the Delta Docker image to Docker Hub at http://go.delta.io/dockerhub.

I'm thinking that we need to create a separate Docker repo (e.g., https://github.com/delta-io/docker) so we can automate the Docker builds. That said, since the Docker image has been created, I think it may be appropriate to close this issue. I'll leave it open for a short while for more comments before closing it, eh?!