zsxwing opened this issue 2 years ago
Do we have to write a Dockerfile?
@RishiKumarRay I think it's better to maintain a Dockerfile in GitHub and build and publish a new image when making a new release. Do you have any other suggestions?
@zsxwing Yeah, I think it's great to maintain a Dockerfile. I can see a Dockerfile here.
So we need a GitHub Action to publish the images.
@zsxwing What is the intended entrypoint for the Docker image? The various different shell options?
> So we need a GitHub Action to publish the images.
No, we don't need a GitHub Action. Currently our release process is manual; we will build the image from the Dockerfile manually when making a new release.
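For reference, the manual step would just be a couple of Docker CLI commands; a rough sketch, assuming the Dockerfile sits at the repo root (the image name and tag below are placeholders, not the project's actual coordinates):

```bash
# Build the image from the Dockerfile at the repo root.
# "deltaio/delta-docker" and the tag are illustrative placeholders.
docker build -t deltaio/delta-docker:0.1.0 .

# Publish the image to the deltaio organization on Docker Hub.
docker push deltaio/delta-docker:0.1.0
```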
> What is the intended entrypoint for the Docker image? The various different shell options?
Yep, I think the default can be pyspark. We can also provide options for spark-shell, or even a Jupyter notebook.
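To make that concrete, the entrypoint could look something like the sketch below, assuming a base image that already has Spark on the PATH; the delta-core version is illustrative, and the two --conf settings are the standard ones from the Delta quickstart:

```dockerfile
# Default to an interactive PySpark session preconfigured for Delta Lake.
# The delta-core version is illustrative; pin it to the release being published.
ENTRYPOINT ["pyspark", \
    "--packages", "io.delta:delta-core_2.12:2.1.0", \
    "--conf", "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension", \
    "--conf", "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"]
```

With that default, `docker run -it <image>` drops the user into pyspark, while overriding the entrypoint (e.g. `docker run -it --entrypoint spark-shell <image>` with the same flags) would give the Scala shell instead.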
@zsxwing @RishiKumarRay - any updates on this issue?
I'm looking for a simple Java (non-Spark) Docker container that lets anyone reproduce the "Zappy" fake example in the Delta docs and in this Databricks blog, in a simple "Hello World" manner, without installing anything locally.
It's not easy for someone coming into the Delta ecosystem to get up to speed - having a reproducible example managed as a Dockerfile in this repo would be a phenomenal value add.
I see the PR by @stikkireddy - I don't think this meets the use case, because it uses Spark in its Dockerfile and is just a wrapper around Jupyter notebooks (i.e., a simplified Databricks).
Rather, there should be a vanilla Java Dockerfile that has nothing to do with Apache Spark, to uphold the theme of the Standalone writer: let non-Spark clients become first-class Delta citizens. It's difficult for devs to do this when the barrier to entry is high; a Dockerfile (something like the sketch below) would reduce it to 5 minutes.
It should be trivial to create such a Dockerfile for the maintainers of this repo - your help is appreciated!
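To make the ask concrete, here is a rough sketch of what such a Spark-free image could look like, assuming a small Maven project built against the Delta Standalone library (io.delta:delta-standalone_2.12); the project layout, jar name, and base images are illustrative, not anything that exists in this repo:

```dockerfile
# Stage 1: build a tiny "Hello World" app against Delta Standalone (no Spark).
# The pom.xml is assumed to declare io.delta:delta-standalone_2.12 plus a
# hadoop-client dependency (which Delta Standalone needs for file I/O),
# and to produce a shaded/fat jar so dependencies are bundled.
FROM maven:3.8-openjdk-11 AS build
WORKDIR /app
COPY pom.xml .
COPY src ./src
RUN mvn -q -DskipTests package

# Stage 2: a slim runtime image that just runs the example.
FROM eclipse-temurin:11-jre
COPY --from=build /app/target/delta-standalone-example.jar /app.jar
# Writes and reads a sample Delta table on the container's local filesystem.
ENTRYPOINT ["java", "-jar", "/app.jar"]
```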
We will run Spark jobs based on Delta Lake on top of K8s, and we also need such a Dockerfile.
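For context, such an image would then be referenced from spark-submit when targeting Kubernetes; a rough sketch, where everything in angle brackets and the example class/jar are placeholders:

```bash
# Submit a Delta Lake job to Kubernetes using a custom Spark image.
# The API server, registry, image name, class, and jar are all illustrative.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<registry>/spark-delta:latest \
  --packages io.delta:delta-core_2.12:2.1.0 \
  --class com.example.MyDeltaJob \
  local:///opt/app/my-delta-job.jar
```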
@zsxwing is this what you were looking for? https://github.com/spaceandtimelabs/docker-spark-deltalake
@zsxwing - looks like there is already a PR open for this issue. What are your recommended next steps? If a new contributor would like to work on this issue, should they just open another PR?
Feel free to leave a comment in the open PR and pick up whatever is left.
I'll comment directly in PR #922, but for a Spark-based Docker image, I'm wondering if it would be helpful to leverage the proposed Spark image per SPIP: Support Docker Official Image for Spark.
Thanks @alberttwong - completely spaced out on this!
Yes, we have the Docker code as you noted, and we have also pushed the Delta Docker image to Docker Hub at http://go.delta.io/dockerhub.
I'm thinking that we need to create a separate docker repo (e.g., https://github.com/delta-io/docker) so we can automate the Docker builds. That said, since the Docker image has been created, this may be an appropriate time to close this issue. I will leave it open for a short while for more comments before closing it, eh?!
Currently it's not convenient to try out Delta Lake. People need to install Spark first and then follow multiple steps in https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake.
It would be great if we could publish a Docker image for Delta Lake so that people can try it out with a single docker command.
For such a Docker image, we can maintain a Dockerfile in the GitHub repo and publish a new image to https://hub.docker.com/u/deltaio with each release.
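For example, trying Delta Lake could then be a single command; the image name and tag here are hypothetical:

```bash
# Start an interactive shell preconfigured for Delta Lake.
# "deltaio/delta-docker:latest" is a hypothetical image name for illustration.
docker run -it deltaio/delta-docker:latest
```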