This project is no longer maintained.
This project provides docker images and marathon app definitions to run Apache Zeppelin on DC/OS. Images are published to Docker Hub.
This is a custom-built image based on the mesosphere Spark docker image, because the official Zeppelin docker image does not contain the necessary libraries for Mesos and cannot be configured for the extra features that are possible with DC/OS (see below).
This project is also available as a DC/OS universe package. Install it using `dcos package install zeppelin`.
The Spark interpreter in Zeppelin currently cannot be used on a DC/OS EE cluster with strict security mode enabled. Please use a cluster configured with disabled or permissive security mode.
```
$ dcos package install zeppelin
This DC/OS Service is currently in preview.
Continue installing? [yes/no] yes
Installing Marathon app for package [zeppelin] version [1.1-0.8.1-2.4.0]
DC/OS Zeppelin is being installed!
Documentation: https://github.com/dcos/examples/tree/master/zeppelin/1.11
Issues: https://dcos.io/community or
https://github.com/MaibornWolff/dcos-zeppelin
```
For detailed instructions on using the universe package see the DC/OS Examples repo.
Set the environment variables `SPARK_MESOS_EXECUTOR_DOCKER_IMAGE`, `SPARK_CORES_MAX` and `SPARK_EXECUTOR_MEMORY` depending on your cluster size and available resources. All features are also available in the Universe package. Use `dcos package describe zeppelin --config` to get a complete list of possible configuration options.
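As an illustrative sketch, these variables could be set in the `env` section of the Marathon app definition like this (the image tag and resource values are assumptions, not defaults of this project — adjust them to your cluster):

```json
{
  "env": {
    "SPARK_MESOS_EXECUTOR_DOCKER_IMAGE": "mesosphere/spark:2.4.0",
    "SPARK_CORES_MAX": "4",
    "SPARK_EXECUTOR_MEMORY": "4g"
  }
}
```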
The docker image is built to store notebook data on a persistent volume. To use it, add a volume definition to the app:

```json
{
  "container": {
    "volumes": [
      {
        "containerPath": "/zeppelin-data",
        "external": {
          "name": "volume-zeppelin-data",
          "provider": "dvdi",
          "options": {
            "dvdi/driver": "rexray"
          }
        },
        "mode": "RW"
      }
    ]
  }
}
```
and set the following environment variables:

- `ZEPPELIN_DATA_VOLUME`: the mount path of the volume (e.g. `/zeppelin-data`)
- `ZEPPELIN_NOTEBOOK_DIR`: a subpath of the volume (e.g. `/zeppelin-data/notebook`)

It is recommended to use an external persistent volume so that data is not lost even when a node breaks down.
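Using the example paths above, the corresponding `env` section of the app definition would look like this:

```json
{
  "env": {
    "ZEPPELIN_DATA_VOLUME": "/zeppelin-data",
    "ZEPPELIN_NOTEBOOK_DIR": "/zeppelin-data/notebook"
  }
}
```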
For authentication and authorization, Zeppelin uses Apache Shiro, configured via a `shiro.ini` file. The docker image searches for this file in the sandbox directory on startup. You can provide it either via the fetch file mechanism or as a secret (recommended; only available on DC/OS EE).
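As a starting point, a minimal `shiro.ini` might look like the following sketch. The user name, password and URL rules here are placeholders; see the Shiro and Zeppelin documentation for the full syntax:

```ini
[users]
# format: username = password, role
admin = change-me, admin

[roles]
admin = *

[urls]
# allow unauthenticated access to the version endpoint, require login for everything else
/api/version = anon
/** = authc
```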
To use a secret, execute the following steps:

1. Create the secret from your `shiro.ini` file:

```
dcos security secrets create -f shiro.ini zeppelin/shiro-conf
```

2. Add a secret definition to the app:

```json
{
  "secrets": {
    "shiroconf": {
      "source": "zeppelin/shiro-conf"
    }
  }
}
```

3. Reference the secret in the environment variable `ZEPPELIN_SHIRO_CONF`:

```json
{
  "env": {
    "ZEPPELIN_SHIRO_CONF": {
      "secret": "shiroconf"
    }
  }
}
```
To use the fetch file mechanism:

```json
{
  "fetch": [
    { "uri": "http://my.fileserver/zeppelin/shiro.ini", "extract": false, "executable": false, "cache": false }
  ]
}
```
To access HDFS from Zeppelin you need to provide the files `hdfs-site.xml` and `core-site.xml`. If you installed the HDFS framework from the Universe, you just need to add the following fetch definition to the app:
```json
{
  "fetch": [
    { "uri": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/hdfs-site.xml", "extract": false, "executable": false, "cache": false },
    { "uri": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/core-site.xml", "extract": false, "executable": false, "cache": false }
  ]
}
```
`SPARK_MESOS_EXECUTOR_DOCKER_VOLUMES` sets the list of volumes that will be mounted into the executor Docker image (the image set via `spark.mesos.executor.docker.image`). The format of this property is a comma-separated list of mappings following the form passed to `docker run -v`. Define the environment variable as:

```
SPARK_MESOS_EXECUTOR_DOCKER_VOLUMES="[host_path:]container_path[:ro|:rw]"
```

e.g.:

```
SPARK_MESOS_EXECUTOR_DOCKER_VOLUMES="/mnt/share/data:/data:rw"
```
You can provide your own custom `zeppelin-site.xml`:

```json
{
  "fetch": [
    { "uri": "http://my.fileserver/zeppelin/zeppelin-site.xml", "extract": false, "executable": false, "cache": false }
  ]
}
```
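For example, a minimal `zeppelin-site.xml` overriding the notebook directory could look like this. The property name `zeppelin.notebook.dir` is a standard Zeppelin setting; the value is only an illustration:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>zeppelin.notebook.dir</name>
    <value>/zeppelin-data/notebook</value>
  </property>
</configuration>
```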
The docker image contains Python 2.7 and Python 3.4. You can use the python and pyspark interpreters without further configuration. By default Python 2.7 is used; if you want to use Python 3.4, set the environment variable `PYSPARK_PYTHON` to `python3`.

You can also install additional Python packages at startup. To do that, set the environment variable `PYTHON_PACKAGES` to a space-separated list of packages (for example `PYTHON_PACKAGES="requests tensorflow"`). This list will be given to pip at startup. Be aware that installing packages increases the startup time of Zeppelin.
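Putting both settings together, the `env` section of the app definition could look like this (the package list is just the example from above):

```json
{
  "env": {
    "PYSPARK_PYTHON": "python3",
    "PYTHON_PACKAGES": "requests tensorflow"
  }
}
```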
The docker image contains R version 3.4 and already has the recommended packages from the Zeppelin documentation installed, specifically `devtools`, `knitr`, `ggplot2`, `mplot` and `googleVis`. You can install additional R packages at startup by setting the environment variable `R_PACKAGES`. The content of this variable will be fed directly to `install.packages()`, so be sure to use the correct syntax (e.g. `R_PACKAGES="c('glmnet', 'caret')"`). Be aware that installing packages can drastically increase the startup time of Zeppelin.
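In the app definition this would look like the following; note that the inner quotes are part of the R expression, so single quotes keep the JSON string valid:

```json
{
  "env": {
    "R_PACKAGES": "c('glmnet', 'caret')"
  }
}
```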
The provided marathon app definition by default allows access to Zeppelin via the admin router proxy ("Open Service" in the DC/OS UI). If you have marathon-lb installed you can also use it. Just add the following labels to the app:

```json
{
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "zeppelin.my.domain"
  }
}
```
There are two variants of the docker image, based on the download variants on the Zeppelin homepage: `all` (all interpreters) and `netinst` (just the Spark interpreter). By default the app definitions use the `all` variant. If you want the `netinst` variant, just change the `-all` in the docker image tag to `-netinst`.
The build script will build docker images with Zeppelin with all interpreters (`all`) or just the Spark interpreter (`netinst`):

```
./build.sh
```
This project is based on the official mesosphere spark docker image.
If you find a bug or have a feature request, just open an issue on GitHub. Or, if you want to contribute something, feel free to open a pull request.