This project is no longer maintained.
This project provides docker images and marathon app definitions to run Apache Zeppelin on DC/OS. Images are published to Docker Hub.
This is a custom-built image based on the mesosphere Spark docker image, because the official Zeppelin docker image does not contain the necessary libraries for Mesos and cannot be configured for the extra features that are possible with DC/OS (see below).
This project is also available as a DC/OS universe package. Install it using `dcos package install zeppelin`.
The Spark interpreter in Zeppelin currently cannot be used on a DC/OS EE cluster with strict security mode enabled. Please use a cluster configured with disabled or permissive security mode.
```
$ dcos package install zeppelin
This DC/OS Service is currently in preview.
Continue installing? [yes/no] yes
Installing Marathon app for package [zeppelin] version [1.1-0.8.1-2.4.0]
DC/OS Zeppelin is being installed!
Documentation: https://github.com/dcos/examples/tree/master/zeppelin/1.11
Issues: https://dcos.io/community or
https://github.com/MaibornWolff/dcos-zeppelin
```
For detailed instructions on using the universe package see the DC/OS Examples repo.
Set the environment variables `SPARK_MESOS_EXECUTOR_DOCKER_IMAGE`, `SPARK_CORES_MAX` and `SPARK_EXECUTOR_MEMORY` depending on your cluster size and available resources. All features are also available in the Universe package. Use `dcos package describe zeppelin --config` to get a complete list of possible configuration options.
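As an illustrative sketch, these variables could be set in the `env` section of the Marathon app definition like this (the image tag and resource values are assumptions, not defaults of this project — adjust them to your cluster):

```json
{
  "env": {
    "SPARK_MESOS_EXECUTOR_DOCKER_IMAGE": "mesosphere/spark:2.4.0",
    "SPARK_CORES_MAX": "4",
    "SPARK_EXECUTOR_MEMORY": "4g"
  }
}
```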
The docker image is built to store notebook data on a persistent volume. To use it, add a volume definition to the app:

```json
{
  "container": {
    "volumes": [
      {
        "containerPath": "/zeppelin-data",
        "external": {
          "name": "volume-zeppelin-data",
          "provider": "dvdi",
          "options": {
            "dvdi/driver": "rexray"
          }
        },
        "mode": "RW"
      }
    ]
  }
}
```
and set the following environment variables:

- `ZEPPELIN_DATA_VOLUME`: the mount path of the volume (e.g. `/zeppelin-data`)
- `ZEPPELIN_NOTEBOOK_DIR`: a subpath of the volume (e.g. `/zeppelin-data/notebook`)

It is recommended to use an external persistent volume so that data is not lost even when a node breaks down.
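Using the example paths above, the corresponding `env` section of the app definition would look like this:

```json
{
  "env": {
    "ZEPPELIN_DATA_VOLUME": "/zeppelin-data",
    "ZEPPELIN_NOTEBOOK_DIR": "/zeppelin-data/notebook"
  }
}
```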
For authentication and authorization, Zeppelin uses Apache Shiro, configured via a `shiro.ini` file. The docker image searches for this file in the sandbox directory on startup. You can provide it either via the fetch file mechanism or as a secret (recommended; only available on DC/OS EE).
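As a starting point, a minimal `shiro.ini` might look like the following sketch. The user name, password and URL rules here are placeholders; see the Shiro and Zeppelin documentation for the full syntax:

```ini
[users]
# format: username = password, role
admin = change-me, admin

[roles]
admin = *

[urls]
# allow unauthenticated access to the version endpoint, require login for everything else
/api/version = anon
/** = authc
```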
To use a secret, execute the following steps:

1. Create the secret from your `shiro.ini` file:

```
dcos security secrets create -f shiro.ini zeppelin/shiro-conf
```

2. Add a secret definition to the app:

```json
{
  "secrets": {
    "shiroconf": {
      "source": "zeppelin/shiro-conf"
    }
  }
}
```

3. Reference the secret in the environment variable `ZEPPELIN_SHIRO_CONF`:

```json
{
  "env": {
    "ZEPPELIN_SHIRO_CONF": {
      "secret": "shiroconf"
    }
  }
}
```
To use the fetch file mechanism:

```json
{
  "fetch": [
    { "uri": "http://my.fileserver/zeppelin/shiro.ini", "extract": false, "executable": false, "cache": false }
  ]
}
```
To access HDFS from Zeppelin you need to provide the files `hdfs-site.xml` and `core-site.xml`. If you installed the HDFS framework from the Universe, you just need to add the following fetch definition to the app:
```json
{
  "fetch": [
    { "uri": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/hdfs-site.xml", "extract": false, "executable": false, "cache": false },
    { "uri": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/core-site.xml", "extract": false, "executable": false, "cache": false }
  ]
}
```
`SPARK_MESOS_EXECUTOR_DOCKER_VOLUMES` sets the list of volumes that will be mounted into the executor Docker image (the image set via `spark.mesos.executor.docker.image`). The format of this property is a comma-separated list of mappings following the form passed to `docker run -v`. Define the environment variable as:

```
SPARK_MESOS_EXECUTOR_DOCKER_VOLUMES="[host_path:]container_path[:ro|:rw]"
```

e.g.:

```
SPARK_MESOS_EXECUTOR_DOCKER_VOLUMES="/mnt/share/data:/data:rw"
```
You can provide your own custom `zeppelin-site.xml`:

```json
{
  "fetch": [
    { "uri": "http://my.fileserver/zeppelin/zeppelin-site.xml", "extract": false, "executable": false, "cache": false }
  ]
}
```
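For example, a minimal `zeppelin-site.xml` overriding the notebook directory could look like this. The property name `zeppelin.notebook.dir` is a standard Zeppelin setting; the value is only an illustration:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>zeppelin.notebook.dir</name>
    <value>/zeppelin-data/notebook</value>
  </property>
</configuration>
```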
The docker image contains Python 2.7 and Python 3.4. You can use the python and pyspark interpreters without further configuration. By default Python 2.7 is used; if you want to use Python 3.4, set the environment variable `PYSPARK_PYTHON` to `python3`.

You can also install additional Python packages at startup. To do that, set the environment variable `PYTHON_PACKAGES` to a space-separated list of packages (for example `PYTHON_PACKAGES="requests tensorflow"`). This list will be given to pip at startup. Be aware that installing packages increases the startup time of Zeppelin.
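Putting both settings together, the `env` section of the app definition could look like this (the package list is just the example from above):

```json
{
  "env": {
    "PYSPARK_PYTHON": "python3",
    "PYTHON_PACKAGES": "requests tensorflow"
  }
}
```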
The docker image contains R version 3.4 and already has the recommended packages from the Zeppelin documentation installed, specifically `devtools`, `knitr`, `ggplot2`, `mplot` and `googleVis`. You can install additional R packages at startup by setting the environment variable `R_PACKAGES`. The content of this variable will be fed directly to `install.packages()`, so be sure to use the correct syntax (e.g. `R_PACKAGES="c('glmnet', 'caret')"`). Be aware that installing packages can drastically increase the startup time of Zeppelin.
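In the app definition this would look like the following; note that the inner quotes are part of the R expression, so single quotes keep the JSON string valid:

```json
{
  "env": {
    "R_PACKAGES": "c('glmnet', 'caret')"
  }
}
```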
The provided marathon app definition by default allows access to Zeppelin via the admin router proxy ("Open Service" in the DC/OS UI). If you have marathon-lb installed you can also use it. Just add the following labels to the app:

```json
{
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "zeppelin.my.domain"
  }
}
```
There are two variants of the docker image, based on the download variants on the Zeppelin homepage: `all` (all interpreters) and `netinst` (just the Spark interpreter). By default the app definitions use the `all` variant. If you want the `netinst` variant, just change the `-all` in the docker image tag to `-netinst`.
The build script will build docker images with Zeppelin with all interpreters (`all`) or just the Spark interpreter (`netinst`):

```
./build.sh
```
This project is based on the official mesosphere spark docker image.
If you find a bug or have a feature request, just open an issue on GitHub. Or, if you want to contribute something, feel free to open a pull request.