big-data-europe / docker-hadoop-spark-workbench

[EXPERIMENTAL] This repo includes deployment instructions for running HDFS/Spark inside docker containers. Also includes spark-notebook and HDFS FileBrowser.

General Questions - Multihost, Spark Version and Apache Zeppelin #15

Closed prof-schacht closed 7 years ago

prof-schacht commented 8 years ago

Hi

I have just a few general questions. How do you handle the docker containers on multiple physical hosts?

Is it possible to extend the example to Spark version 2.0 and also add Apache Zeppelin as a notebook driver to this repo?

And is it possible to also scale the datanodes of Hadoop HDFS?

BR

earthquakesan commented 8 years ago

Hi @prof-schacht,

For multiple physical hosts, see overlay networking in Docker.
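With compose file version 2 and a Swarm cluster in place, the hadoop network used below can be declared as an overlay network; a minimal sketch (assumes a working Swarm and key-value store setup):

networks:
  hadoop:
    driver: overlay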

Spark 2.0.0 with Hive is available in our repo; essentially you just need to change the version number. You can see an example setup of Hadoop, Spark + Hive used for the SqlSpark application here.
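The change itself is just the image tag in docker-compose.yml, along these lines (the tag shown is hypothetical -- check the repo for the available tags):

  spark-master:
    image: earthquakesan/hadoop-spark-master:2.0.0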

I have no experience with Apache Zeppelin. I would recommend taking a look at the existing docker images and seeing if something can work with a remote Spark/Hadoop installation via configuration options (or Hive).

To scale HDFS you need to spawn more datanode containers and connect them to the namenode, in the same manner as defined in the docker-compose.yml.
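For example, a second datanode could look like this (a sketch following the pattern in docker-compose.yml; the service name, volume path, and image tag are assumptions):

  datanode2:
    image: bde2020/hadoop-datanode:1.0.0
    hostname: datanode2
    container_name: datanode2
    domainname: hadoop
    networks:
      - hadoop
    volumes:
      - ./data/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env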

prof-schacht commented 8 years ago

Hi,

thanks for your advice. I have not worked with overlay networks before. But is there a possibility to define, in the docker-compose file, the host on which certain docker nodes should run? Or do I have to start each docker node on each host after I have created the overlay network?

I was wondering if it is possible to have one script on one of the bare-metal hosts which fires up all aspects (overlay network, docker pull and run) on all hosts, and can also tear everything down from there.

Another question: I am trying to add Zeppelin to this configuration instead of spark-notebook. But for that I would need the hadoop configuration files. Could you give me a hint where I can find them?

Thanks again for your help.

BR

earthquakesan commented 8 years ago

Hi @prof-schacht,

For running a container on a particular node, read about the general tools Docker Swarm provides in this article, and about how to use Compose with them in this article. For instance, if you want to run a container on a node with the "frontend" label (one of the docker hosts should have its docker daemon started with this label -- you set up labels in /etc/default/docker on Ubuntu 14.04), you would do something like this:

version: "2"
services:
  bar:
    image: bar
    labels:
      - "constraint:node==frontend"

For setting up docker automatically, we use ansible for the pilots in the BDE project. You can find the deliverables of the project on the official web site; what you should look for is in D5.1 on page 8 (see ansible). We use it just for deploying docker, but you can extend the scripts for your particular use case. I recommend using ansible for this task in general.
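As a rough illustration of the approach, a minimal hypothetical playbook (not the actual D5.1 scripts; the host group and package names are assumptions):

- hosts: docker_hosts
  become: yes
  tasks:
    # assumes Ubuntu hosts using the distribution docker package
    - name: install docker engine
      apt:
        name: docker.io
        state: present
        update_cache: yes
    - name: ensure the docker daemon is running
      service:
        name: docker
        state: started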

Regarding the hadoop configuration, it works as follows. Every service refers to a hadoop.env file. For example, the namenode (see env_file):

  namenode:
    image: bde2020/hadoop-namenode:1.0.0
    hostname: namenode
    container_name: namenode
    domainname: hadoop
    networks:
      - hadoop
    volumes:
      - ./data/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    ports:
      - "50070:50070"
      - "8020:8020"

Then in hadoop.env you have variables whose names encode filename + parameter + value:

CORE_CONF_fs_defaultFS=hdfs://namenode:8020

This variable will inject fs.defaultFS with the value hdfs://namenode:8020 into core-site.xml. The injection of the vars is defined in this script, which is part of the docker-hadoop repo.
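Other variables in hadoop.env follow the same FILE_CONF_parameter=value pattern, with underscores in the parameter part becoming dots in the property name; for example (the values here are only illustrative):

HDFS_CONF_dfs_replication=1
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager

The first becomes dfs.replication in hdfs-site.xml, the second yarn.resourcemanager.hostname in yarn-site.xml.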

Please, make a pull request, when you are done. That would be a really nice contribution to the project! If you need any help, do not hesitate to ask.

DrSnowbird commented 7 years ago

Hi, regarding the Zeppelin integration (either server or docker): I am trying to have Zeppelin access the Spark master via port 7077, exposed in docker-compose*.yml. Could I suggest that you add one more line to expose the port, so that I don't have to do it in my fork?

spark-master:
  image: earthquakesan/hadoop-spark-master:1.0.0
  hostname: spark-master
  container_name: spark-master
  domainname: hadoop
  networks:
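Something like this added to the spark-master service is what I have in mind (a sketch; 7077 is the standard Spark master port):

  ports:
    - "7077:7077"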

Thanks much. Ray

earthquakesan commented 7 years ago

Hi Ray! @DrSnowbird

Done. Please use docker-compose-java8.yml, as it contains the latest version of hadoop/spark and the Java 8 JDK. If your Zeppelin docker image is free to share, please link it. :-)

Kind regards, Ivan.

DrSnowbird commented 7 years ago

I am still working out how to adopt your way of setting up Zeppelin. Currently, for the host-based setup, I used the REST API to pull the information for all the client configurations for Zeppelin using Ambari. But there is no Ambari in your docker-based setup, so I have to adapt to your way of setting up the Zeppelin client. I will share the link once I am done.

earthquakesan commented 7 years ago

@DrSnowbird hi! Any updates on Zeppelin integration?

DrSnowbird commented 7 years ago

Hi Ivan, I am working on and off on the Zeppelin integration, since my day job is taking a lot of my energy. Currently, I am in the final stage of fixing a version incompatibility of the Java object serialVersionUID between Zeppelin and Spark/Scala. Once I overcome that, I will send you the request for figuring out how to deploy this additional module.

Cheers, - Ray

== As you can see, I am working on my git branch with one more "folder" than yours:

[user1@xeon1 docker-hadoop-spark-workbench]$ cd '/mnt/xeon1_data/docker-github-PUBLIC/docker-spark-bde2020'
[user1@xeon1 docker-spark-bde2020]$ ll
total 28
drwxrwxr-x. 2 user1 user1 4096 May  7 19:15 base
drwxrwxr-x. 2 user1 user1 4096 May  7 19:15 master
-rw-rw-r--. 1 user1 user1 2473 May  7 19:15 README.md
drwxrwxr-x. 2 user1 user1 4096 May  7 19:15 submit
drwxrwxr-x. 4 user1 user1 4096 May  7 19:15 template
drwxrwxr-x. 2 user1 user1 4096 May  7 19:15 worker
drwxrwxr-x. 8 user1 user1 4096 May 13 16:25 zeppelin


earthquakesan commented 7 years ago

@DrSnowbird

Hi Ray,

I am very sorry to spoil all the fun: https://github.com/big-data-europe/docker-zeppelin

And here is an example with an external jar which uses it: https://github.com/SANSA-Stack/SANSA-Notebooks

Please, take a look. You can proceed with your integration and then we can merge it into a better one.

Enjoy the weekend, Ivan.

DrSnowbird commented 7 years ago

Hi Ivan, I was able to overcome the object serial version conflict and it is now running OK. I tested Spark notebooks with the Zeppelin interpreters spark and Python3 (including matplotlib), and they are working correctly. I will check the rest of the Zeppelin interpreters tomorrow, run more tests and polish further, and then I will check in the code for merging with yours.

DrSnowbird commented 7 years ago

Hi Ivan, I checked the Zeppelin work I have been doing into my GitHub. It is just the initial working version; it can be further improved and simplified, as your way is quite simple. But I want to make it parameterized all the way from the host environment -- it is not fully parameterized to the host env yet. For now, you can check it out on both Docker Hub and GitHub.

DrSnowbird commented 7 years ago

Ivan, I cleaned up the directory and created a new GitHub repo that focuses just on Zeppelin and points to your BDE2020 big-data project GitHub.

And the Docker Hub build: openkbs/docker-spark-bde2020-zeppelin