dylanmei / docker-zeppelin

Docker build for Zeppelin, a web-based Spark notebook
221 stars 122 forks

Do we have all the below interpreters in this docker-zeppelin container? #6

Closed mkscala closed 8 years ago

mkscala commented 8 years ago

zeppelin-interpreter, zeppelin-zengine, spark-dependencies, spark, markdown, angular, shell, hive, phoenix, postgresql, jdbc, tajo, flink, ignite, kylin, lens, cassandra, elasticsearch, interpreter, notebook

mkscala commented 8 years ago

When I tried %elasticsearch and %hive, they do not seem to work. %sql and %pyspark are working fine.

dylanmei commented 8 years ago

Are you using dylanmei/zeppelin:latest or a custom build? How did you configure the elasticsearch and hive interpreters?
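
One quick way to check which interpreters the image actually ships is to list the interpreter directory from a notebook paragraph; a minimal sketch, assuming Zeppelin is installed under /usr/zeppelin in this image:

%sh ls /usr/zeppelin/interpreter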

mkscala commented 8 years ago

I have used dylanmei/zeppelin:latest directly, without any changes. Does ElasticSearch come within this, or does it only have the Spark cluster? Have you tried ElasticSearch? If so, can I add an ElasticSearch cluster to a custom build? Please let me know.

dylanmei commented 8 years ago

Zeppelin only comes with the ElasticSearch interpreter, not ElasticSearch itself. However, using docker-compose it's trivial to add ElasticSearch.

1) Create a new docker-compose.yml, and add an elasticsearch image, like below:

zeppelin:
  image: dylanmei/zeppelin:latest
  environment:
    ZEPPELIN_PORT: 8080
  links:
    # makes the elasticsearch container resolvable as hostname "elasticsearch"
    - elasticsearch:elasticsearch
  ports:
    - 8080:8080
elasticsearch:
  image: elasticsearch:2.3
  ports:
    - 9200:9200
    - 9300:9300

Run docker-compose up and open Zeppelin on port 8080 as usual.

2) Go to the interpreters section, scroll down to the ElasticSearch interpreter, and change elasticsearch.host from localhost to elasticsearch. When you save, if Zeppelin asks to restart the interpreter, choose yes.

3) Create a new notebook, add these three paragraphs, and run them.

First paragraph:

%sh curl -s http://elasticsearch:9200/

Second paragraph:

%elasticsearch index /testing/test/1 {"hello": "world"}

Third paragraph:

%elasticsearch search /testing/test

You can learn more about the ElasticSearch interpreter here: http://zeppelin.incubator.apache.org/docs/0.6.0-incubating-SNAPSHOT/interpreter/elasticsearch.html
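
The interpreter's other verbs follow the same shape as index and search; a short sketch based on that documentation page, reusing the testing/test index from step 3:

%elasticsearch get /testing/test/1

%elasticsearch count /testing/test

%elasticsearch delete /testing/test/1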

mkscala commented 8 years ago

Awesome. Lots of details. Thanks a lot. I will try and let you know. With these additions, can I try the below in your dylanmei/zeppelin:latest container? Basically I want to store the data processed in Spark to Elasticsearch. Do you have some sample? I will try it over the weekend anyway.

https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._

val conf = ...
val sc = new SparkContext(conf)

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")

sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

mkscala commented 8 years ago

Hi, do you have a Skype or email ID so that I can clarify a few doubts?

dylanmei commented 8 years ago

Ok, that is different from using the interpreter.

You need to add elasticsearch settings to the Spark interpreter, and load an elasticsearch-spark dependency.

Update your docker-compose.yml:

zeppelin:
  image: dylanmei/zeppelin:latest
  environment:
    ZEPPELIN_PORT: 8080
    ZEPPELIN_JAVA_OPTS: >-
      -Dspark.driver.memory=1g
      -Dspark.executor.memory=2g
    SPARK_HOME: /usr/spark
    # the spark.es.* values are handed through to the elasticsearch-spark connector
    SPARK_SUBMIT_OPTIONS: >-
      --conf spark.es.nodes=elasticsearch
      --conf spark.es.nodes.wan.only=true
      --conf spark.es.port=9200
    MASTER: local[*]
  links:
    - elasticsearch:elasticsearch
  ports:
    - 8080:8080
elasticsearch:
  image: elasticsearch:2.3
  ports:
    - 9200:9200
    - 9300:9300

In a new notebook, add a dependency paragraph

%dep z.load("org.elasticsearch:elasticsearch-spark_2.10:2.2.0")
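
One caveat from the Zeppelin docs rather than this thread: a %dep paragraph only takes effect if it runs before the Spark interpreter starts, so restart the Spark interpreter first if you have already run a %spark paragraph. z.reset() clears any previously loaded artifacts:

%dep
z.reset()
z.load("org.elasticsearch:elasticsearch-spark_2.10:2.2.0")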

Write to an index

%spark
import org.elasticsearch.spark.sql._

case class Thing(id: Integer, name: String)
val things = Seq(
  Thing(1, "a"),
  Thing(2, "b"),
  Thing(3, "c"),
  Thing(4, "d"),
  Thing(5, "e"))

val df1 = sc.parallelize(things).toDF()
// es.mapping.id tells Elasticsearch to use the id field as the document _id
EsSparkSQL.saveToEs(df1, "things/thing", Map("es.mapping.id" -> "id"))

Read from the index

%spark
val df2 = EsSparkSQL.esDF(sqlc, "things/thing")
df2.show()

Your output should be

df2: org.apache.spark.sql.DataFrame = [id: bigint, name: string]
+---+----+
| id|name|
+---+----+
|  5|   e|
|  2|   b|
|  4|   d|
|  1|   a|
|  3|   c|
+---+----+
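
If you only want matching documents back, esDF also takes a query argument that is pushed down to Elasticsearch; a minimal sketch, assuming the three-argument overload in elasticsearch-spark 2.x (the name:a filter is just an example):

%spark
// hypothetical filter: only documents whose name field matches "a"
val matching = EsSparkSQL.esDF(sqlc, "things/thing", "?q=name:a")
matching.show()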

mkscala commented 8 years ago

Awesome, I will try this. This is a cool example. I basically want to mine raw email/chat transcripts for certain keywords and group the documents (chat/email) accordingly. Have you tried any ML algorithms for that with Spark/Elasticsearch/MLlib?

dylanmei commented 8 years ago

That's really interesting. I have not tried anything like that. I have done heavy writing into ElasticSearch with Spark and it works well.
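
For reference, that heavy-write path is just the RDD-level saveToEs from the docs snippet you pasted earlier; a minimal sketch (the docs/line index and the document shape are made up for illustration):

%spark
import org.elasticsearch.spark._

// hypothetical bulk load: one Map per document, keyed by the id field
val lines = sc.parallelize(1 to 100000).map(i => Map("id" -> i, "text" -> s"line $i"))
lines.saveToEs("docs/line", Map("es.mapping.id" -> "id"))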

dylanmei commented 8 years ago

There is now an ElasticSearch-specific example docker-compose file in the ./examples directory, based on our conversations here. You may need to re-pull the dylanmei/zeppelin:latest image to use it.

mkscala commented 8 years ago

Awesome. Thanks for all your help.