HCADatalab / powderkeg

Live-coding the cluster!
Eclipse Public License 1.0

Spark 2 support #4

Closed glfeng318 closed 7 years ago

glfeng318 commented 7 years ago
test.core=> (require '[powderkeg.core :as keg])
Preparing for self instrumentation.
Ouroboros succesfully eating its own tail!
Counting classes.
Retrieving bytecode of 531 classes dynamically defined by Clojure (out of 3821 classes)... done!
Instrumenting clojure.lang.Var... done!

CompilerException java.lang.IllegalArgumentException: No matching ctor found for class org.apache.spark.rdd.CoGroupedRDD, compiling:(powderkeg/core.clj:398:13)

Is Spark 2.0.2 not supported for now?

cgrand commented 7 years ago

We don't use 2.0 and haven't tested against it. It's on the roadmap, but any help is welcome. Here it's just a Scala interop error, an easy fix. Try commenting out cogroup for a start.
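For context, a guessed sketch of the shape of such a fix (this is an assumption, not the actual patch): if the Spark 2 CoGroupedRDD constructor gained an implicit ClassTag[K] parameter, which would explain the "No matching ctor" above, the interop call would need an explicit trailing argument:

;; Guessed sketch only: assumes Spark 2's CoGroupedRDD ctor takes an
;; extra scala.reflect.ClassTag argument after the Seq of RDDs and the
;; Partitioner. rdd-seq and part are assumed to be built as in core.clj.
(import '(org.apache.spark.rdd CoGroupedRDD)
        '(scala.reflect ClassTag$))

(defn cogrouped-rdd [rdd-seq part]
  (CoGroupedRDD. rdd-seq part (.apply ClassTag$/MODULE$ Object)))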

plandes commented 7 years ago

Is there any work or timeline for getting 2.x working? I'm currently using Spark 2.1.0.

Thanks.

cgrand commented 7 years ago

@cgore is working on it: https://github.com/cgore/powderkeg/tree/newer-spark

If you want to join his effort or start your own, I'd be happy to help/guide you.

cgore commented 7 years ago

@plandes We've made some progress but haven't worked on it in a few weeks. I'd be up for pairing on it sometime.

plandes commented 7 years ago

@cgore Thanks, I might take you up on that. I need to first check whether it fits my needs.

@cgrand and @cgore (or anyone): has this been tested and does it work with YARN as the cluster manager? If so, what master URL would you use? I have only used spark-submit with master=yarn and deploy-mode=cluster against a YARN-configured Spark cluster.

One more question: does this project support Spark Streaming?

plandes commented 7 years ago

I've started my work by cloning @cgore's repo and bumping the project.clj Spark deps to 2.1.0. It looks like some Clojure updates are needed as well. @cgore, is the easiest way to do this for me to fork and then create a pull request against your repo? Here is where my changes are going for now:

https://github.com/plandes/powderkeg

viesti commented 7 years ago

Be sure to track the newer-spark branch. I went ahead and created a PR, @cgore (https://github.com/cgore/powderkeg/pull/1), hoping that it's OK to do so :)

viesti commented 7 years ago

Looking at your work @plandes and https://github.com/cgore/powderkeg/pull/1 (both also bump Kryo to 4.0.0 and the Scala version of com.twitter/chill to 2.11), I have a feeling that Spark 2.0 support might be close :)
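Roughly, the :dependencies bump under discussion would look like this in project.clj (the chill version here is an assumption; the Spark and Kryo versions are the ones mentioned above):

;; Sketch of the dependency bump; the chill version is assumed.
:dependencies [[org.apache.spark/spark-core_2.11 "2.1.0"]
               [com.esotericsoftware/kryo "4.0.0"]
               [com.twitter/chill_2.11 "0.8.0"]]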

plandes commented 7 years ago

I made the changes proposed by @viesti, but the second example doesn't work. However, I'm not really sure how to run the examples, and I might have it misconfigured.

plandes commented 7 years ago

I am using the project below to test powderkeg. Maybe someone can take a look and help me get up and running faster. At first I thought spark-submit had to run on the Spark master; however, I tried it from localhost and got a little further, that is, I actually got a REPL.

https://github.com/plandes/clj-pktest

viesti commented 7 years ago

For testing, one can start a local Spark cluster by running ./start-all.sh in spark-2.1.0-bin-hadoop2.7/sbin (assuming spark-2.1.0 built with Hadoop 2.7 was downloaded). After that, a lein repl in your fork of powderkeg should be enough to follow along, as in the main README.
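For example, once the master is up, a session paraphrased from the README would look something like this (the master URL comes from the start-all.sh log; the exact forms may differ from the README):

user=> (require '[powderkeg.core :as keg])
user=> (keg/connect! "spark://<master-host>:7077")
user=> (into [] (keg/rdd (range 10) (filter odd?)))
[1 3 5 7 9]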

For testing an app, you'd bump the version in project.clj and do lein install, followed by creating a project in which you add your locally bumped version as a :dependency.
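Concretely, with a made-up version number (the real coordinates come from the fork's project.clj):

;; In the fork's project.clj, bump the version, e.g.
;;   (defproject powderkeg "0.5.0-spark2-SNAPSHOT" ...)
;; run lein install, then in the test app's project.clj:
:dependencies [[powderkeg "0.5.0-spark2-SNAPSHOT"]]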

A note about the process here: https://github.com/plandes/powderkeg/commits/master and https://github.com/cgore/powderkeg/tree/newer-spark now have similar changes, which makes it slightly harder to propose a single PR upstream :). One fork carrying the base changes, supported by PRs from forks of that fork, might be better.

plandes commented 7 years ago

Is there any way to get it to work with a remote cluster?

Re process: I agree, but I wanted to get a quick start, be able to show those changes, and figured any conflicts I introduced could be resolved with the other developers easily. Regardless, the merges shouldn't be bad. I can either PR into @cgore's fork or into the primary repo, and I'm open to others' input.

viesti commented 7 years ago

Didn't yet try on a real cluster; I just ran a local standalone Spark cluster from the prebuilt binaries and then did a (keg/connect! "spark://hostname-from-start-all-sh-logfile:7077"). For YARN, one might need a real cluster, or maybe try out a suitable Dockerized setup like this: https://hub.docker.com/r/gustavonalle/yarn/.

Re process: no worries, learning by doing is always good :) I was just thinking that @cgore's original commits could stay if we continue with forks of his work.

cgrand commented 7 years ago

@plandes Using a REPL on YARN could be improved. You have to run in client mode (to get stdin & stdout), but when you do so, .addJar doesn't work in all cases. If I remember correctly, @iig sshed into a machine on the same network as the cluster and ran spark-submit there.

So in the long run, .addJar should be replaced by something that copies the jars first.
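A minimal sketch of that idea, assuming HDFS as the shared storage (the helper below is hypothetical, not part of powderkeg):

;; Hypothetical helper: copy a local jar to HDFS, then register the
;; hdfs:// URL with the SparkContext instead of the local path, so
;; executors can fetch it regardless of where the REPL runs.
(import '(org.apache.hadoop.conf Configuration)
        '(org.apache.hadoop.fs FileSystem Path))

(defn add-jar-via-hdfs! [sc ^String local-jar ^String hdfs-dir]
  (let [dst (Path. (str hdfs-dir "/" (.getName (java.io.File. local-jar))))
        fs  (FileSystem/get (.toUri dst) (Configuration.))]
    (.copyFromLocalFile fs (Path. local-jar) dst)
    (.addJar sc (str dst))))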

plandes commented 7 years ago

I'm using a Docker image similar to the one @viesti mentioned: https://github.com/plandes/docker-spark-service/blob/master/Dockerfile

This is basically p7hb/docker-spark:2.1.0 with an automation script. However, when starting the REPL with lein repl in the powderkeg repo, I get the following:

user=> (keg/connect! "spark://192.168.99.100:7077")

user=> ExceptionInfo Can't filter out Spark jars!  clojure.core/ex-info (core.clj:4617)

plandes commented 7 years ago

Sorry to have to do this, but can we reschedule for tomorrow? I caught a cold yesterday but am expecting to feel better. However, I'm literally shivering in bed. Very sorry; I hardly ever get sick.

cgore commented 7 years ago

Sure, we'll reschedule.

viesti commented 7 years ago

Haven't run Dockerized Spark, but the standalone 2.1.0 binaries seem to work fine for keg/connect!, at least on my MacBook :)

cgore commented 7 years ago

@alan-wischmeyer-climate and I messed around with it some today, and the change @viesti made got us past the first example in the README. It looks like there's an issue with the next example, though. We are trying to turn the first example into a unit test.
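Something along these lines, perhaps (untested sketch; assumes keg/connect! accepts a local[2] master the same way it does a standalone URL):

;; Hypothetical test sketch based on the README's first example.
(ns powderkeg.readme-test
  (:require [clojure.test :refer [deftest is]]
            [powderkeg.core :as keg]))

(deftest first-readme-example
  (keg/connect! "local[2]")
  (is (= [1 3 5 7 9]
         (into [] (keg/rdd (range 10) (filter odd?))))))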

@plandes Sorry to hear you aren't feeling well, we'll loop you in next time.

viesti commented 7 years ago

Could you post a log around the issue? I could try reproducing it too, @cgore.

cgrand commented 7 years ago

Spark 2 is now the default on master. Thanks to @cgore, @viesti, and @plandes for making it happen!