glfeng318 closed this issue 7 years ago
We don't use 2.0 and haven't tested against it. It's on the roadmap, but any help is welcome. It's just a Scala interop error there, an easy fix. Try commenting out `cogroup` for a start.
Is there any work or timeline for getting 2.x working? I'm currently using Spark 2.1.0.
Thanks.
@cgore is working on it https://github.com/cgore/powderkeg/tree/newer-spark
If you want to join his effort or start your own, I'd be happy to help/guide you.
@plandes We've made some progress but haven't worked on it in a few weeks. I'd be up for pairing on it sometime.
@cgore Thanks I might take you up on that. I need to first check to see if it fits my needs.
@cgrand and @cgore (or anyone): has it been tested and is it working with YARN as the cluster manager? If so, what URL would you use? I have only used spark-submit with master=yarn and deploy-mode=cluster on a YARN-configured Spark cluster.
One more question: does this project support Spark Streaming?
I've started my work by cloning @cgore's repo and bumping the project.clj Spark deps to 2.1.0. It looks like some Clojure updates are needed as well. @cgore, is the easiest way to do this for me to fork and then create a pull request against your repo? Here is where my changes are going for now:
Be sure to track the newer-spark branch. I went ahead and created a PR, @cgore (https://github.com/cgore/powderkeg/pull/1), hoping that it's OK to do so :)
Looking at your work @plandes and https://github.com/cgore/powderkeg/pull/1 (both also bump Kryo to 4.0.0 and the Scala version of com.twitter/chill to 2.11), I have a feeling that Spark 2.0 support might be close :)
I made the changes proposed by @viesti, but the second example doesn't work. However, I'm not really sure how to run the examples, and I might have it misconfigured.
I am using the (below) project to test powderkeg. Maybe someone can take a look and help me get up and running faster. At first I thought spark-submit had to run on the Spark master; however, I tried it from localhost and got a little further; that is, I actually got a REPL.
For testing, one can start a local Spark cluster by running `./start-all.sh` in `spark-2.1.0-bin-hadoop2.7/sbin` (assuming Spark 2.1.0 built with Hadoop 2.7 was downloaded). After that, a `lein repl` in your fork of powderkeg should be enough to follow along, as in the main readme. For testing an app, you'd bump the version in `project.clj` and do `lein install`, followed by creating a project in which you add your locally bumped version as a `:dependency`.
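A rough sketch of that workflow as shell commands (the download URL, directory names, and version are assumptions based on the Spark 2.1.0 / Hadoop 2.7 build mentioned above, not part of any powderkeg tooling):

```shell
# Download and unpack the prebuilt Spark 2.1.0 binaries (URL assumed)
curl -LO https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
tar xzf spark-2.1.0-bin-hadoop2.7.tgz

# Start a local standalone cluster (master plus one worker);
# the master URL appears in the logs under logs/ as spark://<host>:7077
./spark-2.1.0-bin-hadoop2.7/sbin/start-all.sh

# From your powderkeg fork: either work interactively...
lein repl

# ...or install the locally bumped version into ~/.m2 so a test
# project can depend on it via its :dependencies vector
lein install
```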
A note about the process here: https://github.com/plandes/powderkeg/commits/master and https://github.com/cgore/powderkeg/tree/newer-spark now have similar changes, which makes it slightly harder to propose a single PR upstream :). One fork with the base changes, supported by PRs from forks of that fork, might be better.
Is there any way to get it to work with a remote cluster?
Re process: I agree, but I wanted a quick start, to be able to show those changes, and I figured any changes I made could be reconciled with other developers easily. Regardless, the merges shouldn't be bad. Either I can PR into @cgore's fork or the primary repo, and I'm open to others' input.
Didn't try on a real cluster yet, just ran a local standalone Spark cluster from the prebuilt binaries and then did a `(keg/connect! "spark://hostname-from-start-all-sh-logfile:7077")`. For YARN, one might need a real cluster, or maybe try out a suitable Dockerized setup like this: https://hub.docker.com/r/gustavonalle/yarn/.
Re process: no worries, learning by doing is always good :) Was just thinking that @cgore's original commits could stay if we continue with forks on his work.
@plandes Using a REPL on YARN could be perfected. You have to run it in client mode (to get stdin & stdout), but when you do so `.addJar` doesn't work in all cases. If I remember correctly, @iig sshed into a machine on the same network as the cluster and ran spark-submit there. So in the long run, `.addJar` should be replaced by something that copies the jars first.
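For reference, a client-mode submission against YARN looks roughly like this (the application jar and main class are placeholders; this is a generic spark-submit sketch, not powderkeg's actual invocation):

```shell
# Client mode keeps the driver on the submitting machine, so its
# stdin/stdout (and thus a REPL) stay attached to your terminal;
# only the executors run inside the YARN cluster.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class my.app.Main \
  path/to/my-app-standalone.jar
```

In cluster mode the driver runs on a YARN node instead, which is why an interactive REPL requires client mode.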
I'm using a similar docker image to what @viesti mentioned: https://github.com/plandes/docker-spark-service/blob/master/Dockerfile
This is basically p7hb/docker-spark:2.1.0 with an automation script. However, when starting the REPL with `lein repl` in the powderkeg repo, I get the following:
```
user=> (keg/connect! "spark://192.168.99.100:7077")
ExceptionInfo Can't filter out Spark jars!  clojure.core/ex-info (core.clj:4617)
```
Sorry to have to do this, but can we reschedule until tomorrow? I caught a cold yesterday and expect to feel better by then, but right now I'm literally shivering in bed. Very sorry; I hardly ever get sick.
Sure, we'll reschedule.
Haven't run Dockerized Spark, but the standalone binary 2.1.0 seems to work fine for keg/connect! at least on my MacBook :)
@alan-wischmeyer-climate and I messed around with it some today, and the change @viesti made got us past the first example in the readme. It looks like there's an issue with the next example, though. We are trying to turn the first example into a unit test.
@plandes Sorry to hear you aren't feeling well, we'll loop you in next time.
Could you post a log around the issue? I could try reproducing it too, @cgore.
Spark 2 is now the default on master. Thanks to @cgore, @viesti, and @plandes for making it happen!
Is 2.0.2 not supported for now?