USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Addjar #60

Closed buggtb closed 7 years ago

buggtb commented 7 years ago

Ability to add jar to spark context via command line option.
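As a rough illustration of the idea, here is a minimal sketch of how a launcher might collect jar paths from the command line before handing them to Spark. It assumes the option is spelled `--add-jar` (the name used later in this thread) and that it accepts a comma-separated list in the style of spark-submit's `--jars`. The class and method names are hypothetical and are not taken from this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Sparkler's actual implementation.
class AddJarOption {

    // Collects jar paths from every "--add-jar <paths>" pair in args.
    // Assumption: <paths> is a comma-separated list, as with spark-submit --jars.
    static List<String> parseAddJars(String[] args) {
        List<String> jars = new ArrayList<>();
        for (int i = 0; i < args.length - 1; i++) {
            if ("--add-jar".equals(args[i])) {
                for (String path : args[i + 1].split(",")) {
                    String trimmed = path.trim();
                    if (!trimmed.isEmpty()) {
                        jars.add(trimmed);
                    }
                }
                i++; // skip the value that was just consumed
            }
        }
        return jars;
    }

    public static void main(String[] args) {
        List<String> jars = parseAddJars(args);
        // With Spark on the classpath, the list could then be passed on, e.g.
        //   new SparkConf().setJars(jars.toArray(new String[0]));
        System.out.println(jars);
    }
}
```

The Spark call is left as a comment so the sketch runs without any Spark dependency. `SparkConf.setJars` and `SparkContext.addJar` are the existing Spark entry points for shipping extra jars to a cluster.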

arelaxend commented 7 years ago

it's working

buggtb commented 7 years ago

Okay cool, at least I wasn't just making up requirements for the sake of it. I'll fix it up tonight or tomorrow and ship you over the changes.

karanjeets commented 7 years ago

Thanks @buggtb :-)

thammegowda commented 7 years ago

Thanks for doing this.

I feel we should simply delegate this task to the spark-submit tool: http://spark.apache.org/docs/latest/submitting-applications.html

I suggest this because spark-submit is a sophisticated tool built for this task. It gives us a complete solution, including allocation of resources (RAM, CPU, etc.) to the submitted job.

@karanjeets @buggtb Let me know your opinions. If you agree, I will update the Maven build to produce a jar optimized for spark-submit (optimization here meaning: exclude unnecessary libs such as Scala, Spark, and Hadoop, since they ought to be picked up from the deployed cluster).

buggtb commented 7 years ago

Personally, I like the fact that I can launch stuff to a remote cluster using the sparkler jar. As someone who is pretty Spark-ignorant, I very much like the convenience, although I can certainly see the benefits of using spark-submit. I would support both and accept that the jar launcher might be quick and dirty but provides a certain level of convenience over spark-submit, which we can certainly use for juju deployments and so on. Supporting Spark jobs from within the jar doesn't appear to add much overhead except size, and you could build "fat" and "thin" jars at build time to trim it down.

buggtb commented 7 years ago

I've updated the PR with the changes @karanjeets suggested.

karanjeets commented 7 years ago

@thammegowda - Thanks for sharing your thoughts on this. I would +1 @buggtb's response. I am also in favor of having the Spark libs in Sparkler. This is extremely helpful for people who run small crawls and don't want the hassle of standing up a cluster. I also liked @buggtb's idea of handling this at build time by creating 'fat' and 'thin' jars. The '--add-jar' option provides a great alternative, and its implementation suggests that it should behave identically to spark-submit's '--jars'.

If you don't have any other queries and approve this PR, I will go ahead and merge.

@buggtb Thanks for the changes. 👍

thammegowda commented 7 years ago

@karanjeets My suggestion was to support both types of builds. Maven has build profiles, with which we can produce one fat jar including all libs and another jar optimized for spark-submit.

:+1: proceed merging this PR.

We can support spark-submit for advanced use when the need arises; that will not disturb this functionality. We will have to support it eventually, since resource allocation is essential when sharing the cluster with other jobs. As of now, we can pass the same fat jar to spark-submit; the only worry is a higher chance of class version mismatches among transitive dependencies.

Thanks @buggtb

buggtb commented 7 years ago

To help get my builds and charm aligned, I'll merge this as you two have signed it off. Thanks for accepting it!