NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

A true Scoobi install (no requirement to build from source) #234

Closed blever closed 1 year ago

blever commented 11 years ago

Scoobi is almost at 0.7.0 so it's time for it to grow up and support users without requiring them to download and build the source from scratch. This is how I propose the experience of using Scoobi should be for someone who is not a Scoobi developer:

User experience

In the simplest case, a Scoobi application can be developed with a minimal build.sbt (do I hear giter8, anyone?), built into a user JAR using sbt package, and run via a scoobi launcher script:

name := "ScoobiSnax"

version := "1.0"

scalaVersion := "2.10.1"

libraryDependencies ++= Seq(
  "com.nicta" %% "scoobi" % "0.7.0-cdh4" % "provided"
)

resolvers ++= Seq(
  "sonatype" at "http://oss.sonatype.org/content/repositories/releases/"
)
> scoobi jar ScoobiSnax.jar com.acme.DooMain in-files out-files
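
For concreteness, the user JAR above only needs to contain the application object itself. The following is a sketch of what com.acme.DooMain might look like, based on the word-count example from the Scoobi documentation (the exact persist/combine signatures shifted slightly between Scoobi releases, so treat this as illustrative):

package com.acme

import com.nicta.scoobi.Scoobi._

object DooMain extends ScoobiApp {
  def run() {
    // ScoobiApp strips the Hadoop/Scoobi arguments before calling run(),
    // so in-files and out-files arrive here as args(0) and args(1).
    val lines  = fromTextFile(args(0))
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .groupByKey
                      .combine(_+_)
    persist(toTextFile(counts, args(1)))
  }
}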

The scoobi script is akin to the hadoop script and is essentially responsible for ensuring all the correct JARs are available on both the client and cluster, and with the correct precedence. Apart from the core user JAR (e.g. ScoobiSnax.jar above), the scoobi script will locate other JARs in the following places:

  • $SCOOBI_HOME/lib: Note that the scoobi dependency is marked as provided above as it and all its dependencies are provided by the Scoobi install (see below). The lib directory contains the scoobi_2.10-0.7.0-cdh4.jar as well as all its dependencies, e.g. scala-library, scalaz-core, xstream, javassist, avro, etc. Note, however, it doesn't contain any CDH4 Hadoop JARs.

  • hadoop classpath: The classpath according to the hadoop script. This generally requires that HADOOP_HOME be set.

To install Scoobi, a user will download and un-tar the scoobi tarball, e.g. scoobi-0.7.0-cdh4.tgz. This could be placed in /usr/lib/scoobi to allow access for many people, or simply in a location in the user's home directory. The SCOOBI_HOME environment variable should be set to point to this new directory. A Scoobi install would then result in a $SCOOBI_HOME looking like the following:

  • bin: contains the scoobi script
  • lib: the Scoobi JAR plus all its dependent JARs, excluding Hadoop JARs
  • src: the source used to build the above Scoobi JAR - essentially a copy of the repository used to build the Scoobi JAR
  • probably a bunch of other files for completeness, like README.md, LICENSE.txt, NOTICE.txt, etc.

Most applications will of course be dependent on more than just Scoobi:

name := "ScoobiSnax2"

version := "1.0"

scalaVersion := "2.10.1"

libraryDependencies ++= Seq(
  "com.nicta" %% "scoobi" % "0.7.0-cdh4" % "provided",
  "org.spire-math" %% "spire" % "0.3.0"
)

resolvers ++= Seq(
  "sonatype" at "http://oss.sonatype.org/content/repositories/releases/"
)

In this case, a user could opt to use something like sbt-assembly to construct a fat JAR containing their user code as well as spire (see the sbt-assembly sketch below). With this approach, it's business as usual. Alternatively, if the user sticks with sbt package, the spire JAR can be located using the following mechanisms:

  • Setting SCOOBI_CLASSPATH to include the spire JARs (and any other JARs for that matter);
  • Using the -cp option with the launcher script: scoobi -cp spire.jar
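
For the fat JAR route, the build changes are small. Here is a sketch against the sbt 0.12-era sbt-assembly plugin (the plugin version is an assumption; check the plugin's README for the right one):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.8.8")

// build.sbt additions
import AssemblyKeys._

assemblySettings

Because the scoobi dependency is marked provided, sbt assembly will bundle spire and the user code but leave Scoobi and its dependencies out of the fat JAR, which is exactly what the launcher script expects.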

REPL mode

With the availability of a scoobi script, its usage can be overloaded for launching into the Scoobi REPL:

> scoobi shell -cp spire.jar

Classpath management

One of the biggest pain points, which this proposal is aiming to fix, is that of classpath management. When using Hadoop we need to consider two classpaths, client and cluster, and they must be managed separately:

Client: If the hadoop script is used under the hood, we need to ensure that user classpaths are given precedence over the Hadoop ones. This is achieved by setting HADOOP_USER_CLASSPATH_FIRST to true.

Cluster: Firstly, Scoobi and its dependent JARs, as well as all user JARs, need to be available on the cluster. Scoobi will construct a mapred.jar on the fly, but all other JARs must be explicitly pushed to the distributed cache and placed on the cluster classpath, mapred.classpath. Secondly, we again need to ensure that user classpaths take precedence over Hadoop's; in the cluster setting, this is achieved by setting mapreduce.task.classpath.user.precedence to true.
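
To make the cluster half concrete, here is a minimal sketch of the bookkeeping involved, written against the stock CDH4-era Hadoop APIs (the object and helper names are hypothetical; the property name comes straight from the paragraph above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.filecache.DistributedCache
import org.apache.hadoop.fs.Path

// Hypothetical helper illustrating the two cluster-side steps described
// above: ship a JAR via the distributed cache, and give user classes
// precedence over Hadoop's on the task classpath.
object ClusterClasspath {
  def addUserJar(conf: Configuration, jarOnHdfs: Path) {
    DistributedCache.addFileToClassPath(jarOnHdfs, conf)
    conf.setBoolean("mapreduce.task.classpath.user.precedence", true)
  }
}

On the client side the equivalent step is just exporting HADOOP_USER_CLASSPATH_FIRST=true before invoking the hadoop script.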

blever commented 11 years ago

@etorreborre @espringe @raronson - hi guys, would love to get some feedback on this proposal. I think that providing this would really round out the 0.7 release.

espringe commented 11 years ago

On Sun, Apr 14, 2013 at 5:42 AM, Ben Lever notifications@github.com wrote:

Scoobi is almost at 0.7.0 so it's time for it to grow up and support users without requiring them to download and build the source from scratch. This is how I propose the experience of using Scoobi should be for someone who is not a Scoobi developer:

I definitely agree with this goal, but I don't get what makes building scoobi a requirement, right? Don't the snapshots and published jars do the trick?

User experience

In the simplest case, a Scoobi application can be developed with a minimal build.sbt (do I hear giter8 anyone), and a user JAR built using sbt package, and run via a scoobi launcher script:

name := "ScoobiSnax"

version := "1.0"

scalaVersion := "2.10.1"

libraryDependencies ++= Seq( "com.nicta" %% "scoobi" % "0.7.0-cdh4" % "provided"

^^^ Here be dragons. And I do understand why you want to use 'provided', but consider that you're going to have to handle overlapping dependencies anyway (e.g. scoobi uses shapeless, someone's app might also use shapeless). You might as well handle the "overlapping" scoobi dependency and not use provided. That way people can still run scoobi applications normally. Especially when you want to use scoobi inside eclipse or in another application -- most people don't have much experience with (and most importantly, don't want to deal with) all this jar/classpath stuff.

)

resolvers ++= Seq( "sonatype" at "http://oss.sonatype.org/content/repositories/releases/" )

scoobi jar ScoobiSnax.jar com.acme.DooMain in-files out-files

The scoobi script is akin to the hadoop script and is essentially responsible for ensuring all the correct JARs are available on both the client and cluster, and with the correct precedence. Apart from the core user JAR (e.g. ScoobiSnax.jar above), the scoobi script will locate other JARs in the following places:

  • $SCOOBI_HOME/lib: Note that the scoobi dependency is marked as provided above as it and all its dependencies are provided by the Scoobi install (see below). The lib directory contains the scoobi_2.10-0.7.0-cdh4.jar as well as all its dependencies, e.g. scala-library, scalaz-core, xstream, javassist, avro, etc. Note, however, it doesn't contain any CDH4 Hadoop JARs.

This feels like it's duplicating the job of sbt, and a pain in general. Despite disliking sbt, I actually like sbt, because it manages all this for me -- and allows me to easily switch versions of stuff, and use other projects, without having to learn something new, etc.

  • hadoop classpath: The classpath according to the hadoop script. This generally requires that HADOOP_HOME be set.

If it can be avoided, I think it should. I get the impression that scoobi's audience is more scala people than hadoop people. The less they need to know about hadoop (which at least I find confusing and painful to set up locally), the better. Right now, I haven't even installed hadoop (and thus obviously don't have $HADOOP_HOME set) and would have a strong preference for never having to.

To install Scoobi, a user will download and un-tar the scoobi tarball, e.g. scoobi-0.7.0-cdh4.tgz. This could be placed in /usr/lib/scoobi to allow access for many people, or simply in a location in the user's home directory. The SCOOBI_HOME environment variable should be set to point to this new directory. A Scoobi install would then result in a $SCOOBI_HOME looking like the following:

  • bin: contains scoobi script
  • lib: the Scoobi JAR plus all its dependent JARs, excluding Hadoop JARs
  • src: the source used to build the above Scoobi JAR - essentially a copy of the repository used to build the Scoobi JAR
  • probably a bunch of other files for completeness like README.md, LICENSE.txt, NOTICE.txt, etc

Yuck :( Well, it's not so bad. But I still don't understand the pros. Personally I'd much rather be using sbt to manage this than doing it myself. And doubly so as someone with my own "fork" of scoobi. Doing "publish-local" sounds a lot more pleasant than screwing around with this.

Most applications will of course be dependent on more than just Scoobi:

name := "ScoobiSnax2"

version := "1.0"

scalaVersion := "2.10.1"

libraryDependencies ++= Seq( "com.nicta" %% "scoobi" % "0.7.0-cdh4" % "provided", "org.spire-math" %% "spire" % "0.3.0" )

resolvers ++= Seq( "sonatype" at "http://oss.sonatype.org/content/repositories/releases/" )

In this case, a user could opt to use something like sbt-assembly to construct a fat JAR containing their user code as well as spire. With this approach, it's business as usual. Alternatively, if the user sticks with sbt package, the spire JAR can be located using the following mechanisms:

  • Setting SCOOBI_CLASSPATH to include the spire JARs (and any other JARs for that matter);
  • Using the -cp option with the launcher script: scoobi -cp spire.jar

This is sounding more and more painful. Virtually every application is going to have external dependencies. Even the stupid program I wrote for google codejam has 4 external dependencies. And now managing these classpaths and stuff has to be done by each and every user of the application, on each and every machine they use the application from.

REPL mode

With the availability of a scoobi script, its usage can be overloaded for launching into the Scoobi REPL:

scoobi shell -cp spire.jar

Just in general, I dislike the idea of managing a scoobi program/script and setting environment variables manually. I wouldn't be opposed to adding a scoobi plugin to ~/.sbt/plugins though. And I wouldn't even mind going "sbt console" and then "import ScoobiRepl._" for a bunch of repl-util like functions.

Classpath management

One of the biggest pain points, which this proposal is aiming to fix, is that of classpath management.

I might be missing something -- as the solution seems to be: "manage it explicitly". Which is fine, and I do think scoobi needs to support that better (since some of the stuff like upload jars is awesome, but also a bit too magical and error-prone).

But I don't think we should sacrifice so much convenience to get there. I'd think one should be able to check out a scoobi application, and go: "sbt run" and after the 10 minutes of w/e ridiculous time sbt takes, it'll have downloaded everything and can run everything. I really don't want to have to install/configure hadoop or scoobi.

blever commented 11 years ago

Hi Eric - thanks for your comments. I've put my responses inline. The summary is that this proposal wouldn't prohibit you from continuing with your current workflow, just that it would simplify developing and running Scoobi apps for others.

Scoobi is almost at 0.7.0 so it's time for it to grow up and support users without requiring them to download and build the source from scratch. This is how I propose the experience of using Scoobi should be for someone who is not a Scoobi developer:

I definitely agree with this goal, but I don't get what makes building scoobi a requirement, right? Don't the snapshots and published jars do the trick?

Fair point. It's not really related to building from source - as you say, the published JARs get around that problem. This proposal is really focused on simplifying the process of running a Scoobi app for the common user.

User experience

In the simplest case, a Scoobi application can be developed with a minimal build.sbt (do I hear giter8 anyone), and a user JAR built using sbt package, and run via a scoobi launcher script:

name := "ScoobiSnax"

version := "1.0"

scalaVersion := "2.10.1"

libraryDependencies ++= Seq( "com.nicta" %% "scoobi" % "0.7.0-cdh4" % "provided"

^^^ Here be dragons. And I do understand why you want to use 'provided', but consider that you're going to have to handle overlapping dependencies anyway (e.g. scoobi uses shapeless, someone's app might also use shapeless). You might as well handle the "overlapping" scoobi dependency and not use provided. That way people can still run scoobi applications normally. Especially when you want to use scoobi inside eclipse or in another application -- most people don't have much experience with (and most importantly, don't want to deal with) all this jar/classpath stuff.

So this is not intended to be the only way to use Scoobi. There is nothing stopping you from not using "provided", but it does mean you have to manually manage your dependencies.

resolvers ++= Seq( "sonatype" at "http://oss.sonatype.org/content/repositories/releases/" )

scoobi jar ScoobiSnax.jar com.acme.DooMain in-files out-files

The scoobi script is akin to the hadoop script and is essentially responsible for ensuring all the correct JARs are available on both the client and cluster, and with the correct precedence. Apart from the core user JAR (e.g. ScoobiSnax.jar above), the scoobi script will locate other JARs in the following places:

  • $SCOOBI_HOME/lib: Note that the scoobi dependency is marked as provided above as it and all its dependencies are provided by the Scoobi install (see below). The lib directory contains the scoobi_2.10-0.7.0-cdh4.jar as well as all its dependencies, e.g. scala-library, scalaz-core, xstream, javassist, avro, etc. Note, however, it doesn't contain any CDH4 Hadoop JARs.

This feels like it's duplicating the job of sbt, and a pain in general. Despite disliking sbt, I actually like sbt, because it manages all this for me -- and allows me to easily switch versions of stuff, and use other projects, without having to learn something new, etc.

As above, this proposal wouldn't prohibit you from doing this.

  • hadoop classpath: The classpath according to the hadoop script. This generally requires that HADOOP_HOME be set.

If it can be avoided, I think it should. I get the impression that scoobi's audience is more scala people than hadoop people. The less they need to know about hadoop (which at least I find confusing and painful to set up locally), the better. Right now, I haven't even installed hadoop (and thus obviously don't have $HADOOP_HOME set) and would have a strong preference for never having to.

This is probably the root of the discussion. For users that do have a Hadoop install, and possibly other tools such as Hive and Pig installed also, this proposal would allow them to leverage Scoobi in a similar manner. And it's in these environments, for example when the hadoop script is used to launch a Scoobi app, that giving a user a seamless experience would be a win.

To install Scoobi, a user will download and un-tar the scoobi tarball, e.g. scoobi-0.7.0-cdh4.tgz. This could be placed in /usr/lib/scoobi to allow access for many people, or simply in a location in the user's home directory. The SCOOBI_HOME environment variable should be set to point to this new directory. A Scoobi install would then result in a $SCOOBI_HOME looking like the following:

  • bin: contains scoobi script
  • lib: the Scoobi JAR plus all its dependent JARs, excluding Hadoop JARs
  • src: the source used to build the above Scoobi JAR - essentially a copy of the repository used to build the Scoobi JAR
  • probably a bunch of other files for completeness like README.md, LICENSE.txt, NOTICE.txt, etc

Yuck :( Well, it's not so bad. But I still don't understand the pros. Personally I'd much rather be using sbt to manage this than doing it myself. And doubly so as someone with my own "fork" of scoobi. Doing "publish-local" sounds a lot more pleasant than screwing around with this.

Managing it yourself isn't that trivial. It's really important that you have the right set and version of dependencies, and that you're not clashing with other versions of the same JARs on the client and/or cluster classpaths. To me, this is the main pain point this proposal attempts to alleviate. Also, it's not really targeted at users that maintain their own "fork".

Most applications will of course be dependent on more than just Scoobi:

name := "ScoobiSnax2"

version := "1.0"

scalaVersion := "2.10.1"

libraryDependencies ++= Seq( "com.nicta" %% "scoobi" % "0.7.0-cdh4" % "provided", "org.spire-math" %% "spire" % "0.3.0" )

resolvers ++= Seq( "sonatype" at "http://oss.sonatype.org/content/repositories/releases/" )

In this case, a user could opt to use something like sbt-assembly to construct a fat JAR containing their user code as well as spire. With this approach, it's business as usual. Alternatively, if the user sticks with sbt package, the spire JAR can be located using the following mechanisms:

  • Setting SCOOBI_CLASSPATH to include the spire JARs (and any other JARs for that matter);
  • Using the -cp option with the launcher script: scoobi -cp spire.jar

This is sounding more and more painful. Virtually every application is going to have external dependencies. Even the stupid program I wrote for google codejam has 4 external dependencies. And now managing these classpaths and stuff has to be done by each and every user of the application, on each and every machine they use the application from.

I actually don't see why you think this is painful. Agreed, the common case is having external dependencies :) but if you go down the sbt-assembly route, sbt will manage everything for you anyway. The other option is basically an escape hatch so you don't have to create fat JARs, which can be useful in some cases.

REPL mode

With the availability of a scoobi script, its usage can be overloaded for launching into the Scoobi REPL:

scoobi shell -cp spire.jar

Just in general, I dislike the idea of managing a scoobi program/script and setting environment variables manually. I wouldn't be opposed to adding a scoobi plugin to ~/.sbt/plugins though. And I wouldn't even mind going "sbt console" and then "import ScoobiRepl._" for a bunch of repl-util like functions.

So, sbt console should work like this already ... without a plugin ... I think.
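
For example, something along these lines should work with nothing more than the project's build.sbt (a hypothetical session; it assumes scoobi is on the console classpath, i.e. not marked provided):

> sbt console
scala> import com.nicta.scoobi.Scoobi._

From there the normal Scoobi API is in scope, which covers much of what a dedicated REPL would give you.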

Classpath management

One of the biggest pain points, which this proposal is aiming to fix, is that of classpath management.

I might be missing something -- as the solution seems to be: "manage it explicitly". Which is fine, and I do think scoobi needs to support that better (since some of the stuff like upload jars is awesome, but also a bit too magical and error-prone).

But I don't think we should sacrifice so much convenience to get there. I'd think one should be able to check out a scoobi application, and go: "sbt run" and after the 10 minutes of w/e ridiculous time sbt takes, it'll have downloaded everything and can run everything. I really don't want to have to install/configure hadoop or scoobi.

The problem is ... Hadoop - funny that :) Seriously though, because Hadoop has a bunch of JARs installed on the cluster, and possibly on the client, I find it starts to get really messy - e.g. overlaps, etc. So, yes, you can manage it entirely with sbt, but we've run into a lot of problems with this, particularly with people who aren't great at sbt and dependency management but are capable of writing Scala/Scoobi. I'd like to make life easier for them so they can just run their app. It would require an initial Scoobi install, but I figure they've already downloaded and installed Hadoop anyway.

espringe commented 11 years ago

On Sun, Apr 14, 2013 at 9:14 PM, Ben Lever notifications@github.com wrote:

Hi Eric - thanks for your comments. I've put my responses inline. The summary is that this proposal wouldn't prohibit you from continuing with your current workflow, just that it would simplify developing and running Scoobi apps for others.

Fair enough. I've got no objections to anything, and simplifying the deployment would definitely be welcome. I just really like how "sbt run" now gets you going so quickly (locally), and I think that's something we should keep. And then, when you want to run on the cluster, yeah, no matter what you do, it's going to be a pain :D

Managing it yourself isn't that trivial. It's really important that you have the right set and version of dependencies, and that you're not clashing with other versions of the same JARs on the client and/or cluster classpaths. To me, this is the main pain point this proposal attempts to alleviate. Also, it's not really targeted at users that maintain their own "fork".

Yeah, but unless I misread your proposal you're suggesting that users manually manage their dependencies (uploading them, setting paths, etc.) which itself seems kind of painful too -- especially as sbt already knows this information, so it'd be nice to leverage that :D

I actually don't see why you think this is painful.

Well, say I wanted to use scoobi in a production environment -- I'd have to provide automation for the building, deploying and running. This itself is a huge task, if you consider it might involve learning complex and undocumented systems and porting scoobi and its transitive deps to it (for the build) ;D. So the more my scoobi application resembles a stock-standard library that I can invoke from some java shim, the easier it will be. If I had to figure out how to do deployments with binaries / jar paths / etc., I'd probably cry. (But as you said, the new method won't be required -- so I've got no complaint.)

Agree the common case is actually having external dependencies :) but if you go down the sbt-assembly route sbt will manage everything for you anyway.

Yeah, I really like sbt-assembly when it all works. I kind of just wish they supported our use case a lot better.

blever commented 11 years ago

Yeah, but unless I misread your proposal you're suggesting that users manually manage their dependencies (uploading them, setting paths, etc.) which itself seems kind of painful too -- especially as sbt already knows this information, so it'd be nice to leverage that :D

So, there are 3 sets of dependencies:

  1. Hadoop and its dependencies;
  2. Scoobi and its dependencies;
  3. User code dependencies.

I'm optimising for the case where someone else is already managing Hadoop and Scoobi dependencies, e.g. Cloudera and scoobi committers, respectively, and they are provided somewhere. In this case, as a user, when I go to run my Scoobi app I only specify the location of my code's direct dependencies, or just use sbt-assembly.

Agree that this means going down the provided path for scoobi, which means sbt run doesn't work as it won't have all the dependencies. I'm sure, however, that a build.sbt could be rewritten such that sbt-assembly and sbt package see scoobi as provided, but sbt run doesn't (see the sketch below).
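
Something like the standard sbt trick should do it (a sketch in sbt 0.12 syntax, untested here): point run at the Compile classpath, which, unlike the Runtime classpath that sbt run uses by default, still contains "provided" dependencies:

// build.sbt: make sbt run see "provided" dependencies again
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

With that single line, sbt package and sbt-assembly continue to exclude scoobi, while sbt run sees it again.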

blever commented 11 years ago

For reference, #222 was tackling a similar problem and opted for a somewhat similar solution.