Cascading / vagrant-cascading-hadoop-cluster

Deploying Apache Hadoop in a virtualized cluster as easy as 1-2-3.

add Spark to the cluster #18

Open gregbaker opened 9 years ago

gregbaker commented 9 years ago

This adds Spark 1.4.0 to the cluster setup. I have tested it a little: Spark jobs can access HDFS files (as hdfs://master.local:9000/home/vagrant/...), and jobs can be sent out to the cluster with a command like this:

spark-submit --master yarn-cluster ...
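A fuller invocation might look like the sketch below; the application jar, main class, and input path are illustrative placeholders (only the `yarn-cluster` master and the HDFS URL come from this thread, and `yarn-cluster` is the Spark 1.x syntax):

```shell
# Submit a Spark application to the YARN cluster set up by the Vagrant boxes.
# --master yarn-cluster runs the driver inside the cluster (Spark 1.x syntax).
# Jar name, class, and input path are placeholders, not from this repo.
spark-submit \
  --master yarn-cluster \
  --class example.WordCount \
  --num-executors 2 \
  --executor-memory 512m \
  wordcount.jar \
  hdfs://master.local:9000/home/vagrant/input.txt
```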

The download required during provisioning is about 240 MB; I don't know if that's enough to justify leaving the Spark manifest commented out in manifests/master-single.pp by default.
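If the manifest stays opt-in, the relevant part of manifests/master-single.pp might look roughly like this; the class name `spark` is an assumption about the module layout, not verified against the repo:

```puppet
# Spark provisioning is commented out by default because it pulls an
# extra ~240 MB download during "vagrant up"; uncomment to enable it.
# The class name "spark" is assumed, not taken from the actual manifest.

# include spark
```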

I haven't updated the README; again, I'm not sure whether it's worth advertising the feature there.

gregbaker commented 9 years ago

I have continued to add changes to my fork: I fiddled with the HDFS replication (so files aren't stored on every node, which is more realistic) and updated the versions of the tools (to Hadoop 2.7.1 and other current versions). Feel free to cherry-pick as necessary if these aren't considered relevant to this project's goals.
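The replication tweak presumably goes through Hadoop's standard `dfs.replication` setting in hdfs-site.xml; a sketch of that property (the value 2 is illustrative, since the fork's actual value isn't stated here):

```xml
<!-- hdfs-site.xml: dfs.replication sets how many DataNodes store each block.
     A value below the node count means files are NOT on every node,
     which is the more realistic behaviour described above. -->
<property>
  <name>dfs.replication</name>
  <value>2</value> <!-- illustrative; not the fork's confirmed value -->
</property>
```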

tristanreid commented 9 years ago

Looks cool! I may fork off of this to add parquet-tools (https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).
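For context, parquet-tools is usually run from its jar; a hedged usage sketch, where the jar filename and the HDFS path are placeholders:

```shell
# Inspect a Parquet file with parquet-tools (jar name and path are
# illustrative placeholders; "schema" and "head" are real subcommands).
hadoop jar parquet-tools.jar schema hdfs://master.local:9000/home/vagrant/data.parquet
hadoop jar parquet-tools.jar head hdfs://master.local:9000/home/vagrant/data.parquet
```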

tristanreid commented 9 years ago

Greg, this is really great! One thing: HBase has moved from 1.1.1 to 1.1.2. The build only works for me if I make that change in modules/hbase/manifests/init.pp and modules/phoenix/manifests/init.pp.
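That bump would look something like this in modules/hbase/manifests/init.pp; the `$version` variable name is an assumption about how the manifest is written, not verified against the repo:

```puppet
# modules/hbase/manifests/init.pp -- point at an HBase release that is
# still downloadable from the Apache mirrors.
# The variable name $version is assumed, not taken from the actual file.
class hbase {
  $version = "1.1.2"  # was "1.1.1", which has dropped off the mirrors
}
```

The same version string would need updating in modules/phoenix/manifests/init.pp, since Phoenix's manifest references the HBase release it builds against.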