
Big Data Analytics Stack

Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/YARN.

The play-hadoop.yml playbook deploys the base system. Addons, such as Pig and Spark, are deployed using the playbooks in the addons directory.
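Concretely, once an inventory has been generated (see Usage below), the deployment boils down to running the playbooks, for example:

    $ ansible-playbook play-hadoop.yml
    $ ansible-playbook addons/pig.yml addons/spark.yml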

Stack

Legend:

- Analytics Layer
- Data Processing Layer
- Database Layer
- Scheduling
- Storage
- Monitoring

Requirements

Quickstart

Usage

  1. Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on india, Ansible may be unable to access the nodes and complain with something like:

    master0 | UNREACHABLE! => {
       "changed": false,
       "msg": "ssh cc@129.114.110.126:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
       "unreachable": true
    }

    To start the agent:

    badi@i136 ~$ eval $(ssh-agent)
    badi@i136 ~$ ssh-add
  2. Make sure your public key is added to github.com. IMPORTANT: check the fingerprint with ssh-keygen -lf ~/.ssh/id_rsa and make sure it is in your list of keys!

  3. Download this repository using git clone --recursive. IMPORTANT: make sure you specify the --recursive option; otherwise you will get errors.

      git clone --recursive https://github.com/futuresystems/big-data-stack.git
  4. Install the requirements using pip install -r requirements.txt

  5. Launch a virtual cluster and obtain the SSH-able IP addresses of its nodes (one possible approach is sketched after this list).

  6. Generate the inventory and variable files using ./mk-inventory. For example:

    ./mk-inventory -n $USER-mycluster 192.168.10{1,2,3,4} >inventory.txt

    This defines the inventory for a four-node cluster whose nodes are named $USER-myclusterN (with N from 0 to 3).

  7. Make sure that ansible.cfg reflects your environment. Look especially at remote_user if you are not using Ubuntu (an illustrative fragment follows this list). You can alternatively override the user by passing -u $NODE_USERNAME to the ansible commands.

  8. Ensure ssh_config is to your liking (an illustrative entry follows this list).

  9. Run ansible all -m ping to make sure all nodes can be managed.

  10. Run ansible-playbook play-hadoop.yml to install the base system.

  11. Run ansible-playbook addons/{pig,spark}.yml # etc to install the Pig and Spark addons.

  12. Log into the frontend node (see the [frontends] group in the inventory) and use the hadoop user (sudo su - hadoop) to run jobs on the cluster; a quick smoke test is sketched below.
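For step 5, any provisioning method works as long as it yields SSH-reachable nodes. A minimal sketch assuming an OpenStack cloud (the image, flavor, and key pair names are placeholders):

    # placeholders: adjust image, flavor, key pair, and node names to your cloud
    $ openstack server create --image ubuntu-14.04 --flavor m1.medium \
          --key-name mykey $USER-mycluster0
    $ openstack server list    # note the IP addresses to feed to mk-inventory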
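For step 7, remote_user lives in the [defaults] section of ansible.cfg; an illustrative fragment (the user name is only an example):

    # example only: set this to the login user of your node images
    [defaults]
    remote_user = ubuntu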
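For step 8, an illustrative ssh_config entry (the host pattern and user are placeholders):

    # example only: relax host-key checking for the cluster's address range
    Host 192.168.10.*
        User ubuntu
        StrictHostKeyChecking no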
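As a quick smoke test for step 12, you can list HDFS and submit one of the stock MapReduce examples; the jar path below follows the usual Hadoop 2.x layout and may differ on your cluster:

    $ ssh <frontend-address>        # an address from the [frontends] group
    $ sudo su - hadoop
    $ hdfs dfs -ls /
    $ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100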

Sidenote: you may want to pass the -f <N> flag to ansible-playbook to use N parallel connections. This will make the deployment go faster. For example:

$ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...

The hadoop user is present on all the nodes and is the Hadoop administrator. If you need to change anything on HDFS, it must be done as the hadoop user.
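For example, creating an HDFS home directory for a new (hypothetical) user:

    $ sudo su - hadoop
    $ hdfs dfs -mkdir -p /user/alice     # "alice" is a placeholder user name
    $ hdfs dfs -chown alice /user/alice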

Upgrading

Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:

$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt

Examples

See the examples directory.

License

Please see the LICENSE file in the root directory of the repository.

Contributing

  1. Fork the repository
  2. Add yourself to the CONTRIBUTORS.yml file
  3. Submit a pull request to the unstable branch