Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/YARN. The `play-hadoop.yml` playbook deploys the base system. Addons, such as Pig, Spark, etc., are deployed using the playbooks in the `addons` directory.
Legend: commands prefixed with `$` are run in your shell; `(venv) $` indicates the virtualenv is activated.
Clone this repository (you must have a GitHub account and have uploaded your SSH key):

```
$ git clone --recursive git://github.com/futuresystems/big-data-stack.git
$ cd big-data-stack
```
Create a virtualenv:

```
$ virtualenv venv && source venv/bin/activate
```
Install the dependencies:

```
(venv) $ pip install -r requirements.txt
```
Generate the inventory file:

```
(venv) $ python mk-inventory -n bds- 10.0.0.10 10.0.0.11 > inventory.txt
```
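The generated `inventory.txt` is a standard Ansible INI inventory. As a rough sketch only (the `[frontends]` group is referenced later in this document, but the other group names and the exact layout here are illustrative assumptions, not the verbatim output of `mk-inventory`), it looks something like:

```
[frontends]
bds-0 ansible_ssh_host=10.0.0.10

[datanodes]
bds-1 ansible_ssh_host=10.0.0.11
```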
Sanity check:

```
(venv) $ ansible all -m ping
```

If this fails, ensure that the nodes are SSH-accessible and that the user is correct in `ansible.cfg` (alternatively, override it using the `-u $REMOTE_USERNAME` flag). You can pass `-v` to increase verbosity (repeat it for more detail, e.g. `-vvvv`).
Deploy:

```
(venv) $ ansible-playbook play-hadoop.yml addons/spark.yml # ... etc
```
Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on `india`, Ansible may be unable to access the nodes and complain with something like:

```
master0 | UNREACHABLE! => {
    "changed": false,
    "msg": "ssh cc@129.114.110.126:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
    "unreachable": true
}
```

To start the agent:

```
badi@i136 ~$ eval $(ssh-agent)
badi@i136 ~$ ssh-add
```
Make sure your public key is added to github.com. IMPORTANT: check the fingerprint with `ssh-keygen -lf ~/.ssh/id_rsa` and make sure it is in your list of keys!

Download this repository using `git clone --recursive`. IMPORTANT: make sure you specify the `--recursive` option, otherwise you will get errors.

```
git clone --recursive https://github.com/futuresystems/big-data-stack.git
```
Install the requirements using `pip install -r requirements.txt`.
Launch a virtual cluster and obtain the SSH-able IP addresses
Generate the inventory and variable files using `./mk-inventory`. For example:

```
./mk-inventory -n $USER-mycluster 192.168.10{1,2,3,4} > inventory.txt
```

will define the inventory for a four-node cluster whose nodes are named `$USER-myclusterN` (with `N` from 0 to 3).
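Note that `192.168.10{1,2,3,4}` relies on shell brace expansion (a bash feature, not plain POSIX `sh`): the single token is expanded into four separate addresses before `mk-inventory` ever runs. A quick sketch of what the shell actually passes:

```shell
# Brace expansion happens before the command is invoked, so
# mk-inventory receives four distinct IP address arguments.
printf '%s\n' 192.168.10{1,2,3,4}
# 192.168.101
# 192.168.102
# 192.168.103
# 192.168.104
```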
Make sure that `ansible.cfg` reflects your environment. Look especially at `remote_user` if you are not using Ubuntu. You can alternatively override the user by passing `-u $NODE_USERNAME` to the ansible commands.
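For reference, a minimal `ansible.cfg` sketch; the values shown here are assumptions to adapt, not the file shipped with the repository:

```ini
[defaults]
inventory = inventory.txt
remote_user = ubuntu

[ssh_connection]
ssh_args = -F ssh_config
```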
Ensure `ssh_config` is to your liking.
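If you need a starting point, here is a minimal `ssh_config` sketch for a disposable test cluster; these options are common choices for such setups, not the repository's shipped file, and disabling host-key checking is only reasonable on throwaway nodes:

```
Host 192.168.10.*
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    ForwardAgent yes
```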
Run `ansible all -m ping` to make sure all nodes can be managed.
Run `ansible-playbook play-hadoop.yml` to install the base system.
Run `ansible-playbook addons/{pig,spark}.yml # etc` to install the Pig and Spark addons.
Log into the frontend node (see the `[frontends]` group in the inventory) and use the `hadoop` user (`sudo su - hadoop`) to run jobs on the cluster.
Sidenote: you may want to pass the `-f <N>` flag to `ansible-playbook` to use `N` parallel connections, which will make the deployment go faster. For example:

```
$ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...
```
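The subshell above counts the unique hostnames in the inventory: `egrep '^[a-zA-Z]'` keeps only host lines (group headers start with `[`, blank lines match nothing), and `sort | uniq | wc -l` counts each host once even when it appears in several groups. A self-contained sketch with a made-up two-node inventory:

```shell
# Build a throwaway inventory where one host belongs to two groups.
cat > /tmp/demo-inventory.txt <<'EOF'
[frontends]
bds-0

[datanodes]
bds-0
bds-1
EOF

# Host lines start with a letter; '[frontends]' and '[datanodes]' do not,
# so only bds-0 and bds-1 survive, and uniq collapses the duplicate.
egrep '^[a-zA-Z]' /tmp/demo-inventory.txt | sort | uniq | wc -l   # prints 2
```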
The `hadoop` user is present on all the nodes and is the Hadoop administrator. If you need to change anything on HDFS, it must be done as `hadoop`.
Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:

```
$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt
```
See the `examples` directory:

- `nist_fingerprint`: fingerprint analysis using Spark, with results pushed to HBase.

Please see the LICENSE file in the root directory of the repository.
Contributors are listed in the `CONTRIBUTORS.yml` file. Development happens on the `unstable` branch.