
Big Data Analytics Stack

Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/YARN.

The play-hadoop.yml playbook deploys the base system. Addons, such as Pig and Spark, are deployed using the playbooks in the addons directory.
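Concretely, once an inventory has been generated (see Usage below), the deployment boils down to running the playbooks, for example:

    $ ansible-playbook play-hadoop.yml
    $ ansible-playbook addons/pig.yml addons/spark.yml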

Stack

Legend:

- Analytics Layer
- Data Processing Layer
- Database Layer
- Scheduling
- Storage
- Monitoring

Requirements

Quickstart

Usage

  1. Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on india, Ansible may be unable to access the nodes and complain with something like:

    master0 | UNREACHABLE! => {
       "changed": false,
       "msg": "ssh cc@129.114.110.126:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
       "unreachable": true
    }

    To start the agent:

    badi@i136 ~$ eval $(ssh-agent)
    badi@i136 ~$ ssh-add
  2. Make sure your public key is added to github.com. IMPORTANT: check the fingerprint with ssh-keygen -lf ~/.ssh/id_rsa and make sure it is in your list of keys!

  3. Download this repository using git clone --recursive. IMPORTANT: make sure you specify the --recursive option; otherwise you will get errors.

      git clone --recursive https://github.com/futuresystems/big-data-stack.git
  4. Install the requirements using pip install -r requirements.txt

  5. Launch a virtual cluster and obtain the SSH-able IP addresses of its nodes (one possible approach is sketched after this list).

  6. Generate the inventory and variable files using ./mk-inventory. For example:

    ./mk-inventory -n $USER-mycluster 192.168.10{1,2,3,4} >inventory.txt

    This defines the inventory for a four-node cluster whose nodes are named $USER-myclusterN (with N from 0 to 3).

  7. Make sure that ansible.cfg reflects your environment. Look especially at remote_user if you are not using Ubuntu (an illustrative fragment follows this list). You can alternatively override the user by passing -u $NODE_USERNAME to the ansible commands.

  8. Ensure ssh_config is to your liking (an illustrative entry follows this list).

  9. Run ansible all -m ping to make sure all nodes can be managed.

  10. Run ansible-playbook play-hadoop.yml to install the base system.

  11. Run ansible-playbook addons/{pig,spark}.yml # etc to install the Pig and Spark addons.

  12. Log into the frontend node (see the [frontends] group in the inventory) and use the hadoop user (sudo su - hadoop) to run jobs on the cluster; a quick smoke test is sketched below.
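For step 5, any provisioning method works as long as it yields SSH-reachable nodes. A minimal sketch assuming an OpenStack cloud (the image, flavor, and key pair names are placeholders):

    # placeholders: adjust image, flavor, key pair, and node names to your cloud
    $ openstack server create --image ubuntu-14.04 --flavor m1.medium \
          --key-name mykey $USER-mycluster0
    $ openstack server list    # note the IP addresses to feed to mk-inventory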
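For step 7, remote_user lives in the [defaults] section of ansible.cfg; an illustrative fragment (the user name is only an example):

    # example only: set this to the login user of your node images
    [defaults]
    remote_user = ubuntu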
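For step 8, an illustrative ssh_config entry (the host pattern and user are placeholders):

    # example only: relax host-key checking for the cluster's address range
    Host 192.168.10.*
        User ubuntu
        StrictHostKeyChecking no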
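As a quick smoke test for step 12, you can list HDFS and submit one of the stock MapReduce examples; the jar path below follows the usual Hadoop 2.x layout and may differ on your cluster:

    $ ssh <frontend-address>        # an address from the [frontends] group
    $ sudo su - hadoop
    $ hdfs dfs -ls /
    $ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100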

Sidenote: you may want to pass the -f <N> flag to ansible-playbook to use N parallel connections. This will make the deployment go faster. For example:

$ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...

The hadoop user is present on all the nodes and is the Hadoop administrator. If you need to change anything on HDFS, it must be done as the hadoop user.
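For example, creating an HDFS home directory for a new (hypothetical) user:

    $ sudo su - hadoop
    $ hdfs dfs -mkdir -p /user/alice     # "alice" is a placeholder user name
    $ hdfs dfs -chown alice /user/alice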

Upgrading

Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:

$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt

Examples

See the examples directory.

License

Please see the LICENSE file in the root directory of the repository.

Contributing

  1. Fork the repository
  2. Add yourself to the CONTRIBUTORS.yml file
  3. Submit a pull request to the unstable branch