
Using pure shell scripts to configure a Hadoop 2.7.1 environment on a CentOS 7.1 cluster

HadoopInitializer

This collection of scripts automates the installation and configuration of Hadoop 2.7 on a cluster of interconnected hosts. It requires only minimal manual input, dramatically reducing the effort of configuring Hadoop across a set of hosts.

During the whole installation and configuration process, all the user needs to do is provide the IP addresses of the hosts, enter the password for each host when prompted, and customize the Hadoop XML configuration files with the help of hints from the program. That's all; configuring a Hadoop cluster is now quite simple and convenient.

Features

Follow-up

Some auxiliary features have been implemented to make this program more readable, robust, and maintainable.

Usage

  1. git clone https://github.com/Hearen/HadoopInitializer.git
  2. cd HadoopInitializer
  3. cd etc
  4. insert the IP addresses of the hosts into ip_addresses, one per line (see the example files after this list)
  5. cd ../hadoop
  6. configure the XML files (core-site.xml, hdfs-site.xml, mapred-site.xml, master and slaves) as you would for any Hadoop cluster; for master you only need to enter hadoop-master, while slaves should list all the slaves' new hostnames from hadoop-slave1 and hadoop-slave2 up to hadoop-slaveX, where X is the number of slaves in the cluster;
  7. cd ../tools
  8. ./clear_walls.sh # shut down and disable the firewall and SELinux on all hosts; the hosts will reboot, so just be patient and have a bottle of iced beer
  9. su # after rebooting, log in to the master and gain root privilege
  10. ./install.sh

and then just follow the program, good luck!
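
For illustration, a small cluster with one master and two slaves might use files like these (the IP addresses below are placeholders; replace them with your own):

    # etc/ip_addresses -- all the hosts' IP addresses, one per line
    192.168.1.10
    192.168.1.11
    192.168.1.12

    # hadoop/master
    hadoop-master

    # hadoop/slaves -- the new hostname of every slave
    hadoop-slave1
    hadoop-slave2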

Support

This part covers the assistant scripts enclosed in tools, basic Hadoop commands, cgroup configuration and usage, stress (used to control CPU utilization), some useful Linux commands, and the widely used benchmarks.

tools

Lots of issues might occur during the configuration, so some convenient tools are provided that may be helpful when problems are encountered.

hadoop

Some frequently used commands for Hadoop cluster management; for more detailed information you may need to check the official site.
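
For instance (standard Hadoop 2.x commands; a sketch, not an exhaustive list):

    start-dfs.sh && start-yarn.sh      # start the HDFS and YARN daemons
    hdfs dfsadmin -report              # check the status of every datanode
    hdfs dfs -ls /                     # list the HDFS root directory
    hdfs dfs -put local.txt /input/    # copy a local file into HDFS
    yarn application -list             # list the running YARN applications
    stop-yarn.sh && stop-dfs.sh        # stop the cluster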

cgroup

There is a cgroup_configurer.sh in the tools directory that can help a little with cgroup configuration; the details are as follows:

Edit the /etc/cgconfig.conf as follows

group hadoop
{
    cpu {
        cpu.shares = 400;
    }
    memory {
        memory.limit_in_bytes = 1024m;
    }
    blkio {
        blkio.throttle.read_bps_device = "8:0 209715";
    }
}

As for the major and minor numbers of the block device, we can use ls -l /dev/ to retrieve them easily.
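
For example (the device and its numbers below are only illustrative; your disks may differ):

    $ ls -l /dev/sda
    brw-rw----. 1 root disk 8, 0 Jan  1 09:00 /dev/sda
    # the "8, 0" columns are the major and minor numbers, written as "8:0" in cgconfig.conf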

Then edit /etc/cgrules.conf to apply the rules defined in /etc/cgconfig.conf as follows:

hadoop blkio,cpu,memory hadoop/ 

Now the hadoop user will be limited in blkio, cpu and memory as defined in /etc/cgconfig.conf; finally, we need to restart cgconfig and cgred to make the rules take effect immediately. There is a good reference for cgroup.
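
A minimal sketch of applying the changes on CentOS 7 (assuming the libcgroup and libcgroup-tools packages are installed, which provide both services):

    systemctl restart cgconfig                   # reload the groups defined in /etc/cgconfig.conf
    systemctl restart cgred                      # reload the rules in /etc/cgrules.conf
    cat /sys/fs/cgroup/cpu/hadoop/cpu.shares     # verify the limit; should print 400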

stress

stress is used to take over CPU resources of the machine, cooperating with cgroup to control the CPU available to a user. If you intend to limit the CPU with cgroup alone, you will sadly fail, since cgroup only takes cpu.shares into account, which means that if no other processes are consuming the CPU, all of it will still be used by the current user. Here is a good post to clarify this kind of issue.

To install stress, you have to configure the epel repository first with yum install epel-release. As for epel-release, you may want to check this post for further understanding.

Three most frequently used commands of stress:
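
For instance (a sketch using standard stress options; the worker counts and durations are illustrative):

    stress --cpu 4 --timeout 300                  # 4 CPU-bound workers for 300 seconds
    stress --vm 2 --vm-bytes 256M --timeout 60    # 2 workers, each repeatedly allocating and freeing 256 MB
    stress --io 2 --hdd 1 --timeout 60            # 2 sync() workers plus 1 worker writing to disk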

If you encounter problems when installing stress:

  1. install epel-release first, refresh the repo list with yum repolist, and then yum install stress;
  2. or install it manually: wget http://apt.sw.be/redhat/el7/en/x86_64/rpmforge/RPMS/stress-1.0.2-1.el7.rf.x86_64.rpm and then rpm -ivh stress-1.0.2-1.el7.rf.x86_64.rpm

Some useful commands

Benchmarks

There are lots of built-in benchmarks in Hadoop, which we can list with hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar or hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar, but the most popular ones are: the CPU-intensive pi, the I/O-intensive TestDFSIO, and the integrated and most popular one, terasort.

pi - CPU intensive type
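
The pi example estimates pi with a quasi-Monte Carlo method; for example (the map count and sample count are illustrative):

    hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 16 100000
    # 16 map tasks, each computing 100000 samples; the job prints an estimate of pi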

TestDFSIO - I/O intensive type

For example: hadoop jar /home/hadoop/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar TestDFSIO -read -nrFiles 64 -fileSize 16GB -resFile /tmp/TestDFSIOwrite.txt. For additional details about this benchmark, you may want to check the official documentation;
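
A typical write-then-read sequence looks like this (the file count and size are illustrative; the write run must come first so the test files exist):

    JAR=/home/hadoop/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar
    hadoop jar $JAR TestDFSIO -write -nrFiles 16 -fileSize 1GB -resFile /tmp/TestDFSIOwrite.txt
    hadoop jar $JAR TestDFSIO -read -nrFiles 16 -fileSize 1GB -resFile /tmp/TestDFSIOread.txt
    hadoop jar $JAR TestDFSIO -clean    # remove the generated test data from HDFS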

terasort
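
A typical terasort run has three steps: generate the input with teragen, sort it with terasort, and check the result with teravalidate (the row count and HDFS paths are illustrative):

    JAR=/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar
    hadoop jar $JAR teragen 10000000 /terasort/input            # 10 million 100-byte rows, about 1 GB
    hadoop jar $JAR terasort /terasort/input /terasort/output
    hadoop jar $JAR teravalidate /terasort/output /terasort/validate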

Typical issues

There are some issues that can be tricky for Hadoop newbies; I met them myself and solved them with the following steps. I hope they ease some of the labor of your searching.

  1. inconsistent clusterIDs between the namenode and datanodes: stop the cluster first, delete all the tmp directories on all hosts (master and slaves), then format HDFS, start the cluster again and check it with hdfs dfsadmin -report (see the sketch after this list);
  2. check passwordless SSH login among the hosts and ensure the firewall and SELinux are shut down and disabled; otherwise some datanodes may be unreachable or invalid in the cluster;
  3. when it is only the JAVA_HOME is not set problem, just hard-code it in $HADOOP_HOME/etc/hadoop/hadoop-env.sh, then re-run install.sh and select Copy hadoop configuration files for hadoop cluster;
  4. check if you are in safe mode and leave it with hdfs dfsadmin -safemode leave;
  5. if it is still not working, check the logs of the namenode and datanode: use ls -t in $HADOOP_HOME/logs to easily find the latest namenode or datanode log, which should be helpful for debugging;
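
A sketch of item 1 above (the tmp path depends on hadoop.tmp.dir in core-site.xml, so adjust it to your own setup):

    stop-yarn.sh && stop-dfs.sh        # stop the cluster first
    rm -rf /home/hadoop/hadoop/tmp/*   # on every host, master and slaves alike
    hdfs namenode -format              # reformat HDFS on the master
    start-dfs.sh && start-yarn.sh
    hdfs dfsadmin -report              # verify that all datanodes report in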

To be updated

There are several quite specific flaws that can be fixed as follows:

Of course, there are still many aspects that can be further optimized, including fault tolerance, portability, and maintainability.

Contribution

  1. Fork it.
  2. Create a branch (git checkout -b my_branch)
  3. Commit your changes (git commit -am "fix some serious issues in configuring hadoop locally")
  4. Push to the branch (git push origin my_branch)
  5. Open a Pull Request from the web page of your forked repository
  6. Enjoy a refreshing Diet Coke and carry on with your own stuff

Contributor

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. See LICENSE for more details.