Hearen / HadoopInitializer

Using pure shell to configure hadoop 2.7.1 environment in CentOS 7.1 cluster
GNU General Public License v3.0

Utilize Starfish to optimize the Hadoop configuration #25

Closed Hearen closed 7 years ago

Hearen commented 7 years ago

The steps and the examples should be provided in detail.

Hearen commented 7 years ago

Of course, before everything else, you have to install Hadoop 0.20.2 properly first.
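
A quick sanity check that the installation works (a sketch; once the daemons are started, jps should list them):

    hadoop version    # should report 0.20.2
    jps               # should show NameNode, DataNode, JobTracker, TaskTracker on the relevant hosts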

  1. Download Starfish from: http://www.cs.duke.edu/starfish/files/starfish-0.3.0.tar.gz

  2. tar -zxvf starfish-0.3.0.tar.gz

  3. Go to starfish-0.3.0/bin and edit config.sh according to https://github.com/Hearen/Starfish/blob/master/starfish/docs/profile.readme; here I set the three values as follows:

    SLAVES_BTRACE_DIR=/home/hadoop/starfish-0.3.0/starfish_test/btrace_dir
    CLUSTER_NAME=starfish_test
    PROFILER_OUTPUT_DIR=/home/hadoop/starfish-0.3.0/starfish_test/profile_output_dir

  4. Install BTrace on all the machines by running ./install_btrace.sh slaves. Here slaves refers to a file containing all the hosts of the cluster you plan to monitor, one per line. An example:

    133.133.135.34
    133.133.135.37
    133.133.131.18

  5. In Hadoop 0.20.2, all the benchmarks we can use come from two built-in jars, hadoop-0.20.2-examples.jar and hadoop-0.20.2-test.jar (the test jar's TestDFSIO, nnbench, and mrbench are covered by the benchmarking reference linked in a later comment):

    • hadoop jar hadoop-0.20.2-examples.jar (running the jar without a program name prints the list of available examples):

      aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
      aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
      dbcount: An example job that count the pageview counts from a database.
      grep: A map/reduce program that counts the matches of a regex in the input.
      join: A job that effects a join over sorted, equally partitioned datasets.
      multifilewc: A job that counts words from several files.
      pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
      pi: A map/reduce program that estimates Pi using monte-carlo method.
      randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
      randomwriter: A map/reduce program that writes 10GB of random data per node.
      secondarysort: An example defining a secondary sort to the reduce.
      sleep: A job that sleeps at each map and reduce task.
      sort: A map/reduce program that sorts the data written by the random writer.
      sudoku: A sudoku solver.
      teragen: Generate data for the terasort.
      terasort: Run the terasort.
      teravalidate: Checking results of terasort.
      wordcount: A map/reduce program that counts the words in the input files.

  6. Try profiling now: ./profile hadoop jar /home/hadoop/hadoop/hadoop-0.20.2-examples.jar pi 16 10000
  7. Once the job has been profiled, we can utilize optimize to tune it; for more details see: https://github.com/Hearen/Starfish/blob/master/starfish/docs/optimize.readme (a consolidated sketch of all the steps follows this list).
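
Putting the steps above together, here is a consolidated sketch as one shell session; the paths and the cluster name are the example values from this setup, so adjust them for your own environment.

    # Fetch and unpack Starfish (steps 1-2)
    cd /home/hadoop
    wget http://www.cs.duke.edu/starfish/files/starfish-0.3.0.tar.gz
    tar -zxvf starfish-0.3.0.tar.gz
    cd starfish-0.3.0/bin

    # Step 3: edit config.sh so it contains the three values shown above:
    #   SLAVES_BTRACE_DIR=/home/hadoop/starfish-0.3.0/starfish_test/btrace_dir
    #   CLUSTER_NAME=starfish_test
    #   PROFILER_OUTPUT_DIR=/home/hadoop/starfish-0.3.0/starfish_test/profile_output_dir

    # Step 4: install BTrace on every host listed in the slaves file
    ./install_btrace.sh slaves

    # Step 6: profile a sample job
    ./profile hadoop jar /home/hadoop/hadoop/hadoop-0.20.2-examples.jar pi 16 10000
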
Hearen commented 7 years ago

To utilize its Visualizer and take advantage of its GUI, we have to install a GUI desktop first. Here I install one with yum -y groups install "GNOME Desktop", start the new desktop environment with startx, and then, to enable it permanently, execute systemctl set-default graphical.target so that startx is no longer needed each time we want to use the desktop.
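
For reference, those commands in order:

    yum -y groups install "GNOME Desktop"      # install the GNOME desktop environment
    startx                                     # start the desktop for the current session
    systemctl set-default graphical.target    # boot into the graphical target by default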

Hearen commented 7 years ago

References for the Hadoop 0.20.2 installation:

  1. http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
  2. https://venuktan.wordpress.com/2012/11/14/setting-up-hadoop-0-20-2-single-node-cluster-on-ubuntu/
Hearen commented 7 years ago

To force the namenode out of safe mode:

    hadoop dfsadmin -safemode leave
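
A minimal sketch of the related dfsadmin subcommands, handy for checking the state before forcing it:

    hadoop dfsadmin -safemode get     # report whether safe mode is on
    hadoop dfsadmin -safemode wait    # block until the namenode leaves safe mode on its own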

Hearen commented 7 years ago

http://yut.hatenablog.com/entry/20120510/1336606109

Hearen commented 7 years ago

PiEst:

    hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 10 200

Wordcount:

    wget http://www.gutenberg.org/files/4300/4300-0.txt
    hadoop jar /home/hadoop-test/hadoop/hadoop-0.20.2-examples.jar wordcount /user/hadoop-test/input /user/hadoop-test/wordcount/output
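
Note that wordcount reads its input from HDFS, so the downloaded text has to be uploaded before the job runs; a minimal sketch, assuming the /user/hadoop-test/input path used above:

    hadoop fs -mkdir /user/hadoop-test/input
    hadoop fs -put 4300-0.txt /user/hadoop-test/input/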

Teragen:

    hadoop jar /home/hadoop-test/hadoop/hadoop-0.20.2-examples.jar teragen 10000 /user/hadoop-test/tera/input

Terasort:

    hadoop jar /home/hadoop-test/hadoop/hadoop-0.20.2-examples.jar terasort /user/hadoop-test/tera/input /user/hadoop-test/tera/output

Teravalidate:

    hadoop jar /home/hadoop-test/hadoop/hadoop-0.20.2-examples.jar teravalidate /user/hadoop-test/tera/output /user/hadoop-test/tera/validate

Grep:

    hadoop jar /home/hadoop-test/hadoop/hadoop-0.20.2-examples.jar grep /user/hadoop-test/input /user/hadoop-test/grep/output 'ab?'
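
To tie these benchmarks back to Starfish, each of them can be run under the profiler from starfish-0.3.0/bin; a minimal sketch for the TeraSort pair, assuming the Starfish setup from the earlier comment:

    ./profile hadoop jar /home/hadoop-test/hadoop/hadoop-0.20.2-examples.jar teragen 10000 /user/hadoop-test/tera/input
    ./profile hadoop jar /home/hadoop-test/hadoop/hadoop-0.20.2-examples.jar terasort /user/hadoop-test/tera/input /user/hadoop-test/tera/output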

Hearen commented 7 years ago

To reinitialize HDFS from scratch (note that hadoop namenode -format erases all HDFS data):

    stop-all.sh
    hadoop namenode -format
    start-all.sh