DonFlat / ddps_assn1

Reproducibility Study

[Spike] Identify the experiments authors have designed #2

Open DonFlat opened 2 years ago

DonFlat commented 2 years ago
  1. What was the goal?
  2. How was it conducted?
  3. What kind of infrastructure was used?
  4. What dataset was used?
  5. At what scale (number of machines)?
  6. How many times was it repeated?
  7. What statistical methods were applied?
DonFlat commented 2 years ago

Is it possible to kill a node while a job is running? Is it possible to limit the memory of each node?
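On the second question: under YARN, the memory a node may hand out can be capped in yarn-site.xml; a sketch (values illustrative, not from this cluster). Killing a node mid-run can presumably be simulated by stopping its daemons on that host with `hdfs --daemon stop datanode` and `yarn --daemon stop nodemanager`.

```xml
<!-- yarn-site.xml: total memory this NodeManager offers to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<!-- largest single container the scheduler may grant -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
```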

DonFlat commented 2 years ago

Find out how to run the Pi calculation on both Spark and Hadoop.

Benchmark for Pi: how long does it take to reach a given decimal digit of precision?
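A sketch of running the bundled Pi examples (paths assume the hadoop-3.3.4 install used elsewhere in this thread; the Spark examples jar name depends on the build):

```shell
# Hadoop: Monte Carlo Pi from the bundled examples jar (16 maps, 100000 samples each)
/var/scratch/ddps2206/hadoop-3.3.4/bin/hadoop jar \
  /var/scratch/ddps2206/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar \
  pi 16 100000

# Spark: SparkPi from the examples jar (argument = number of partitions)
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100
```

Note that both examples estimate Pi by Monte Carlo sampling, so the error shrinks only as ~1/sqrt(samples): each additional decimal digit costs roughly 100x more samples, which matters for a time-per-digit benchmark.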

DonFlat commented 2 years ago

Examples to run hadoop example applications https://blog.csdn.net/carefree2005/article/details/121834803

Hadoop cluster setup:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://www.youtube.com/watch?v=_iP2Em-5Abw
https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/

Hadoop mapreduce example https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Overview

Hadoop configuration files manual https://hadoop.apache.org/docs/r3.3.4/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

It seems that word count is the only example we have for both Spark and Hadoop?

Spark cluster mode: https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types
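For reference, the `--master` URL decides which cluster manager spark-submit talks to; a sketch (the standalone hostname is assumed from the node names used later in this thread):

```shell
spark-submit --master local[4] ...              # run locally with 4 threads
spark-submit --master spark://node102:7077 ...  # Spark standalone master (7077 is the default port)
spark-submit --master yarn ...                  # YARN; reads HADOOP_CONF_DIR to find the resource manager
```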

DonFlat commented 2 years ago

Two files need to be updated with the latest node name:

  1. /var/scratch/ddps2206/hadoop-3.3.4/etc/hadoop/core-site.xml -- sets the namenode
  2. /var/scratch/ddps2206/hadoop-3.3.4/etc/hadoop/yarn-site.xml -- sets the resource manager

Both point to the master node. Also modify the workers file.

Default master node: node102; worker: node103.

hdfs-site.xml contains the replication factor.
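Hedged sketches of the properties those files carry, using the default node names above (port 9000 matches the HDFS URIs later in this thread; values illustrative):

```xml
<!-- core-site.xml: where the namenode listens -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://node102:9000</value>
</property>

<!-- yarn-site.xml: where the resource manager runs -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>node102</value>
</property>

<!-- hdfs-site.xml: replication factor (1 would suit a single worker) -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

The workers file (etc/hadoop/workers in Hadoop 3.x) simply lists one worker hostname per line, e.g. node103.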

DonFlat commented 2 years ago

https://stackoverflow.com/questions/28241251/hadoop-fs-ls-results-in-no-such-file-or-directory
https://stackoverflow.com/questions/27143409/what-the-command-hadoop-namenode-format-will-do
https://stackoverflow.com/questions/18862875/what-exactly-is-hadoop-namenode-formatting
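Summarizing those threads as a first-run sequence (a sketch; assumes the Hadoop bin/sbin directories are on PATH):

```shell
# WARNING: formatting erases namenode metadata -- only for first-time setup
hdfs namenode -format

start-dfs.sh     # namenode + datanodes
start-yarn.sh    # resourcemanager + nodemanagers

# 'hadoop fs -ls' without a path lists /user/<username>, which must exist first
# (this is the cause of the "No such file or directory" error above)
hdfs dfs -mkdir -p /user/ddps2206
hdfs dfs -ls /
```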

DonFlat commented 2 years ago

In HiBench, the following workloads have a Hadoop version: ml/kmeans, ml/bayes, websearch/pagerank, sql/aggregation, sql/join, sql/scan, micro/dfsioe, micro/sleep, micro/sort, micro/terasort, micro/wordcount.

DonFlat commented 2 years ago

The hadoop submission command:

    /var/scratch/ddps2206/hadoop-3.3.4/bin/hadoop \
      --config /var/scratch/ddps2206/hadoop-3.3.4/etc/hadoop \
      jar /var/scratch/ddps2206/HiBench/autogen/target/autogen-8.0-SNAPSHOT-jar-with-dependencies.jar \
      org.apache.mahout.clustering.kmeans.GenKMeansDataset \
      -D hadoop.job.history.user.location=hdfs://node108:9000/HiBench/Kmeans/Input/samples \
      -sampleDir hdfs://node108:9000/HiBench/Kmeans/Input/samples \
      -clusterDir hdfs://node108:9000/HiBench/Kmeans/Input/cluster \
      -numClusters 5 -numSamples 30000 -samplesPerFile 6000 -sampleDimension 3

DonFlat commented 2 years ago

/var/scratch/ddps2206/HiBench/conf/hibench.conf -- adjust the input size (tiny: 1.4M, small: 636M, large: 4.0G, huge: 18.5G)

/var/scratch/ddps2206/HiBench/conf/spark.conf -- adjust executor memory and cores; the default is 4G for both driver and executor.
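The keys involved, as a sketch (key names follow HiBench's stock config files; values illustrative):

```
# conf/hibench.conf -- selects which input scale the prepare scripts generate
hibench.scale.profile        small

# conf/spark.conf -- driver/executor resources
spark.executor.memory        4g
spark.driver.memory          4g
```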

Fetch the generated samples from HDFS: hadoop fs -get /HiBench/Kmeans/Input/samples/*