LLNL / magpie

Magpie contains a number of scripts for running Big Data software in HPC environments, including Hadoop and Spark. There is support for Lustre, Slurm, Moab, Torque, LSF, Flux, and more.

Problem when testing the "TeraSort" example #96

Closed YaweiZhao closed 8 years ago

YaweiZhao commented 8 years ago

I am a PhD student and want to do data analysis on supercomputers such as Tianhe-1 and Tianhe-2, so I have deployed Magpie on both machines. After the deployment was completed, we tested Magpie using the examples shipped with Hadoop, such as TeraGen, TeraSort, PI, and WordCount.
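
For reference, the stock Hadoop examples are typically invoked roughly like this (the jar path, row counts, and HDFS paths below are illustrative placeholders, not the exact commands used on Tianhe):

```bash
# Illustrative invocations of the stock Hadoop example jobs; the jar path,
# row counts, and HDFS paths are placeholders.
EXAMPLES_JAR="$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar

hadoop jar $EXAMPLES_JAR pi 16 100000                 # estimate pi with 16 maps
hadoop jar $EXAMPLES_JAR wordcount /wc-in /wc-out     # word count over /wc-in
hadoop jar $EXAMPLES_JAR teragen 10000000 /tera-in    # 10^7 rows of 100 bytes (~1 GB)
hadoop jar $EXAMPLES_JAR terasort /tera-in /tera-out  # sort the generated data
```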

However, it does not always work well. On one hand, it works fine for the TeraGen, PI, and WordCount applications. On the other hand, it does not work for the TeraSort application at all. The error printed to the console of the master node is shown below: [screenshot: console error on the master node]

I then scanned the log file of the slave node, i.e., cn1268, and found: [screenshot: slave node log excerpt]

I tried shutting down all the Hadoop daemons and restarting them, but the problem persists. I also recompiled the TeraSort program and found no problems on the master node, i.e., the resource manager (I added a lot of diagnostic output and printed it to the console of the master node).

p.s.

I wonder why applications like WordCount, PI, and TeraGen run successfully while TeraSort fails. Would it be possible to help us find the bug and improve Magpie together?

chu11 commented 8 years ago

Typically a bind exception occurs because another copy of the daemon (in this case the YARN daemon) is already running. I'd make sure that the daemon has been killed across all nodes before launching a job. The daemon may have been left running if a prior job was killed incorrectly and the resource manager did not kill the daemon correctly.
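
For example, inside a Slurm allocation, one rough way to check every node for leftover daemons before launching (assuming jps from the JDK and pkill are available on the compute nodes):

```bash
# Rough sketch: list leftover Hadoop/YARN JVMs on every allocated node
# (assumes a Slurm allocation with jps and pkill available on the nodes).
srun -N "$SLURM_NNODES" --ntasks-per-node=1 bash -c \
  'echo "== $(hostname) =="; jps | grep -E "NameNode|DataNode|ResourceManager|NodeManager" || echo "clean"'

# If anything is still running, kill it before resubmitting:
# srun -N "$SLURM_NNODES" --ntasks-per-node=1 pkill -f "hadoop|yarn"
```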

YaweiZhao commented 8 years ago

I ran teragen and terasort by hand instead of using the batch script. Since I log in to Tianhe-1 and allocate fresh compute nodes each time I run teragen and terasort, there are no existing active YARN daemons before I run terasort. Besides, I noticed the following in the file "magpie.sbatch-srun": [screenshot: excerpt of magpie.sbatch-srun]. Since the Tianhe-1 supercomputer does not have local storage, is it inappropriate to run Hadoop over rawnetworkfs instead of HDFS over Lustre? Do you think this might cause the bug?

chu11 commented 8 years ago

First, try the Magpie terasort test by setting HADOOP_MODE to "terasort". If that works, then there may be errors in how you are running it by hand.

I would recommend running with HDFS over Lustre first instead of rawnetworkfs. There is a known bug in terasort when running with rawnetworkfs. See https://issues.apache.org/jira/browse/MAPREDUCE-5528
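
A sketch of the relevant lines in the Magpie submission script (the exact value strings and the Lustre path variable name should be checked against the comments in your copy of the script):

```bash
# In the Magpie sbatch/job script (sketch; verify the exact strings against
# the comments in your copy of the script):
export HADOOP_MODE="terasort"                    # run Magpie's built-in terasort test
export HADOOP_FILESYSTEM_MODE="hdfsoverlustre"   # HDFS over Lustre instead of rawnetworkfs
# export HADOOP_HDFSOVERLUSTRE_PATH="/lustre/..."  # path variable name assumed here
```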

YaweiZhao commented 8 years ago

I tried running terasort on the HPC cluster using the batch script in Magpie. However, I find that the namenode always fails to exit safe mode. The log file is attached: slurm-1036890.txt

[screenshot: safe-mode status output] I then logged in to the master node and exited safe mode by hand, but another error was reported: [screenshot: subsequent error]
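
For reference, the manual safe-mode handling uses the standard HDFS admin commands, roughly:

```bash
# Standard HDFS admin commands, run on the namenode host as the Hadoop user:
hdfs dfsadmin -safemode get     # is the namenode still in safe mode?
hdfs dfsadmin -report           # have the datanodes registered and reported their blocks?
hdfs dfsadmin -safemode leave   # force the namenode out of safe mode
```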

Besides, the output below confuses me: [screenshot: job progress output]. I remember that reducers cannot start until the maps have finished, so I set HADOOP_MAPREDUCE_SLOWSTART to 0.99. Am I misunderstanding something?

chu11 commented 8 years ago

What errors are you seeing in the HDFS namenode master log file? That will explain why the namenode is failing to exit safe mode.

You do not need to change HADOOP_MAPREDUCE_SLOWSTART; that is an advanced tuning parameter. See the Hadoop configuration property mapreduce.job.reduce.slowstart.completedmaps if you want to read about the nitty-gritty details of what it means.
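
As a rough sketch of how that maps onto the Magpie setting (the default value and behaviour described below come from the Hadoop documentation, not from this thread):

```bash
# HADOOP_MAPREDUCE_SLOWSTART feeds mapreduce.job.reduce.slowstart.completedmaps.
# The Hadoop default is 0.05: reducers are scheduled (and begin shuffling map
# output) once 5% of the maps complete, but the actual reduce phase still waits
# for all maps to finish. That is why the progress output can show a reduce
# percentage before the maps reach 100%.
# export HADOOP_MAPREDUCE_SLOWSTART="0.05"
```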

YaweiZhao commented 8 years ago

The errors in the log files are shown below: in "hadoop-kd_yjp-secondarynamenode-cn872.log": [screenshot capture11]; in "yarn-kd_yjp-resourcemanager-cn872.log": [screenshot capture10]. I then changed the value of "fs.defaultFS" in the file "magpie/conf/core-site-2.0.xml" as the error message suggested; its value is now "hdfs://hdfs/". [screenshot capture12]

That error has disappeared, but the namenode still cannot exit safe mode, and I find no errors in the log file.

Besides, I tried to run wordcount, but it also fails: [screenshot: wordcount error]

It seems that hdfs is not configured correctly.
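
For reference, a few standard commands that usually show whether HDFS itself is reachable and writable (the test path is just an example):

```bash
# Basic HDFS sanity checks; the test path is illustrative.
hdfs dfsadmin -report               # are the datanodes up and reporting capacity?
hdfs dfs -ls /                      # can the client reach the namenode at fs.defaultFS?
hdfs dfs -mkdir -p /tmp/smoketest   # can we create a directory?
hdfs dfs -put /etc/hostname /tmp/smoketest/ && hdfs dfs -cat /tmp/smoketest/hostname
```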

bpanneton commented 8 years ago

This might not be the issue, as I am not familiar with Lustre, but have you tried setting MAGPIE_NO_LOCAL_DIR="yes"? If so, make sure that the configuration directories listed in the log can actually be accessed from the node. The HPC system I use does not have local storage, and I use HADOOP_FILESYSTEM_MODE="hdfsovernetworkfs". Perhaps you can post your job script as well.
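
As a sketch, those settings go in the Magpie job script roughly like this (the networkfs path variable name is an assumption; check the comments in your copy of the script):

```bash
# Sketch of the settings mentioned above, in the Magpie job script:
export MAGPIE_NO_LOCAL_DIR="yes"                   # avoid node-local dirs on diskless nodes
export HADOOP_FILESYSTEM_MODE="hdfsovernetworkfs"  # HDFS on top of a network filesystem
# export HADOOP_HDFSOVERNETWORKFS_PATH="/path/on/networkfs"  # variable name assumed
```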

chu11 commented 8 years ago

dropping issue, i'm assuming solved by now