Closed YaweiZhao closed 8 years ago
Typically a bind exception occurs because another copy of the daemon (in this case the Yarn daemon) is already running. I'd make sure that the daemon has been killed across all nodes before launching a job. The daemon may have been left running if a prior job was killed incorrectly and the resource manager did not clean the daemon up.
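As a concrete check (a sketch, assuming standard Hadoop/YARN daemon names and the default ResourceManager webapp port, which may differ on your system), something like this could be run on each node before launching a job:

```sh
# Look for leftover Hadoop/YARN daemons on this node
jps | grep -E 'NodeManager|ResourceManager|DataNode|NameNode'

# Check whether anything is still bound to the port in question
# (8088 is the default ResourceManager webapp port; adjust to your config)
ss -ltnp | grep ':8088'
```

If either command turns something up, kill those processes on every node before resubmitting.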
I ran teragen and terasort by hand, instead of using the batch script. Since I log in to Tianhe-1 and allocate fresh compute nodes to run teragen and terasort every time, there are no leftover Yarn daemons before I run terasort. Besides, I notice some information below in the file "magpie.sbatch-srun": Since the Tianhe-1 supercomputer does not have local storage, would it be more appropriate to run Hadoop over rawnetworkfs instead of HDFS over Lustre? Do you think that might cause the bug?
First, try using the Magpie terasort test by setting HADOOP_MODE to "terasort". If that works, then the problem may lie in how you are running it by hand.
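In the Magpie submission script (e.g. "magpie.sbatch-srun"), that amounts to roughly the following (a sketch; the exact surrounding script contents depend on your Magpie version):

```sh
# In your Magpie batch script: run the built-in terasort test
# instead of launching the job by hand
export HADOOP_MODE="terasort"
```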
I would recommend running with HDFS over Lustre first instead of rawnetworkfs. There is a known bug in terasort when running with rawnetworkfs. See https://issues.apache.org/jira/browse/MAPREDUCE-5528
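Switching to HDFS over Lustre in the Magpie script would look roughly like this (HADOOP_FILESYSTEM_MODE is from the Magpie configuration; the Lustre path shown is illustrative and must point at a directory you can write to):

```sh
# Use HDFS backed by Lustre rather than rawnetworkfs
export HADOOP_FILESYSTEM_MODE="hdfsoverlustre"
# Path on Lustre where HDFS will store its data (illustrative path)
export HADOOP_HDFSOVERLUSTRE_PATH="/lustre/$USER/hdfsoverlustre/"
```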
I tried to run terasort on an HPC cluster using the batch script in Magpie. However, the namenode always fails to exit safe mode. The log file is attached. slurm-1036890.txt
Then, I log in to the master node and exit safe mode by hand, but another error is reported.
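For reference, exiting safe mode by hand is typically done with the standard HDFS admin commands (run on the namenode host; note that forcing the exit only masks whatever kept the namenode in safe mode, which the namenode log should explain):

```sh
# Check current safe mode status on the namenode
hdfs dfsadmin -safemode get

# Force the namenode out of safe mode
hdfs dfsadmin -safemode leave
```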
Besides, something below confuses me. I remember that reducers cannot start until the maps have finished, so I set HADOOP_MAPREDUCE_SLOWSTART to 0.99. Am I misunderstanding something?
What errors are you seeing in the HDFS namenode master log file? That will explain why the namenode is failing to exit safe mode.
You do not need to change HADOOP_MAPREDUCE_SLOWSTART; that is an advanced tuning parameter. See the Hadoop configuration property mapreduce.job.reduce.slowstart.completedmaps if you want to read the nitty-gritty details of what it means.
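For reference, the underlying Hadoop property that variable maps to would appear in mapred-site.xml like this (0.05 is the stock Hadoop default; a value of 0.99 delays reducer launch until 99% of the maps have completed):

```xml
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <!-- Fraction of maps that must complete before reducers are scheduled -->
  <value>0.05</value>
</property>
```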
The errors in the log files are shown below. In file "hadoop-kd_yjp-secondarynamenode-cn872.log": In file "yarn-kd_yjp-resourcemanager-cn872.log": Then I changed the value of "fs.defaultFS" in the file "magpie/conf/core-site-2.0.xml" to "hdfs://hdfs/", as the error message suggests.
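For clarity, the change in "magpie/conf/core-site-2.0.xml" amounts to the following property (reconstructed from the description above; the rest of the file is unchanged):

```xml
<property>
  <name>fs.defaultFS</name>
  <!-- Value set as the error message suggested -->
  <value>hdfs://hdfs/</value>
</property>
```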
That error disappears. However, the namenode still cannot exit safe mode, and I find no errors in the log file.
Besides, I tried to run wordcount, and it also fails:
It seems that hdfs is not configured correctly.
This might not be the issue, as I am not familiar with Lustre, but have you tried setting MAGPIE_NO_LOCAL_DIR="yes"? If so, make sure that the configuration directories listed in the log can actually be accessed from the node. The HPC system I use does not have local storage, and I use HADOOP_FILESYSTEM_MODE="hdfsovernetworkfs". Perhaps you can post your job script as well.
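Taken together, the settings for a system without node-local storage would look roughly like this in the Magpie batch script (a sketch; the networkfs path is illustrative and the exact variable set depends on your Magpie version):

```sh
# For systems without node-local storage
export MAGPIE_NO_LOCAL_DIR="yes"
export HADOOP_FILESYSTEM_MODE="hdfsovernetworkfs"
# Path on the network filesystem for HDFS data (illustrative path)
export HADOOP_HDFSOVERNETWORKFS_PATH="/networkfs/$USER/hdfsovernetworkfs/"
```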
Dropping issue, I'm assuming it's been solved by now.
I am a PhD student and want to do data analysis on supercomputers such as Tianhe-1 and Tianhe-2, so I have deployed Magpie on both of them. After the deployment was complete, we tested Magpie using the examples shipped with Hadoop, such as TeraGen, TeraSort, PI, and WordCount.
However, it does not always work. On one hand, the TeraGen, PI, and WordCount applications run fine. On the other hand, the TeraSort application does not work at all. The error printed to the console of the master node is shown below:
Then I scan the log file of the slave node, i.e., cn1268, and find:
I tried shutting down all the Hadoop daemons and restarting them, but the problem persists. Besides, I recompiled the TeraSort program and found no problems on the master node, i.e., the resource manager (I added a lot of diagnostic output and printed it on the console of the master node).
p.s.
I wonder why applications like WordCount, PI, and TeraGen run successfully while TeraSort fails. Would it be possible to help us find the bug and improve Magpie together?