alibaba / mpich2-yarn

Running MPICH2 on Yarn

tips for debugging? #34


schmidb commented 9 years ago

Hi,

every run stops at "INFO client.Client: Initializing ApplicationMaster" (cpi works with mpiexec on the master node):

[hadoop@ip-172-31-36-126 ~]$ hadoop jar mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar -a ./cpi -M 1024 -m 1024 -n 5 14/09/24 18:54:39 INFO client.Client: Initializing Client 14/09/24 18:54:39 INFO client.Client: Container number is 5 14/09/24 18:54:39 INFO client.Client: Application Master's jar is /home/hadoop/mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar 14/09/24 18:54:39 INFO client.Client: Starting Client 14/09/24 18:54:39 INFO util.Utilities: ****BELOW IS CONFIGUATIONS FROM Client **** key=TERM; value=xterm-256color key=HADOOP_PREFIX; value=/home/hadoop key=PIG_CONF_DIR; value=/home/hadoop/pig/conf key=JAVA_HOME; value=/usr/java/latest key=HBASE_HOME; value=/home/hadoop/hbase key=HIVE_HOME; value=/home/hadoop/hive key=HADOOP_YARN_HOME; value=/home/hadoop key=HADOOP_DATANODE_HEAPSIZE; value=384 key=SSH_CLIENT; value=54.240.217.9 19295 22 key=HADOOP_NAMENODE_HEAPSIZE; value=768 key=YARN_HOME; value=/home/hadoop key=MAIL; value=/var/spool/mail/hadoop key=HOSTNAME; value=ip-172-31-36-126.ec2.internal key=PWD; value=/home/hadoop key=IMPALA_CONF_DIR; value=/home/hadoop/impala/conf key=LESS_TERMCAP_mb; value= key=LESS_TERMCAP_me; value= key=LESS_TERMCAP_md; value= key=NLSPATH; value=/usr/dt/lib/nls/msg/%L/%N.cat key=AWS_AUTO_SCALING_HOME; value=/opt/aws/apitools/as key=HISTSIZE; value=1000 key=HADOOP_COMMON_HOME; value=/home/hadoop key=PATH; value=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin key=HIVE_CONF_DIR; value=/home/hadoop/hive/conf key=HADOOPCLASSPATH; value=:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/lib/_ key=HADOOP_CONF_DIR; value=/home/hadoop/conf key=IMPALA_HOME; 
value=/home/hadoop/impala key=AWS_IAM_HOME; value=/opt/aws/apitools/iam key=SHLVL; value=1 key=XFILESEARCHPATH; value=/usr/dt/app-defaults/%L/Dt key=AWS_CLOUDWATCH_HOME; value=/opt/aws/apitools/mon key=EC2_AMITOOL_HOME; value=/opt/aws/amitools/ec2 key=HADOOP_HOME_WARN_SUPPRESS; value=true key=PIG_CLASSPATH; value=/home/hadoop/pig/lib key=AWS_RDS_HOME; value=/opt/aws/apitools/rds key=LESS_TERMCAP_se; value= key=SSH_TTY; value=/dev/pts/0 key=MAHOUT_CONF_DIR; value=/home/hadoop/mahout/conf key=HBASE_CONF_DIR; value=/home/hadoop/hbase/conf key=LOGNAME; value=hadoop key=YARN_CONF_DIR; value=/home/hadoop/conf key=AWS_PATH; value=/opt/aws key=HADOOP_HOME; value=/home/hadoop key=LD_LIBRARY_PATH; value=/home/hadoop/lib/native:/usr/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib::/home/hadoop/lib/native key=MALLOC_ARENA_MAX; value=4 key=SSH_CONNECTION; value=54.240.217.9 19295 172.31.36.126 22 key=HADOOP_OPTS; value= -server -Dhadoop.log.dir=/home/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -XX:MaxPermSize=128m -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30 key=MAHOUT_LOG_DIR; value=/mnt/var/log/apps key=SHELL; value=/bin/bash key=LCCTYPE; value=UTF-8 key=CLASSPATH; value=/home/hadoop/conf:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/:/home/hadoop/share/hadoop/hdfs:/home/hadoop/share/hadoop/hdfs/lib/:/home/hadoop/share/hadoop/hdfs/:/home/hadoop/share/hadoop/yarn/lib/:/home/hadoop/share/hadoop/yarn/:/home/hadoop/share/hadoop/mapreduce/lib/:/home/hadoop/share/hadoop/mapreduce/::/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/lib/_ key=PIG_HOME; value=/home/hadoop/pig key=EC2_HOME; value=/opt/aws/apitools/ec2 key=LESS_TERMCAP_ue; value= key=LC_ALL; value=en_US.UTF-8 key=AWS_ELB_HOME; value=/opt/aws/apitools/elb key=USER; value=hadoop 
key=HADOOP_HDFS_HOME; value=/home/hadoop key=HADOOP_CLIENT_OPTS; value= -XX:MaxPermSize=128m key=RUBYOPT; value=rubygems key=HISTCONTROL; value=ignoredups key=HOME; value=/home/hadoop key=MAHOUT_HOME; value=/home/hadoop/mahout key=LESSOPEN; value=|/usr/bin/lesspipe.sh %s key=LSCOLORS; value=rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:.tar=38;5;9:.tgz=38;5;9:.arc=38;5;9:.arj=38;5;9:.taz=38;5;9:.lha=38;5;9:.lzh=38;5;9:.lzma=38;5;9:.tlz=38;5;9:.txz=38;5;9:.tzo=38;5;9:.t7z=38;5;9:.zip=38;5;9:.z=38;5;9:.Z=38;5;9:.dz=38;5;9:.gz=38;5;9:.lrz=38;5;9:.lz=38;5;9:.lzo=38;5;9:.xz=38;5;9:.bz2=38;5;9:.bz=38;5;9:.tbz=38;5;9:.tbz2=38;5;9:.tz=38;5;9:.deb=38;5;9:.rpm=38;5;9:.jar=38;5;9:.war=38;5;9:.ear=38;5;9:.sar=38;5;9:.rar=38;5;9:.alz=38;5;9:.ace=38;5;9:.zoo=38;5;9:.cpio=38;5;9:.7z=38;5;9:.rz=38;5;9:.cab=38;5;9:.jpg=38;5;13:.jpeg=38;5;13:.gif=38;5;13:.bmp=38;5;13:.pbm=38;5;13:.pgm=38;5;13:.ppm=38;5;13:.tga=38;5;13:.xbm=38;5;13:.xpm=38;5;13:.tif=38;5;13:.tiff=38;5;13:.png=38;5;13:.svg=38;5;13:.svgz=38;5;13:.mng=38;5;13:.pcx=38;5;13:.mov=38;5;13:.mpg=38;5;13:.mpeg=38;5;13:.m2v=38;5;13:.mkv=38;5;13:.ogm=38;5;13:.mp4=38;5;13:.m4v=38;5;13:.mp4v=38;5;13:.vob=38;5;13:.qt=38;5;13:.nuv=38;5;13:.wmv=38;5;13:.asf=38;5;13:.rm=38;5;13:.rmvb=38;5;13:.flc=38;5;13:.avi=38;5;13:.fli=38;5;13:.flv=38;5;13:.gl=38;5;13:.dl=38;5;13:.xcf=38;5;13:.xwd=38;5;13:.yuv=38;5;13:.cgm=38;5;13:.emf=38;5;13:.axv=38;5;13:.anx=38;5;13:.ogv=38;5;13:.ogx=38;5;13:.aac=38;5;45:.au=38;5;45:.flac=38;5;45:.mid=38;5;45:.midi=38;5;45:.mka=38;5;45:.mp3=38;5;45:.mpc=38;5;45:.ogg=38;5;45:.ra=38;5;45:.wav=38;5;45:.axa=38;5;45:.oga=38;5;45:.spx=38;5;45:.xspf=38;5;45: key=LESS_TERMCAP_us; value= key=LANG; value=en_US.UTF-8 key=HADOOP_MAPRED_HOME; value=/home/hadoop 14/09/24 18:54:39 INFO 
util.Utilities: Checking some environment variable is properly set. 14/09/24 18:54:39 INFO util.Utilities: HADOOP_CONF_DIR=/home/hadoop/conf 14/09/24 18:54:39 INFO util.Utilities: YARN_CONFDIR=/home/hadoop/conf 14/09/24 18:54:39 INFO util.Utilities: PATH=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin 14/09/24 18:54:39 INFO util.Utilities: Checking conf is correct 14/09/24 18:54:39 INFO util.Utilities: yarn.resourcemanager.hostname=0.0.0.0 14/09/24 18:54:39 INFO util.Utilities: yarn.resourcemanager.address=172.31.36.126:9022 14/09/24 18:54:39 INFO util.Utilities: yarn.resourcemanager.scheduler.address=172.31.36.126:9024 14/09/24 18:54:39 INFO util.Utilities: 0.0.0.0:8032=null 14/09/24 18:54:39 INFO util.Utilities: 0.0.0.0:8030=null 14/09/24 18:54:39 INFO util.Utilities: yarn.mpi.container.allocator=null 14/09/24 18:54:39 INFO util.Utilities: ****** 14/09/24 18:54:39 INFO util.Utilities: Connecting to ResourceManager at /172.31.36.126:9022 14/09/24 18:54:39 INFO client.Client: Got new application id=application_1411583461927_0008 14/09/24 18:54:39 INFO client.Client: Got Applicatioin: application_1411583461927_0008 14/09/24 18:54:39 INFO client.Client: Max mem capabililty of resources in this cluster 3072 14/09/24 18:54:39 INFO client.Client: Setting up application submission context for ASM 14/09/24 18:54:39 INFO client.Client: Set Application Id: application_1411583461927_0008 14/09/24 18:54:39 INFO client.Client: Set Application Name: MPICH2-cpi 14/09/24 18:54:39 INFO client.Client: Copy App Master jar from local filesystem and add to local environment 14/09/24 18:54:39 INFO client.Client: 
Source path: /home/hadoop/mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar 14/09/24 18:54:39 INFO client.Client: Destination path: hdfs://172.31.36.126:9000/tmp/MPICH2-cpi/8/AppMaster.jar 14/09/24 18:54:40 INFO client.Client: Copy MPI application from local filesystem to remote. 14/09/24 18:54:40 INFO client.Client: Source path: cpi 14/09/24 18:54:40 INFO client.Client: Destination path: hdfs://172.31.36.126:9000/tmp/MPICH2-cpi/8/MPIExec 14/09/24 18:54:40 INFO client.Client: Set the environment for the application master and mpi application 14/09/24 18:54:40 INFO client.Client: Trying to generate classpath for app master from current thread's classpath 14/09/24 18:54:40 INFO client.Client: Could not classpath resource from class loader 14/09/24 18:54:40 INFO client.Client: Setting up app master command 14/09/24 18:54:40 INFO client.Client: Completed setting up app master command ${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.ApplicationMaster --container_memory 1024 --num_containers 5 --priority 0 1>/AppMaster.stdout 2>/AppMaster.stderr 14/09/24 18:54:40 INFO client.Client: Submitting application to ASM 14/09/24 18:54:40 INFO client.Client: Submisstion result: true 14/09/24 18:54:40 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:41 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:42 INFO client.Client: Got application report 
from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:43 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:44 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:45 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:46 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:47 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, 
yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:48 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:49 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=ip-172-31-38-17.ec2.internal, rpcPort:42455, appQueue=default, appMasterRpcPort=42455, appStartTime=1411584880804, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop 14/09/24 18:54:49 INFO util.Utilities: Connecting to ApplicationMaster at ip-172-31-38-17.ec2.internal/172.31.38.17:42455 14/09/24 18:54:49 INFO client.Client: Initializing ApplicationMaster ^C14/09/24 18:56:50 INFO util.Utilities: Killing appliation with id: application_1411583461927_0008 [hadoop@ip-172-31-36-126 ~]$ hadoop jar mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar -a cpi -M 1024 -m 1024 -n 5 14/09/24 18:56:57 INFO client.Client: Initializing Client 14/09/24 18:56:57 INFO client.Client: Container number is 5 14/09/24 18:56:57 INFO client.Client: Application Master's jar is /home/hadoop/mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar 14/09/24 18:56:57 INFO client.Client: Starting Client 14/09/24 18:56:57 INFO util.Utilities: ****_BELOW IS CONFIGUATIONS FROM Client *_* key=TERM; value=xterm-256color key=HADOOP_PREFIX; value=/home/hadoop key=PIG_CONF_DIR; value=/home/hadoop/pig/conf key=JAVA_HOME; value=/usr/java/latest key=HBASE_HOME; value=/home/hadoop/hbase key=HIVE_HOME; value=/home/hadoop/hive key=HADOOP_YARN_HOME; value=/home/hadoop 
key=HADOOP_DATANODE_HEAPSIZE; value=384 key=SSH_CLIENT; value=54.240.217.9 19295 22 key=HADOOP_NAMENODE_HEAPSIZE; value=768 key=YARN_HOME; value=/home/hadoop key=MAIL; value=/var/spool/mail/hadoop key=HOSTNAME; value=ip-172-31-36-126.ec2.internal key=PWD; value=/home/hadoop key=IMPALA_CONF_DIR; value=/home/hadoop/impala/conf key=LESS_TERMCAP_mb; value= key=LESS_TERMCAP_me; value= key=LESS_TERMCAP_md; value= key=NLSPATH; value=/usr/dt/lib/nls/msg/%L/%N.cat key=AWS_AUTO_SCALING_HOME; value=/opt/aws/apitools/as key=HISTSIZE; value=1000 key=HADOOP_COMMON_HOME; value=/home/hadoop key=PATH; value=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin key=HIVE_CONF_DIR; value=/home/hadoop/hive/conf key=HADOOPCLASSPATH; value=:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/lib/_ key=HADOOP_CONF_DIR; value=/home/hadoop/conf key=IMPALA_HOME; value=/home/hadoop/impala key=AWS_IAM_HOME; value=/opt/aws/apitools/iam key=SHLVL; value=1 key=XFILESEARCHPATH; value=/usr/dt/app-defaults/%L/Dt key=AWS_CLOUDWATCH_HOME; value=/opt/aws/apitools/mon key=EC2_AMITOOL_HOME; value=/opt/aws/amitools/ec2 key=HADOOP_HOME_WARN_SUPPRESS; value=true key=PIG_CLASSPATH; value=/home/hadoop/pig/lib key=AWS_RDS_HOME; value=/opt/aws/apitools/rds key=LESS_TERMCAP_se; value= key=SSH_TTY; value=/dev/pts/0 key=MAHOUT_CONF_DIR; value=/home/hadoop/mahout/conf key=HBASE_CONF_DIR; value=/home/hadoop/hbase/conf key=LOGNAME; value=hadoop key=YARN_CONF_DIR; value=/home/hadoop/conf key=AWS_PATH; value=/opt/aws key=HADOOP_HOME; value=/home/hadoop key=LD_LIBRARY_PATH; 
value=/home/hadoop/lib/native:/usr/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib::/home/hadoop/lib/native key=MALLOC_ARENA_MAX; value=4 key=SSH_CONNECTION; value=54.240.217.9 19295 172.31.36.126 22 key=HADOOP_OPTS; value= -server -Dhadoop.log.dir=/home/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -XX:MaxPermSize=128m -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30 key=MAHOUT_LOG_DIR; value=/mnt/var/log/apps key=SHELL; value=/bin/bash key=LCCTYPE; value=UTF-8 key=CLASSPATH; value=/home/hadoop/conf:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/:/home/hadoop/share/hadoop/hdfs:/home/hadoop/share/hadoop/hdfs/lib/:/home/hadoop/share/hadoop/hdfs/:/home/hadoop/share/hadoop/yarn/lib/:/home/hadoop/share/hadoop/yarn/:/home/hadoop/share/hadoop/mapreduce/lib/:/home/hadoop/share/hadoop/mapreduce/::/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/lib/_ key=PIG_HOME; value=/home/hadoop/pig key=EC2_HOME; value=/opt/aws/apitools/ec2 key=LESS_TERMCAP_ue; value= key=LC_ALL; value=en_US.UTF-8 key=AWS_ELB_HOME; value=/opt/aws/apitools/elb key=USER; value=hadoop key=HADOOP_HDFS_HOME; value=/home/hadoop key=HADOOP_CLIENT_OPTS; value= -XX:MaxPermSize=128m key=RUBYOPT; value=rubygems key=HISTCONTROL; value=ignoredups key=HOME; value=/home/hadoop key=MAHOUT_HOME; value=/home/hadoop/mahout key=LESSOPEN; value=|/usr/bin/lesspipe.sh %s key=LSCOLORS; 
value=rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:.tar=38;5;9:.tgz=38;5;9:.arc=38;5;9:.arj=38;5;9:.taz=38;5;9:.lha=38;5;9:.lzh=38;5;9:.lzma=38;5;9:.tlz=38;5;9:.txz=38;5;9:.tzo=38;5;9:.t7z=38;5;9:.zip=38;5;9:.z=38;5;9:.Z=38;5;9:.dz=38;5;9:.gz=38;5;9:.lrz=38;5;9:.lz=38;5;9:.lzo=38;5;9:.xz=38;5;9:.bz2=38;5;9:.bz=38;5;9:.tbz=38;5;9:.tbz2=38;5;9:.tz=38;5;9:.deb=38;5;9:.rpm=38;5;9:.jar=38;5;9:.war=38;5;9:.ear=38;5;9:.sar=38;5;9:.rar=38;5;9:.alz=38;5;9:.ace=38;5;9:.zoo=38;5;9:.cpio=38;5;9:.7z=38;5;9:.rz=38;5;9:.cab=38;5;9:.jpg=38;5;13:.jpeg=38;5;13:.gif=38;5;13:.bmp=38;5;13:.pbm=38;5;13:.pgm=38;5;13:.ppm=38;5;13:.tga=38;5;13:.xbm=38;5;13:.xpm=38;5;13:.tif=38;5;13:.tiff=38;5;13:.png=38;5;13:.svg=38;5;13:.svgz=38;5;13:.mng=38;5;13:.pcx=38;5;13:.mov=38;5;13:.mpg=38;5;13:.mpeg=38;5;13:.m2v=38;5;13:.mkv=38;5;13:.ogm=38;5;13:.mp4=38;5;13:.m4v=38;5;13:.mp4v=38;5;13:.vob=38;5;13:.qt=38;5;13:.nuv=38;5;13:.wmv=38;5;13:.asf=38;5;13:.rm=38;5;13:.rmvb=38;5;13:.flc=38;5;13:.avi=38;5;13:.fli=38;5;13:.flv=38;5;13:.gl=38;5;13:.dl=38;5;13:.xcf=38;5;13:.xwd=38;5;13:.yuv=38;5;13:.cgm=38;5;13:.emf=38;5;13:.axv=38;5;13:.anx=38;5;13:.ogv=38;5;13:.ogx=38;5;13:.aac=38;5;45:.au=38;5;45:.flac=38;5;45:.mid=38;5;45:.midi=38;5;45:.mka=38;5;45:.mp3=38;5;45:.mpc=38;5;45:.ogg=38;5;45:.ra=38;5;45:.wav=38;5;45:.axa=38;5;45:.oga=38;5;45:.spx=38;5;45:.xspf=38;5;45: key=LESS_TERMCAP_us; value= key=LANG; value=en_US.UTF-8 key=HADOOP_MAPRED_HOME; value=/home/hadoop 14/09/24 18:56:58 INFO util.Utilities: Checking some environment variable is properly set. 
14/09/24 18:56:58 INFO util.Utilities: HADOOP_CONF_DIR=/home/hadoop/conf 14/09/24 18:56:58 INFO util.Utilities: YARN_CONFDIR=/home/hadoop/conf 14/09/24 18:56:58 INFO util.Utilities: PATH=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin 14/09/24 18:56:58 INFO util.Utilities: Checking conf is correct 14/09/24 18:56:58 INFO util.Utilities: yarn.resourcemanager.hostname=0.0.0.0 14/09/24 18:56:58 INFO util.Utilities: yarn.resourcemanager.address=172.31.36.126:9022 14/09/24 18:56:58 INFO util.Utilities: yarn.resourcemanager.scheduler.address=172.31.36.126:9024 14/09/24 18:56:58 INFO util.Utilities: 0.0.0.0:8032=null 14/09/24 18:56:58 INFO util.Utilities: 0.0.0.0:8030=null 14/09/24 18:56:58 INFO util.Utilities: yarn.mpi.container.allocator=null 14/09/24 18:56:58 INFO util.Utilities: **** 14/09/24 18:56:58 INFO util.Utilities: Connecting to ResourceManager at /172.31.36.126:9022 14/09/24 18:56:58 INFO client.Client: Got new application id=application_1411583461927_0009 14/09/24 18:56:58 INFO client.Client: Got Applicatioin: application_1411583461927_0009 14/09/24 18:56:58 INFO client.Client: Max mem capabililty of resources in this cluster 3072 14/09/24 18:56:58 INFO client.Client: Setting up application submission context for ASM 14/09/24 18:56:58 INFO client.Client: Set Application Id: application_1411583461927_0009 14/09/24 18:56:58 INFO client.Client: Set Application Name: MPICH2-cpi 14/09/24 18:56:58 INFO client.Client: Copy App Master jar from local filesystem and add to local environment 14/09/24 18:56:58 INFO client.Client: Source path: 
/home/hadoop/mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar 14/09/24 18:56:58 INFO client.Client: Destination path: hdfs://172.31.36.126:9000/tmp/MPICH2-cpi/9/AppMaster.jar 14/09/24 18:56:58 INFO client.Client: Copy MPI application from local filesystem to remote. 14/09/24 18:56:58 INFO client.Client: Source path: cpi 14/09/24 18:56:58 INFO client.Client: Destination path: hdfs://172.31.36.126:9000/tmp/MPICH2-cpi/9/MPIExec 14/09/24 18:56:58 INFO client.Client: Set the environment for the application master and mpi application 14/09/24 18:56:58 INFO client.Client: Trying to generate classpath for app master from current thread's classpath 14/09/24 18:56:58 INFO client.Client: Could not classpath resource from class loader 14/09/24 18:56:58 INFO client.Client: Setting up app master command 14/09/24 18:56:58 INFO client.Client: Completed setting up app master command ${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.ApplicationMaster --container_memory 1024 --num_containers 5 --priority 0 1>/AppMaster.stdout 2>/AppMaster.stderr 14/09/24 18:56:58 INFO client.Client: Submitting application to ASM 14/09/24 18:56:58 INFO client.Client: Submisstion result: true 14/09/24 18:56:58 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop 14/09/24 18:56:59 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop 14/09/24 18:57:00 INFO client.Client: Got application report from ASM for, 
appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop 14/09/24 18:57:01 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop 14/09/24 18:57:02 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop 14/09/24 18:57:03 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop 14/09/24 18:57:04 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=ip-172-31-38-17.ec2.internal, rpcPort:56056, appQueue=default, appMasterRpcPort=56056, appStartTime=1411585018942, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop 14/09/24 18:57:05 INFO util.Utilities: Connecting to ApplicationMaster at ip-172-31-38-17.ec2.internal/172.31.38.17:56056 14/09/24 18:57:05 INFO client.Client: Initializing ApplicationMaster

Data and apps seem to be available on HDFS:

[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /
Found 1 items
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:50 /tmp
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp
Found 2 items
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:56 /tmp/MPICH2-cpi
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:31 /tmp/hadoop-yarn
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/hadoop-yarn
Found 1 items
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:31 /tmp/hadoop-yarn/staging
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/hadoop-yarn/staging
Found 1 items
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:31 /tmp/hadoop-yarn/staging/history
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/hadoop-yarn/staging/history
Found 2 items
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:31 /tmp/hadoop-yarn/staging/history/done
drwxrwxrwt   - hadoop supergroup          0 2014-09-24 18:31 /tmp/hadoop-yarn/staging/history/done_intermediate
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/hadoop-yarn/staging/history/done
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/MPICH2-cpi
Found 3 items
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:50 /tmp/MPICH2-cpi/7
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:54 /tmp/MPICH2-cpi/8
drwxrwx---   - hadoop supergroup          0 2014-09-24 18:56 /tmp/MPICH2-cpi/9
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/MPICH2-cpi/9
Found 2 items
-rw-r--r--   2 hadoop supergroup      96333 2014-09-24 18:56 /tmp/MPICH2-cpi/9/AppMaster.jar
-rw-r--r--   2 hadoop supergroup       9598 2014-09-24 18:56 /tmp/MPICH2-cpi/9/MPIExec
[hadoop@ip-172-31-36-126 ~]$

Any idea how to debug this?

Thanks a lot, Markus

schmidb commented 9 years ago

mpich2-yarn is only required on the master node, correct? ("mpicc for MPICH version 3.1.2" is installed on all machines and is on the PATH!)

stevenybw commented 9 years ago

Hi @schmidb!

Actually, mpich2-yarn is only required on the client side; mpich2-yarn will distribute the necessary files itself. The log you submitted here is from the console, which is not complete. The complete ApplicationMaster log can be viewed in the YARN ResourceManager web UI at http://${RM_ADDRESS}:8088. Please post that log here and we will check what is wrong.
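If the ResourceManager web UI is hard to reach (e.g. behind an EC2 security group), the same logs can usually be pulled from the command line. A sketch, assuming log aggregation is enabled and using the application id from the run above; the local log path shown is only the common default and may differ on your cluster:

```shell
# Fetch the aggregated ApplicationMaster/container logs for a finished run.
# Requires yarn.log-aggregation-enable=true in yarn-site.xml.
yarn logs -applicationId application_1411583461927_0009

# Without log aggregation, AppMaster.stdout / AppMaster.stderr stay in the
# NodeManager's local container log dir on the AM host (default location):
ls $HADOOP_HOME/logs/userlogs/application_1411583461927_0009/
```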

Thank you! Steven

schmidb commented 9 years ago

Hi @stevenybw

I found an error in AppMaster.stderr:

14/09/25 09:23:49 INFO server.ApplicationMaster: Initializing ApplicationMaster 14/09/25 09:23:49 INFO server.ApplicationMaster: Application master for app, appId=2, clustertimestamp=1411636003283, attemptId=1 14/09/25 09:23:49 INFO server.ApplicationMaster: HDFS mpi application location: hdfs://172.31.44.150:9000/tmp/MPICH2-cpi/2/MPIExec 14/09/25 09:23:49 INFO server.ApplicationMaster: HDFS AppMaster.jar location: hdfs://172.31.44.150:9000/tmp/MPICH2-cpi/2/AppMaster.jar 14/09/25 09:23:49 INFO server.ApplicationMaster: Environment NM_HOST is ip-172-31-42-162.ec2.internal 14/09/25 09:23:49 INFO server.ApplicationMaster: Container memory is 1024 MB 14/09/25 09:23:49 INFO server.ApplicationMaster: Starting ApplicationMaster 14/09/25 09:23:49 INFO server.ApplicationMaster: Creating AM<->RM Protocol... 14/09/25 09:23:49 INFO client.RMProxy: Connecting to ResourceManager at /172.31.44.150:9024 14/09/25 09:23:49 INFO server.ApplicationMaster: Creating AM<->NM Protocol... 14/09/25 09:23:49 INFO impl.NMClientAsyncImpl: Upper bound of the thread pool size is 500 14/09/25 09:23:49 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-nodemanagers-proxies : 500 14/09/25 09:23:49 INFO server.ApplicationMaster: Initializing MPDProtocal's RPC services... 14/09/25 09:23:49 INFO server.TaskHeartbeatHandler: TaskHeartbeatHandler starts successfully 14/09/25 09:23:49 INFO server.ApplicationMaster: Starting MPDProtocal's RPC services... 14/09/25 09:23:49 INFO ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 14/09/25 09:23:49 INFO ipc.Server: Starting Socket Reader #1 for port 60736 14/09/25 09:23:49 INFO ipc.Server: IPC Server Responder: starting 14/09/25 09:23:49 INFO ipc.Server: IPC Server listener on 60736: starting 14/09/25 09:23:49 INFO server.TaskHeartbeatHandler: TaskHeartbeatHandler PingChecker starts successfully 14/09/25 09:23:49 INFO server.ApplicationMaster: Initiallizing MPIClient service and WebApp... 
14/09/25 09:23:49 INFO server.ApplicationMaster: Starting MPIClient service... 14/09/25 09:23:49 INFO server.MPIClientService: Initializing MPIClientProtocol's RPC services 14/09/25 09:23:49 INFO ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 14/09/25 09:23:49 INFO ipc.Server: Starting Socket Reader #1 for port 59033 14/09/25 09:23:49 INFO ipc.Server: IPC Server Responder: starting 14/09/25 09:23:49 INFO ipc.Server: IPC Server listener on 59033: starting 14/09/25 09:23:49 INFO server.MPIClientService: Starting MPIClientProtocol's RPC service atip-172-31-42-162.ec2.internal/172.31.42.162:59033 14/09/25 09:23:50 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 14/09/25 09:23:50 INFO http.HttpRequestLog: Http request log for http.requests.mapreduce is not defined 14/09/25 09:23:50 INFO http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter) 14/09/25 09:23:50 INFO http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context mapreduce 14/09/25 09:23:50 INFO http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static 14/09/25 09:23:50 INFO http.HttpServer2: adding path spec: /mapreduce/* 14/09/25 09:23:50 INFO http.HttpServer2: Jetty bound to port 57334 14/09/25 09:23:50 INFO mortbay.log: jetty-6.1.26-emr 14/09/25 09:23:50 INFO mortbay.log: Extract jar:file:/home/hadoop/.versions/2.4.0/share/hadoop/yarn/hadoop-yarn-common-2.4.0.jar!/webapps/mapreduce to /tmp/Jetty_0_0_0_0_57334_mapreduce____.9fpr09/webapp 14/09/25 09:23:50 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:57334 14/09/25 09:23:50 INFO webapp.WebApps: Web app /mapreduce started at 57334 14/09/25 09:23:51 INFO webapp.WebApps: Registered webapp guice modules 14/09/25 09:23:51 INFO 
server.ApplicationMaster: Application Master tracking url is ip-172-31-42-162.ec2.internal:57334 14/09/25 09:23:51 INFO server.ApplicationMaster: Trying to generate key to temp dir: /mnt/var/lib/hadoop/tmp/2-1 14/09/25 09:23:51 INFO server.ApplicationMaster: Keypair with public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDqbdkV2hZIWX3U91EtBIDar7iQJS1WwIf2l8t6oPIB6ZYS+iC2W90SyuPSKdhLDrIG+1gRln6RShMUCRG1ZVoOhbtBY1XpoftCvJyyRRLsPVy7/O79M07iwMWSEXhkYhveUPZAzABoRgFLHaqPv5AQeBepBHAbDuOvPIPGiA7P0+4G94ZUUnMhmSbgSnpUKn605NQ1u309tpIReyVbKqz+2NdBj4hENhkdmDEMjiFy5KpixQocUey163Wwry14+HKFWf3pBCi89eLAJsn6O4O0SvYBou/Lih+t7AJcv/PyTQxgZsDUth4H5DcX5VRLjuyJL6OoCHG4EM2mtTx/+wL7 14/09/25 09:23:51 INFO server.ApplicationMaster: Generated public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDqbdkV2hZIWX3U91EtBIDar7iQJS1WwIf2l8t6oPIB6ZYS+iC2W90SyuPSKdhLDrIG+1gRln6RShMUCRG1ZVoOhbtBY1XpoftCvJyyRRLsPVy7/O79M07iwMWSEXhkYhveUPZAzABoRgFLHaqPv5AQeBepBHAbDuOvPIPGiA7P0+4G94ZUUnMhmSbgSnpUKn605NQ1u309tpIReyVbKqz+2NdBj4hENhkdmDEMjiFy5KpixQocUey163Wwry14+HKFWf3pBCi89eLAJsn6O4O0SvYBou/Lih+t7AJcv/PyTQxgZsDUth4H5DcX5VRLjuyJL6OoCHG4EM2mtTx/+wL7 14/09/25 09:23:52 INFO server.ApplicationMaster: Max mem capabililty of resources in this cluster 3072 14/09/25 09:23:52 INFO server.ApplicationMaster: Requested container ask: Capability[<memory:1024, vCores:1>]Priority[0] 14/09/25 09:23:52 INFO server.ApplicationMaster: Requested container ask: Capability[<memory:1024, vCores:1>]Priority[0] 14/09/25 09:23:52 INFO server.ApplicationMaster: Try to allocate 2 containers with heartbeat interval = 1000 ms. 
14/09/25 09:23:53 INFO impl.AMRMClientImpl: Received new token for : ip-172-31-42-163.ec2.internal:9103
14/09/25 09:23:53 INFO handler.MPIAMRMAsyncHandler: AcquiredContainer: Id=container_1411636003283_0002_01_000002, NodeId=ip-172-31-42-163.ec2.internal:9103, Host=ip-172-31-42-163.ec2.internal
14/09/25 09:23:53 INFO handler.MPIAMRMAsyncHandler: Current=1, Needed=2
14/09/25 09:23:54 INFO impl.AMRMClientImpl: Received new token for : ip-172-31-42-166.ec2.internal:9103
14/09/25 09:23:54 INFO handler.MPIAMRMAsyncHandler: AcquiredContainer: Id=container_1411636003283_0002_01_000003, NodeId=ip-172-31-42-163.ec2.internal:9103, Host=ip-172-31-42-163.ec2.internal
14/09/25 09:23:54 INFO handler.MPIAMRMAsyncHandler: AcquiredContainer: Id=container_1411636003283_0002_01_000004, NodeId=ip-172-31-42-166.ec2.internal:9103, Host=ip-172-31-42-166.ec2.internal
14/09/25 09:23:54 INFO handler.MPIAMRMAsyncHandler: Current=3, Needed=2
14/09/25 09:23:55 INFO server.ApplicationMaster: 2 containers allocated.
14/09/25 09:23:55 INFO server.ApplicationMaster: Launching command on a new container, containerId=container_1411636003283_0002_01_000002, containerNode=ip-172-31-42-163.ec2.internal:9103, containerNodeURI=ip-172-31-42-163.ec2.internal:9035, containerResourceMemory1024
14/09/25 09:23:55 INFO server.ApplicationMaster: Setting up container launch container for containerid=container_1411636003283_0002_01_000002
14/09/25 09:23:55 INFO server.ApplicationMaster: Set the environment for the application master
14/09/25 09:23:55 INFO server.ApplicationMaster: Setting up container command
14/09/25 09:23:55 INFO server.ApplicationMaster: Executing command: [${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.Container 1>/stdout 2>/stderr ]
14/09/25 09:23:55 INFO server.ApplicationMaster: Launching command on a new container, containerId=container_1411636003283_0002_01_000004, containerNode=ip-172-31-42-166.ec2.internal:9103, containerNodeURI=ip-172-31-42-166.ec2.internal:9035, containerResourceMemory1024
14/09/25 09:23:55 INFO server.ApplicationMaster: Setting up container launch container for containerid=container_1411636003283_0002_01_000004
14/09/25 09:23:55 INFO server.ApplicationMaster: Set the environment for the application master
14/09/25 09:23:55 INFO server.ApplicationMaster: Setting up container command
14/09/25 09:23:55 INFO server.ApplicationMaster: Executing command: [${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.Container 1>/stdout 2>/stderr ]
14/09/25 09:23:55 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1411636003283_0002_01_000002
14/09/25 09:23:55 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1411636003283_0002_01_000004
14/09/25 09:23:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-42-163.ec2.internal:9103
14/09/25 09:23:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-42-166.ec2.internal:9103
14/09/25 09:23:55 INFO handler.MPINMAsyncHandler: onContainerStarted invoked.
14/09/25 09:23:55 INFO handler.MPINMAsyncHandler: onContainerStarted invoked.
14/09/25 09:23:58 INFO handler.MPIAMRMAsyncHandler: CompletedContainer: Id=container_1411636003283_0002_01_000002
14/09/25 09:23:58 INFO handler.MPIAMRMAsyncHandler: CompletedContainer: Id=container_1411636003283_0002_01_000004
Sep 25, 2014 9:25:12 AM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
14/09/25 09:29:19 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:29:19 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:29:19 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:29:49 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:29:49 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:29:49 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:30:19 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:30:19 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:30:19 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:30:49 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:30:49 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:30:49 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:31:19 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:31:19 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:31:19 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:31:49 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:31:49 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:31:49 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:32:19 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:32:19 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:32:19 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:32:49 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:32:49 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:32:49 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:33:19 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:33:19 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:33:19 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:33:49 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:33:49 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:33:49 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:34:19 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:34:19 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:34:19 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second
14/09/25 09:34:49 INFO server.MPDListenerImpl: Try to report status.
14/09/25 09:34:49 INFO server.MPDListenerImpl: container_1411636003283_0002_01_000004 report status DISCONNECTED
14/09/25 09:34:49 ERROR server.TaskHeartbeatHandler: containerId:container_1411636003283_0002_01_000004 timed out after 300 second

Any ideas?

Thanks a lot, Markus

schmidb commented 9 years ago

I assume this is still an issue with keyless SSH. I submitted the job to YARN and, at the same time, checked one of the task nodes, expecting to find a new entry in .ssh/authorized_keys. But there is no new entry!

stevenybw commented 9 years ago

It seems that the container terminated abnormally, very likely because of keyless SSH. There was a slight change to the configuration file mpi-site.xml three days ago concerning the location of the authorized keys; before that change, the authorized_keys file was assumed to be /home/hadoop/.ssh/authorized_keys. Please add yarn.mpi.ssh.authorizedkeys.path to your configuration, referring to example_configuration/mpi-site.xml in the GitHub repository.
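Based on that description, the entry in mpi-site.xml would look roughly like the sketch below. The property name is the one mentioned above; the value shown is just the old default path also mentioned above, so adjust it to your cluster's layout:

```xml
<!-- mpi-site.xml: tell mpich2-yarn where each node's authorized_keys lives.
     The path below is the pre-change default; adjust it to your cluster
     (see example_configuration/mpi-site.xml in the repository). -->
<property>
  <name>yarn.mpi.ssh.authorizedkeys.path</name>
  <value>/home/hadoop/.ssh/authorized_keys</value>
</property>
```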

stevenybw commented 9 years ago

Better to back up your authorized_keys file before running this time :)

schmidb commented 9 years ago

Hi, I am using the new mpi-site.xml. Now I am a bit confused: do I have to set up keyless SSH manually, or is this done by YARN or mpich2-yarn?

stevenybw commented 9 years ago

Actually, this is done by mpich2-yarn. It generates an RSA key pair each time a new application is submitted and authorizes that key temporarily; after termination, the key is disabled again (deleted from authorized_keys).
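In shell terms, the lifecycle described above amounts to roughly the following sketch. The real implementation does this in Java inside the ApplicationMaster; the placeholder key line and temp-file path here are illustrative only, standing in for a real RSA key and the configured yarn.mpi.ssh.authorizedkeys.path:

```shell
# Sketch of the per-application key lifecycle described above.
# A temp file stands in for the node's authorized_keys, and a
# placeholder line stands in for the generated RSA public key.
WORK=$(mktemp -d)
AUTH_KEYS="$WORK/authorized_keys"
touch "$AUTH_KEYS"

# 1. A fresh public key is generated for this application run.
APP_PUBKEY="ssh-rsa AAAAexamplekeyforthisrun mpich2-yarn-app"

# 2. It is temporarily authorized on every allocated node.
printf '%s\n' "$APP_PUBKEY" >> "$AUTH_KEYS"

# 3. After the application terminates, it is revoked again
#    (grep exits nonzero when nothing remains, hence the || true).
grep -vF "$APP_PUBKEY" "$AUTH_KEYS" > "$AUTH_KEYS.tmp" || true
mv "$AUTH_KEYS.tmp" "$AUTH_KEYS"
```

This also explains the "Keypair with public key: ssh-rsa ..." lines in the ApplicationMaster log above: each run logs the freshly generated key it is about to authorize.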

schmidb commented 9 years ago

I still have the issue that I am running into these timeouts.

I am using the latest version from github.

I am using the new mpi-site.xml file including a correct yarn.mpi.ssh.authorizedkeys.path.

I do not see any RSA key files or new entries in the authorized_keys file. I assume the user hadoop tries to create them. Are any special permissions required there?
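One thing worth checking on the task nodes (my assumption, not confirmed by this thread): sshd silently ignores authorized_keys when the directory or file permissions are too open, which looks exactly like a key that was never added. A minimal sketch of the expected permissions, using a temp directory as a stand-in for /home/hadoop/.ssh:

```shell
# Permissions sshd expects before it honors authorized_keys.
# A temp dir stands in for the hadoop user's ~/.ssh on a task node.
SSH_DIR="$(mktemp -d)/.ssh"
mkdir -p "$SSH_DIR"
touch "$SSH_DIR/authorized_keys"
chmod 700 "$SSH_DIR"                  # directory: owner-only access
chmod 600 "$SSH_DIR/authorized_keys"  # key file: owner read/write only
stat -c '%a %n' "$SSH_DIR" "$SSH_DIR/authorized_keys"
```

The hadoop user also needs write access to the directory for mpich2-yarn to append the temporary key in the first place.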

Best Markus
