alibaba / mpich2-yarn

Running MPICH2 on Yarn

error in AppMaster.stderr #37

Open schmidb opened 10 years ago

schmidb commented 10 years ago

Hi,

Now I have time again to test mpich2-yarn on Amazon EMR, and I am running into the following problem:

AppMaster.stderr : Total file length is 11170 bytes.

...
14/10/16 11:43:55 INFO server.ApplicationMaster: Setting up container command
14/10/16 11:43:55 INFO server.ApplicationMaster: Executing command: [${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.Container 1>/stdout 2>/stderr ]
14/10/16 11:43:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-42-163.ec2.internal:9103
14/10/16 11:43:55 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1413458972800_0006_01_000003
14/10/16 11:43:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-42-165.ec2.internal:9103
14/10/16 11:43:55 INFO handler.MPINMAsyncHandler: onContainerStarted invoked.
14/10/16 11:43:55 INFO handler.MPINMAsyncHandler: onContainerStarted invoked.
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000003 report status INITIALIZED
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000002 report status INITIALIZED
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000003 report status ERROR_FINISHED
14/10/16 11:43:58 ERROR server.ApplicationMaster: error occurs while starting MPD
org.apache.hadoop.yarn.mpi.util.MPDException: Container container_1413458972800_0006_01_000003 error
    at org.apache.hadoop.yarn.mpi.server.MPDListenerImpl.isAllMPDStarted(MPDListenerImpl.java:121)
    at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.run(ApplicationMaster.java:733)
    at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.main(ApplicationMaster.java:170)
org.apache.hadoop.yarn.mpi.util.MPDException: Container container_1413458972800_0006_01_000003 error
    at org.apache.hadoop.yarn.mpi.server.MPDListenerImpl.isAllMPDStarted(MPDListenerImpl.java:121)
    at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.run(ApplicationMaster.java:733)
    at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.main(ApplicationMaster.java:170)
14/10/16 11:43:58 INFO server.ApplicationMaster: Application completed. Stopping running containers
14/10/16 11:43:58 INFO impl.ContainerManagementProtocolProxy: Closing proxy : ip-172-31-42-163.ec2.internal:9103
14/10/16 11:43:58 INFO impl.ContainerManagementProtocolProxy: Closing proxy : ip-172-31-42-165.ec2.internal:9103
14/10/16 11:43:58 INFO server.ApplicationMaster: Application completed. Signalling finish to RM
14/10/16 11:43:58 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
14/10/16 11:43:58 INFO server.ApplicationMaster: AMRM, NM two services stopped
14/10/16 11:43:58 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275)
14/10/16 11:43:58 INFO server.ApplicationMaster: Finalizing.
14/10/16 11:43:58 INFO server.ApplicationMaster: Application Master failed. exiting ...

Any idea how to fix this? Key-less SSH to the nodes now works.

Thanks Markus

schmidb commented 10 years ago

In the default output I see some 0.0.0.0 IPs after the yarn.resourcemanager lines. Is this correct?

14/10/16 11:43:41 INFO util.Utilities: PATH=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin
14/10/16 11:43:41 INFO util.Utilities: Checking conf is correct
14/10/16 11:43:41 INFO util.Utilities: yarn.resourcemanager.address=172.31.45.226:9022
14/10/16 11:43:41 INFO util.Utilities: yarn.resourcemanager.scheduler.address=172.31.45.226:9024
14/10/16 11:43:41 INFO util.Utilities: 0.0.0.0:8032=null
14/10/16 11:43:41 INFO util.Utilities: 0.0.0.0:8030=null
14/10/16 11:43:41 INFO util.Utilities: yarn.mpi.container.allocator=null
14/10/16 11:43:41 INFO util.Utilities: *****
14/10/16 11:43:41 INFO util.Utilities: Connecting to ResourceManager at /172.31.45.226:9022
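For context, 0.0.0.0:8032 and 0.0.0.0:8030 are the stock YARN defaults for yarn.resourcemanager.address and yarn.resourcemanager.scheduler.address, so the 0.0.0.0 lines above most likely just echo those defaults rather than indicate a live misconfiguration; the actual connection goes to 172.31.45.226:9022. If the ResourceManager addresses do need to be pinned explicitly, a minimal yarn-site.xml sketch (host and ports taken from the log above) would look like:

<!-- yarn-site.xml: pin the ResourceManager addresses explicitly -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>172.31.45.226:9022</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>172.31.45.226:9024</value>
</property>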

qiaohaijun commented 10 years ago

I met the same error output. My environment: NameNode HA, viewFS, RM HA.

qiaohaijun commented 10 years ago

Oh, is no one trying to solve this issue?

hadimansouri commented 9 years ago

Hi, is there any progress on this issue? I have the same problem. Thanks for any help!

rhl- commented 8 years ago

@schmidb, did you ever figure out what is going on here? I am seeing the same thing...

rhl- commented 8 years ago

Hi, I seem to have figured out what is going on here. The log you are looking at is NOT the container log; there is a separate container log. You can hack the lines that look like this:

-    vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout");
-    vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr");
+    vargs.add("1>" + "/tmp/stdout_"+ container.getId().toString());
+    vargs.add("2>" + "/tmp/stderr_"+ container.getId().toString());

Then you can read the container logs directly on the machine the container was deployed to.
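As an alternative to patching the redirect, the stock YARN CLI can usually fetch the same container logs after the application finishes; this is a sketch under the assumption that log aggregation is enabled on the cluster:

# Pull all container logs for the failed run; the application ID
# is taken from the AppMaster log above.
yarn logs -applicationId application_1413458972800_0006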

In my case, there was an issue with the SSH key exchange: essentially, mpi-site.xml needs to be present on all machines in the cluster. Once I did this, I got past this error.
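One way to do that distribution is a simple scp loop from the master; this is only a sketch, and both ~/cluster-hosts.txt and the /home/hadoop/conf directory are assumptions for illustration, not paths mandated by mpich2-yarn:

# Copy mpi-site.xml into the Hadoop conf dir on every node.
# ~/cluster-hosts.txt (one hostname per line) and /home/hadoop/conf are assumed paths.
for host in $(cat ~/cluster-hosts.txt); do
  scp /home/hadoop/conf/mpi-site.xml "$host":/home/hadoop/conf/
done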

zmoon111 commented 8 years ago

Where should I place mpi-site.xml on all the machines? Thanks.