schmidb opened this issue 10 years ago
In the default output I see some 0.0.0.0 IPs after the yarn.resourcemanager entries. Is this correct?
14/10/16 11:43:41 INFO util.Utilities: PATH=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin
14/10/16 11:43:41 INFO util.Utilities: Checking conf is correct
14/10/16 11:43:41 INFO util.Utilities: yarn.resourcemanager.address=172.31.45.226:9022
14/10/16 11:43:41 INFO util.Utilities: yarn.resourcemanager.scheduler.address=172.31.45.226:9024
14/10/16 11:43:41 INFO util.Utilities: 0.0.0.0:8032=null
14/10/16 11:43:41 INFO util.Utilities: 0.0.0.0:8030=null
14/10/16 11:43:41 INFO util.Utilities: yarn.mpi.container.allocator=null
14/10/16 11:43:41 INFO util.Utilities: *****
14/10/16 11:43:41 INFO util.Utilities: Connecting to ResourceManager at /172.31.45.226:9022
I hit the same error output. My environment: NameNode HA, viewFS, RM HA.
Is no one trying to solve this issue?
Hi, is there any progress on this issue? I have the same problem. Thanks for any help!
@schmidb, did you ever figure out what is going on here? I am seeing the same thing...
Hi, I seem to have figured out what is going on here. The log you are looking at is NOT the container log; there is a separate container log. You can hack the lines that look like this:
- vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout");
- vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr");
+ vargs.add("1>" + "/tmp/stdout_"+ container.getId().toString());
+ vargs.add("2>" + "/tmp/stderr_"+ container.getId().toString());
Then you can read the container logs on the machine the container was deployed to.
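For example (a hypothetical invocation, not from the source: the hostname and container ID are placeholders you would take from the ApplicationMaster log, and the /tmp paths come from the patch above):

```sh
# Inspect the redirected logs on the node that ran the container;
# substitute the real hostname and container ID from the AM log.
ssh <node-hostname> 'cat /tmp/stdout_<container_id> /tmp/stderr_<container_id>'
```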
In my case, there was an issue with the SSH key exchange, which was essentially that mpi-site.xml needs to be on all machines in the cluster. Once I did this, I got past this error. One way to distribute the file is sketched below.
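A minimal sketch of that distribution step (my assumptions, not from the source: passwordless ssh already works, a hypothetical nodes.txt lists one hostname per line, and $HADOOP_CONF_DIR points at the Hadoop conf dir on every machine; your paths may differ):

```sh
# Copy mpi-site.xml into the Hadoop conf dir on every node.
# nodes.txt is a hypothetical file listing each cluster host.
for host in $(cat nodes.txt); do
  scp "$HADOOP_CONF_DIR/mpi-site.xml" "$host:$HADOOP_CONF_DIR/"
done
```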
Where should I place mpi-site.xml on all the machines? Thanks.
Hi,
now I have time again to test mpich2-yarn on Amazon EMR. I have the following problem:
AppMaster.stderr (total file length: 11170 bytes):
...
14/10/16 11:43:55 INFO server.ApplicationMaster: Setting up container command
14/10/16 11:43:55 INFO server.ApplicationMaster: Executing command: [${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.Container 1>/stdout 2>/stderr ]
14/10/16 11:43:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-42-163.ec2.internal:9103
14/10/16 11:43:55 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1413458972800_0006_01_000003
14/10/16 11:43:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-42-165.ec2.internal:9103
14/10/16 11:43:55 INFO handler.MPINMAsyncHandler: onContainerStarted invoked.
14/10/16 11:43:55 INFO handler.MPINMAsyncHandler: onContainerStarted invoked.
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000003 report status INITIALIZED
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000002 report status INITIALIZED
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000003 report status ERROR_FINISHED
14/10/16 11:43:58 ERROR server.ApplicationMaster: error occurs while starting MPD
org.apache.hadoop.yarn.mpi.util.MPDException: Container container_1413458972800_0006_01_000003 error
at org.apache.hadoop.yarn.mpi.server.MPDListenerImpl.isAllMPDStarted(MPDListenerImpl.java:121)
at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.run(ApplicationMaster.java:733)
at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.main(ApplicationMaster.java:170)
org.apache.hadoop.yarn.mpi.util.MPDException: Container container_1413458972800_0006_01_000003 error
at org.apache.hadoop.yarn.mpi.server.MPDListenerImpl.isAllMPDStarted(MPDListenerImpl.java:121)
at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.run(ApplicationMaster.java:733)
at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.main(ApplicationMaster.java:170)
14/10/16 11:43:58 INFO server.ApplicationMaster: Application completed. Stopping running containers
14/10/16 11:43:58 INFO impl.ContainerManagementProtocolProxy: Closing proxy : ip-172-31-42-163.ec2.internal:9103
14/10/16 11:43:58 INFO impl.ContainerManagementProtocolProxy: Closing proxy : ip-172-31-42-165.ec2.internal:9103
14/10/16 11:43:58 INFO server.ApplicationMaster: Application completed. Signalling finish to RM
14/10/16 11:43:58 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
14/10/16 11:43:58 INFO server.ApplicationMaster: AMRM, NM two services stopped
14/10/16 11:43:58 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275)
14/10/16 11:43:58 INFO server.ApplicationMaster: Finalizing.
14/10/16 11:43:58 INFO server.ApplicationMaster: Application Master failed. exiting
...
Any idea how to fix that? Keyless SSH to the nodes works now.
Thanks, Markus