GeoscienceAustralia / eqrm

Automatically exported from code.google.com/p/eqrm

test_all failing on rhe-compute1. Works on Tornado. #10

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run test_all on rhe-compute1

What is the expected output? What do you see instead?

This might be two problems:
1. mkdir is failing.
2. MPI is failing.

Let's deal with the mkdir problem first, then discuss the MPI issue.
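
As a quick sanity check on the mkdir side, something like the following could confirm whether /tmp is writable and has free space before the MPI-based tests run. This is a minimal sketch, not EQRM code; the path and the free-space threshold are illustrative only.

    # Check that /tmp is writable and not full before running the MPI tests.
    import os
    import tempfile

    def tmp_is_usable(path="/tmp", min_free_bytes=50 * 1024 * 1024):
        """Return True if `path` is writable and has at least `min_free_bytes` free."""
        try:
            stats = os.statvfs(path)
            free = stats.f_bavail * stats.f_frsize
            if free < min_free_bytes:
                return False
            # Confirm a file can actually be created there (catches permission problems).
            with tempfile.NamedTemporaryFile(dir=path):
                pass
            return True
        except OSError:
            return False

    if __name__ == "__main__":
        print("/tmp usable:", tmp_is_usable())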

Here are the results:

dgray@rhe-compute1:/nas/gemd/georisk_models/earthquake/sandpits/duncan/EQRM/trunk/eqrm_core$ p all_test.py 

Testing path /nas/gemd/georisk_models/earthquake/sandpits/duncan/EQRM/trunk/eqrm_core/eqrm_code/..:

...............................................................................
[rhe-compute1.ga.gov.au:07988] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-dgray@rhe-compute1.ga.gov.au_0/43174) of (/tmp/openmpi-sessions-dgray@rhe-compute1.ga.gov.au_0/43174/0/0), mkdir failed [1]
[rhe-compute1.ga.gov.au:07988] [[43174,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 101
[rhe-compute1.ga.gov.au:07988] [[43174,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 425
[rhe-compute1.ga.gov.au:07988] [[43174,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 304
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[rhe-compute1.ga.gov.au:07988] [[43174,0],0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[rhe-compute1.ga.gov.au:07988] [[43174,0],0] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 323
[rhe-compute1.ga.gov.au:07985] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 381
[rhe-compute1.ga.gov.au:07985] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 143
[rhe-compute1.ga.gov.au:07985] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Unable to start a daemon on the local node (-128) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Unable to start a daemon on the local node" (-128) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[rhe-compute1.ga.gov.au:7985] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!

Original issue reported on code.google.com by duncan.g...@gmail.com on 21 Feb 2012 at 12:15

GoogleCodeExporter commented 9 years ago
Ben suggested that /tmp was full, and that turned out to be the problem.
David B. removed his temp files; disk usage dropped from 100% to 10%.
We are going to stop using /tmp.
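
One way to move away from /tmp, assuming Open MPI honours TMPDIR and MCA parameters set in the environment before MPI is initialised (the scratch path below is hypothetical, not part of EQRM), would be something like:

    # Point Open MPI's session directory somewhere other than /tmp.
    # This must run before the first MPI import/call in the process,
    # because the session directory is created during orte_init.
    import os

    scratch = os.path.expanduser("~/mpi_tmp")   # hypothetical scratch location
    if not os.path.isdir(scratch):
        os.makedirs(scratch)

    os.environ["TMPDIR"] = scratch
    os.environ["OMPI_MCA_orte_tmpdir_base"] = scratch

    # Only now import the MPI bindings, e.g.:
    # import pypar   # or whichever MPI binding the EQRM setup uses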

Original comment by duncan.g...@gmail.com on 21 Feb 2012 at 12:45