argiopetech / base

Bayesian Analysis for Stellar Evolution
http://webfac.db.erau.edu/~vonhippt/base9/

mpi error, possibly benign #11

Closed tedvh closed 10 years ago

tedvh commented 11 years ago

An mpiMcmc run on NGC 188 data produced the following output on my MacBook Pro:

puuoo> mpi_mcmc_ngc188.csh &
[1] 33274
puuoo> time stamp: Thu May 16 14:55:04 EDT 2013

running mcmc for 5000 + 10000 x 1 steps ...
time stamp: Thu May 16 14:55:04 EDT 2013

PID 33284 on Teds-MacBook-Pro.local ready for attach
PID 33285 on Teds-MacBook-Pro.local ready for attach
PID 33286 on Teds-MacBook-Pro.local ready for attach
PID 33279 on Teds-MacBook-Pro.local ready for attach
PID 33280 on Teds-MacBook-Pro.local ready for attach
PID 33281 on Teds-MacBook-Pro.local ready for attach
PID 33282 on Teds-MacBook-Pro.local ready for attach
PID 33283 on Teds-MacBook-Pro.local ready for attach

Choose a main sequence isochrone set:

  1. Girardi
  2. Chaboyer-Dotter (w/helium sampling)
  3. Yonsei-Yale
  4. DSED

Choose a filter set:

  1. Standard (UBVRIJHK)
  2. ACS
  3. SDSS (ugriz) + 2Mass (JHK)

Choose a white dwarf filter set:

  4. Wood
  5. Montgomery 0

Choose a white dwarf carbonicity (between 0.0 and 1.0):
Choose an initial-final mass relation:

  6. Weidemann
  7. Williams
  8. Salaris Linear
  9. Salaris Piecewise Linear

Choose a brown dwarf model set:

  10. None
  11. Baraffe

Reading models...
Models read.
Bayesian analysis of stellar evolution

puuoo> 10.6232 -5.86097 -20.9932 -6.56127
-5.86097 7.99048 8.63864 1.20933
-20.9932 8.63864 46.4654 13.8271
-6.56127 1.20933 13.8271 7.29877

puuoo> Acceptance ratio: 0.266333
[Teds-MacBook-Pro.local:33286] [[57174,1],7]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33279] [[57174,1],0]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33280] [[57174,1],1]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33282] [[57174,1],3]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33283] [[57174,1],4]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33284] [[57174,1],5]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33285] [[57174,1],6]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33281] [[57174,1],2]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)

mpirun has exited due to process rank 0 with PID 33279 on node Teds-MacBook-Pro.local exiting improperly. There are two reasons this could occur:

  1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
  2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

294862.023u 15009.330s 11:07:10.93 774.0% 0+0k 9+262io 1426pf+0w
time stamp: Fri May 17 02:02:15 EDT 2013

The output files look OK.
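As a rough sanity check on the timing line above, the utilization figure can be recomputed from the other fields. This is a minimal sketch that assumes the line uses the default tcsh `time` layout (user CPU seconds, system CPU seconds, elapsed wall-clock time, percent CPU); the function names are hypothetical and not part of BASE-9.

```cpp
#include <cassert>

// Convert an elapsed time given as hours:minutes:seconds into seconds.
double hms_to_seconds(int hours, int minutes, double seconds) {
    assert(hours >= 0 && minutes >= 0 && seconds >= 0.0);
    return hours * 3600.0 + minutes * 60.0 + seconds;
}

// CPU utilization as a percentage: total CPU time (user + system)
// divided by wall-clock time. Values above 100 mean more than one
// core was busy on average.
double cpu_utilization_percent(double user_s, double sys_s, double wall_s) {
    assert(wall_s > 0.0);
    return 100.0 * (user_s + sys_s) / wall_s;
}
```

Calling `cpu_utilization_percent(294862.023, 15009.330, hms_to_seconds(11, 7, 10.93))` gives roughly 774.08, which agrees with the reported 774.0%. That is about 7.7 cores busy on average over the 11-hour run, which is consistent with all 8 processes working for essentially the whole job before the errors appeared.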

tedvh commented 11 years ago

This happened again on my MacBook Pro, this time with only 3 processors instead of 8. The script and the dataset were different (the dataset was simulated). All output looks fine.

[Teds-MacBook-Pro.local:11342] [[29413,1],0]-[[29413,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:11343] [[29413,1],1]-[[29413,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:11344] [[29413,1],2]-[[29413,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)

mpirun has exited due to process rank 0 with PID 11342 on node Teds-MacBook-Pro.local exiting improperly. There are two reasons this could occur:

  1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
  2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

tedvh commented 11 years ago

I just remembered that I left my computer running the 3-processor job above but disconnected it from the internet. I thought that might be a useful piece of information.

argiopetech commented 11 years ago

Hurray for intermittent bugs. I'll play around on my machine and see if I can duplicate it. It seems very much like an MPI issue, though.

btracey commented 10 years ago

I have run into this issue as well (OS X 10.9). At least in my case, I believe it occurs when the computer lid is shut (putting the computer to sleep) and the program later tries to write to a file.

argiopetech commented 10 years ago

Thank you for the report, Brendan. I am actually leaning toward closing this issue, as the current version of the code (9.3.1) uses C++11 threads rather than MPI. Unless you're running in a clustered environment, I highly recommend updating.

When we re-implement clustered capabilities in (probably) 9.4, I'll check back into this.
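For readers unfamiliar with the thread-based approach mentioned above, here is a minimal sketch of running several independent samplers on C++11 threads with no MPI runtime involved. It is not BASE-9's actual implementation: the toy random-walk Metropolis sampler, the standard-normal target, the seeds, and the names `run_chain` and `run_chains` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <thread>
#include <vector>

// Toy random-walk Metropolis sampler targeting a standard normal density.
// Returns the fraction of proposals that were accepted.
double run_chain(unsigned seed, int n_steps, double step_size) {
    assert(n_steps > 0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> proposal(0.0, step_size);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double x = 0.0;
    int accepted = 0;
    for (int i = 0; i < n_steps; ++i) {
        double y = x + proposal(rng);
        // Log of the target density ratio p(y)/p(x) for N(0, 1).
        double log_ratio = 0.5 * (x * x - y * y);
        if (std::log(unif(rng)) < log_ratio) {
            x = y;
            ++accepted;
        }
    }
    return static_cast<double>(accepted) / n_steps;
}

// Run n_chains independent samplers, one per thread, and collect
// each chain's acceptance ratio.
std::vector<double> run_chains(int n_chains, int n_steps, double step_size) {
    std::vector<double> ratios(n_chains, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(n_chains);
    for (int c = 0; c < n_chains; ++c) {
        // Each thread writes only to its own slot, so no locking is needed.
        workers.emplace_back([&ratios, c, n_steps, step_size] {
            ratios[c] = run_chain(1000u + c, n_steps, step_size);
        });
    }
    // Joining every worker before returning guarantees a clean shutdown.
    for (auto& w : workers) {
        w.join();
    }
    return ratios;
}
```

A call such as `run_chains(8, 10000, 2.4)` would use eight threads in a single process. Because the workers communicate through shared memory rather than sockets, a design like this has no out-of-band TCP channel that could time out in the way the `mca_oob_tcp_msg_recv` errors above describe.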