Closed: tedvh closed this issue 10 years ago
It happened again on my MacBook Pro, this time with only 3 processors instead of 8. The script and the dataset were different (the dataset was simulated). All output looks fine.
[Teds-MacBook-Pro.local:11342] [[29413,1],0]-[[29413,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:11343] [[29413,1],1]-[[29413,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
mpirun has exited due to process rank 0 with PID 11342 on node Teds-MacBook-Pro.local exiting improperly. There are two reasons this could occur:
This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
I just remembered that I left my computer running the 3-processor job above but disconnected it from the internet. That might be a useful piece of information.
Hurray for intermittent bugs. I'll play around on my machine and see if I can duplicate it. It seems very much like an MPI issue, though.
I have run into this issue as well (OS X 10.9). At least in my case, I believe it occurs when the computer lid is shut (putting the computer to sleep) and the program later tries to write to a file.
Thank you for the report, Brendan. I am actually leaning toward closing this issue, as the current version of the code (9.3.1) uses C++11 threads rather than MPI. Unless you're running in a clustered environment, I highly recommend updating.
When we re-implement clustered capabilities in (probably) 9.4, I'll check back into this.
An mpiMcmc run on NGC 188 data gave the following result on my MacBook Pro:
puuoo> mpi_mcmc_ngc188.csh &
[1] 33274
puuoo> time stamp: Thu May 16 14:55:04 EDT 2013
running mcmc for 5000 + 10000 x 1 steps ... time stamp: Thu May 16 14:55:04 EDT 2013
PID 33284 on Teds-MacBook-Pro.local ready for attach
PID 33285 on Teds-MacBook-Pro.local ready for attach
PID 33286 on Teds-MacBook-Pro.local ready for attach
PID 33279 on Teds-MacBook-Pro.local ready for attach
PID 33280 on Teds-MacBook-Pro.local ready for attach
PID 33281 on Teds-MacBook-Pro.local ready for attach
PID 33282 on Teds-MacBook-Pro.local ready for attach
PID 33283 on Teds-MacBook-Pro.local ready for attach
Choose a main sequence isochrone set:
puuoo> 10.6232 -5.86097 -20.9932 -6.56127
-5.86097 7.99048 8.63864 1.20933
-20.9932 8.63864 46.4654 13.8271
-6.56127 1.20933 13.8271 7.29877
puuoo> Acceptance ratio: 0.266333
[Teds-MacBook-Pro.local:33286] [[57174,1],7]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33279] [[57174,1],0]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33280] [[57174,1],1]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33282] [[57174,1],3]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33283] [[57174,1],4]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33284] [[57174,1],5]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33285] [[57174,1],6]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
[Teds-MacBook-Pro.local:33281] [[57174,1],2]-[[57174,0],0] mca_oob_tcp_msg_recv: readv failed: Operation timed out (60)
mpirun has exited due to process rank 0 with PID 33279 on node Teds-MacBook-Pro.local exiting improperly. There are two reasons this could occur:
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
294862.023u 15009.330s 11:07:10.93 774.0% 0+0k 9+262io 1426pf+0w
time stamp: Fri May 17 02:02:15 EDT 2013
The output files look OK.
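For anyone reading the csh `time` summary in the log above, here is a rough sketch of how its fields relate. The first three fields are user CPU seconds, system CPU seconds, and elapsed wall-clock time; the percentage is approximately (user + system) / elapsed. The helper name `parse_elapsed` is made up for illustration, and the interpretation in the comments is my reading of the numbers, not a confirmed diagnosis.

```python
def parse_elapsed(text):
    """Convert an [hh:]mm:ss.ss elapsed-time string into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the csh `time` line in the log.
user_s = 294862.023      # CPU seconds spent in user mode
sys_s = 15009.330        # CPU seconds spent in kernel mode
elapsed_s = parse_elapsed("11:07:10.93")

# Ratio of CPU time to wall-clock time, i.e. the average number of busy cores.
cpu_ratio = (user_s + sys_s) / elapsed_s

print(f"elapsed: {elapsed_s:.2f} s")
print(f"average busy cores: {cpu_ratio:.2f}")
```

The ratio comes out a little under 8, which matches the eight ranks (0 through 7) in the log. That suggests the processes were busy for nearly the whole run and that the readv timeouts surfaced only at the end, which would be consistent with the output files looking OK.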