geodynamics / aspect

A parallel, extensible finite element code to simulate convection in both 2D and 3D models.
https://aspect.geodynamics.org/

Visualization plugin hangs on some multi-node setups #749

Closed: spco closed this issue 8 years ago

spco commented 8 years ago

Running on a multi-node setup on a cluster can hang in the visualization plugin; the run is only interrupted by the cluster's walltime limits.

On my cluster, the typical error output is the following, for what it's worth:

anode120:UCM:746e:cb233700: 2081437717 us(2081437717 us!!!): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:746e:cb233700: 2081437754 us(37 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:746e:cb233700: 2081437784 us(30 us): dapl async_event QP (0x10fa3b0) Event 1
anode120:UCM:746c:cb233700: -287395973 us(-287395973 us): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:746c:cb233700: -287395932 us(41 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:746c:cb233700: -287395902 us(30 us): dapl async_event QP (0x10fa3b0) Event 1
anode120:UCM:7470:cb233700: -279412729 us(-279412729 us): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:7470:cb233700: -279412691 us(38 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:7470:cb233700: -279412605 us(86 us): dapl async_event QP (0x10fa3b0) Event 1

Using impi 4.1.3.048

See the discussion on the Aspect-devel mailing list from early August 2015 for more details.

bangerth commented 8 years ago

I'd love to see a backtrace that shows where it hangs.

Do you know how to do that? The idea is that you wait until it hangs, then log onto one of the nodes on which the job runs, start the debugger, and attach it to one of the processes running the program. In essence, it's like running the program in a debugger, except that you attach the debugger to an already running program. Once attached, you can call `backtrace` to see where it hangs.

The location where it hangs may be different for different processes.
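
For reference, here is a minimal sketch of that workflow, assuming `gdb` is installed on the compute nodes and that the executable shows up as `aspect` in the process list; node names and PIDs are of course specific to your cluster and job:

```sh
# Log onto one of the compute nodes the hanging job is running on
ssh anode120

# Find the MPI ranks of the program on this node
ps ax | grep aspect

# Attach to one of them, print the call stack, and detach again;
# replace <PID> with one of the process ids found above
gdb -p <PID> -batch -ex "backtrace"
```

Repeating the last command for each PID on each node gives one backtrace per process, which should show where the individual processes are stuck.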

tjhei commented 8 years ago

@bangerth you implemented the "write to tmp and mv" scheme for graphical output, right? Did you have evidence that this improved performance? I assume it was faster for you because /tmp is local whereas the final destination was on an NFS file system? My suggestion would be to not do this anymore, at least by default. We have had several people report problems with this (hangs, etc.), and I sometimes see these "WARNING: could not create temporary ..." messages too.

It is also easy to write files into a local directory instead of the NFS system (I create output/ as a symlink to a local directory). If the files are big enough to be a problem, one would want to use MPI IO (so use grouping>0) instead anyway.
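
For illustration, a minimal sketch of that symlink setup, assuming the cluster provides scratch space under /scratch/$USER (a made-up path; the right location depends on the machine's storage guidelines):

```sh
# Create a run directory on the fast/scratch file system
mkdir -p /scratch/$USER/aspect-output

# Make ASPECT's output directory a symlink pointing to it,
# so the code writes there instead of onto NFS
ln -s /scratch/$USER/aspect-output output
```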

Thoughts?

spco commented 8 years ago

I think I've done as you asked; I have no experience of gdb or MPI debugging! I ssh-ed into each node running the job, ran `ps ax | grep aspect`, then did `gdb -p <firstPID>`, `backtrace`, `detach`, then attached to `<nextPID>`, and so on.

Every time I attach, I get a lot of warnings that debug information is not found for many of the libraries.

Results are attached: debug_output.txt

The error file is empty, and the output file holds:

-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
--     . version 1.4.0-pre
--     . running in DEBUG mode
--     . running with 17 MPI processes
--     . using Trilinos
-----------------------------------------------------------------------------

Number of active cells: 256 (on 5 levels)
Number of degrees of freedom: 3,556 (2,178+289+1,089)

*** Timestep 0:  t=0 seconds
   Solving temperature system... 0 iterations.
   Rebuilding Stokes preconditioner...
   Solving Stokes system... 27 iterations.

   Postprocessing:
     RMS, max velocity: 1.79 m/s, 2.53 m/s
     Temperature min/avg/max: 0 K, 0.5 K, 1 K
     Heat fluxes through boundary parts: 4.724e-06 W, -4.724e-06 W, -1 W, 1 W
     Writing graphical output: output/solution-00000

*** Timestep 1:  t=0.0123322 seconds
   Solving temperature system...

I'm not sure this is too helpful, as I can't see any process that's not stuck in the preconditioner stage; please advise if I'm doing something wrong! I will also try again and see if I can get it to hang before it prints Timestep 1.

spco commented 8 years ago

I had a spare moment; running it again, it hangs at

*** Timestep 4:  t=0.0194467 seconds
   Solving temperature system... 16 iterations.
   Solving Stokes system... 25 iterations.

   Postprocessing:
     RMS, max velocity: 21.5 m/s, 30.4 m/s
     Temperature min/avg/max: 0 K, 0.5 K, 1 K
     Heat fluxes through boundary parts: 6.103e-05 W, -5.949e-05 W, -1.087 W, 1.087 W

*** Timestep 5:  t=0.0204739 seconds
   Solving temperature system... 14 iterations.
   Solving Stokes system... 24 iterations.

   Postprocessing:
     RMS, max velocity: 26.6 m/s, 37.8 m/s
     Temperature min/avg/max: 0 K, 0.5 K, 1 K
     Heat fluxes through boundary parts: 7.658e-05 W, -7.511e-05 W, -1.132 W, 1.132 W
     Writing graphical output: output/solution-00002

and output is here: debug_output2.txt

gassmoeller commented 8 years ago

I agree that we should make the default behaviour more robust against system-specific problems. Maybe we can turn "Write in background" and "Temporary file location" into input parameters, with the default behaviour being to write directly to the final destination, without using an additional thread? Then we could also drop all these fallback options in https://github.com/geodynamics/aspect/blob/master/source/postprocess/visualization.cc#L534 and simply fail with a useful error message in case something does not work.

bangerth commented 8 years ago

I'll admit that I'm confused. All processes seem to hang in `Epetra_BlockMap::SameAs`, but that makes no sense. There must be one process that is stuck somewhere else.

Does the problem reproduce if you run only two processes, but have them run on different machines? (Most schedulers allow you to specify that you want to run only one process per node.) In other words, is the problem that you're running on multiple nodes, or that you run on more than 16 cores?
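
In case it is useful, a minimal sketch of a job script for that experiment, assuming a SLURM scheduler and an input file named model.prm (both assumptions; the directives differ for PBS/Torque and other schedulers):

```sh
#!/bin/bash
#SBATCH --nodes=2             # request two nodes
#SBATCH --ntasks-per-node=1   # but place only one MPI rank on each node

# two ranks in total, one per node
mpirun -np 2 ./aspect model.prm
```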

bangerth commented 8 years ago

On 02/10/2016 10:50 AM, Timo Heister wrote:

> @bangerth you implemented the "write to tmp and mv" scheme for graphical output, right? Did you have evidence that this improved performance? I assume it was faster for you because /tmp is local whereas the final destination was on an NFS file system?

Yes. I don't think I have the data anymore, but it turned out that having 1000 processes write into the same directory on some NFS file server really brought down the system. I think this was back on the Brazos cluster.

> It is also easy to write files into a local directory instead of the NFS system (I create output/ as a symlink to a local directory). If the files are big enough to be a problem, one would want to use MPI IO (so use grouping>0) instead anyway.

How do you find a local directory? Or do you want to suggest that users set things up that way themselves?

bangerth commented 8 years ago

On 02/10/2016 12:27 PM, Rene Gassmöller wrote:

> I agree that we should make the default behaviour more robust against system-specific problems. Maybe we can turn "Write in background" and "Temporary file location" into input parameters, with the default behaviour being to write directly to the final destination, without using an additional thread? Then we could also drop all these fallback options in https://github.com/geodynamics/aspect/blob/master/source/postprocess/visualization.cc#L534 and simply fail with a useful error message in case something does not work.

Yes, that would certainly be feasible without too much trouble.

gassmoeller commented 8 years ago

Do you want to investigate the problem further and create the PR yourself, or should I go ahead and just create the input parameters?

tjhei commented 8 years ago

> How do you find a local directory? Or do you want to suggest that users set things up that way themselves?

The latter, of course. On every large machine I have been on there are guidelines regarding storage, and using NFS for large IO is never recommended, for good reasons. :-) Anyway, if you have 1000+ processes, you had better know where you are writing and whether you can use MPI IO.

bangerth commented 8 years ago

@gassmoeller -- if you have time, please go ahead. I won't get to it within the next few days for sure :-(

gassmoeller commented 8 years ago

Fixed by #752. @spco, if you agree please close this issue. I just do not want to steal your issue :wink: