geodynamics / aspect

A parallel, extensible finite element code to simulate convection in both 2D and 3D models.
https://aspect.geodynamics.org/

Problems with tracers and MPI #160

Closed maxrudolph closed 8 years ago

maxrudolph commented 10 years ago

I've encountered crashes when using tracers with MPI on two or more processes. The simple parameter file below produces an error after several hundred (800-900) timesteps when run with two MPI processes. This occurred on Ubuntu 14.04 and also on OS X. I compiled deal.II and ASPECT with clang 3.4-lubuntu3. I'm using the Open MPI package bundled with Ubuntu, which uses gcc 4.8.2 by default; I set the OMPI_CXX environment variable so that the wrappers use clang, because compiling deal.II with gcc results in internal compiler errors.

------- Input file:

# At the top, we define the number of space dimensions we would like to
# work in:
set Dimension                              = 2

# There are several global variables that have to do with what
# time system we want to work in and what the end time is. We
# also designate an output directory.
set Use years in output instead of seconds = true
set End time                               = 3e9
set Output directory                       = output
set Resume computation             = false

# Then come a number of sections that deal with the setup
# of the problem to solve. The first one deals with the
# geometry of the domain within which we want to solve.
# The sections that follow all have the same basic setup
# where we select the name of a particular model (here,
# the box geometry) and then, in a further subsection,
# set the parameters that are specific to this particular
# model.
subsection Geometry model
  set Model name = box
  subsection Box
    set X periodic = false
    set X extent = 4.2e6
    set Y extent = 3e6
  end
end

# The following section deals with the discretization of
# this problem, namely the kind of mesh we want to compute
# on. We start from a globally refined mesh, add two levels
# of initial adaptive refinement, and then adapt the mesh
# every 3 time steps based on the temperature field.
subsection Mesh refinement
  set Initial global refinement          = 3
  set Initial adaptive refinement        = 2
  set Strategy                           = temperature
  set Time steps between mesh refinement = 3
  set Refinement fraction                = 0.3
  set Coarsening fraction                = 0.05
end

# The following two sections describe first the
# direction (vertical) and magnitude of gravity and the
# material model (i.e., density, viscosity, etc).
subsection Gravity model
  set Model name = vertical
  subsection Vertical
    set Magnitude = 9.81
  end
end

subsection Material model
  set Model name = simple
  subsection Simple model
    set Viscosity                     = 1.0E22
    set Thermal viscosity exponent    = 4.60517
    set Reference temperature         = 1250
    set Reference density             = 3300
  end
end

# 7.38e-12 W/kg yields mantle heat production of 22 TW
subsection Heating model
  set Model name = constant heating
  subsection Constant heating
    set Radiogenic heating rate = 7.38e-12
  end
end

# The next section deals with the initial conditions for the
# temperature (there are no initial conditions for the
# velocity variable since the velocity is assumed to always
# be in a static equilibrium with the temperature field).
# There are a number of models, with the 'function' model
# being a generic one that allows us to enter the actual initial
# conditions in the form of a formula that can contain
# constants. We choose a linear temperature profile that
# matches the boundary conditions defined below plus
# a small perturbation:
subsection Initial conditions
  set Model name = function
  subsection Function
    set Variable names      = x,y
    set Function constants  = p=-0.01, L=4.2e6, D=3e6, pi=3.1415926536, k=1, T_top=0, T_bottom=2500
    set Function expression = T_top + (T_bottom-T_top)*(1-(y/D) - p*sin(k*pi*x/L)*sin(pi*y/D))
  end
end

# We then also have to prescribe several other parts of the model
# such as which boundaries actually carry a prescribed boundary
# temperature (as described in the documentation of the `box'
# geometry, boundaries 2 and 3 are the bottom and top boundaries)
# whereas all other parts of the boundary are insulated (i.e.,
# no heat flux through these boundaries; this is also often used
# to specify symmetry boundaries).
subsection Model settings
  set Fixed temperature boundary indicators   = 2,3

  # The next parameters then describe on which parts of the
  # boundary we prescribe a zero or nonzero velocity and
  # on which parts the flow is allowed to be tangential.
  # Here, all four sides of the box allow tangential
  # unrestricted flow but with a zero normal component:
  set Zero velocity boundary indicators       =
  set Prescribed velocity boundary indicators =
  set Tangential velocity boundary indicators = 0,1,2,3
  set Remove nullspace = net x translation  

  # The final part of this section describes whether we
  # want to include adiabatic heating (from a small
  # compressibility of the medium) or from shear friction,
  # as well as the rate of internal heating. We do not
  # want to use any of these options here:
  set Include adiabatic heating               = false
  set Include shear heating                   = false
end

# Then follows a section that describes the boundary conditions
# for the temperature. The model we choose is called 'box' and
# allows to set a constant temperature on each of the four sides
# of the box geometry. In our case, we choose something that is
# heated from below and cooled from above. (Because only
# boundaries 2 and 3 carry a fixed temperature, see the Model
# settings section above, the temperature prescribed here at
# the left and right does not matter.)
subsection Boundary temperature model
  set Model name = box
  subsection Box
    set Bottom temperature = 2500
    set Top temperature    = 0
  end
end

# The final part is to specify what ASPECT should do with the
# solution once computed at the end of every time step. The
# process of evaluating the solution is called `postprocessing'
# and we choose to compute velocity and temperature statistics,
# statistics about the heat flux through the boundaries of the
# domain, basic statistics, and tracer advection, and to generate
# graphical output files for later visualization. These output
# files are created every time a time step crosses time points
# separated by 1e6 years.
subsection Postprocess
  set List of postprocessors = velocity statistics, temperature statistics, heat flux statistics, visualization, tracers, basic statistics
  subsection Visualization
    set Time between graphical output = 1e6
    set Output format = hdf5
    set List of output variables =  viscosity, density
  end
  subsection Tracers
    set Number of tracers = 1000
    set Time between data output = 1e6
    set Data output format = hdf5
  end
end

subsection Checkpointing
  set Steps between checkpoint = 200
end
------- Error messages

*** Timestep 37:  t=1.78152e+07 years
   Solving temperature system... 16 iterations.
   Rebuilding Stokes preconditioner...
   Solving Stokes system... 30+5 iterations.

   Postprocessing:
     RMS, max velocity:                  0.162 m/year, 0.56 m/year
     Temperature min/avg/max:            0 K, 1270 K, 2504 K
     Heat fluxes through boundary parts: 238 W, 302.3 W, -1.821e+05 W, 1.72e+04 W
     Advecting particles:                done

*** Timestep 38:  t=1.78973e+07 years
   Solving temperature system... 16 iterations.
   Rebuilding Stokes preconditioner...
   Solving Stokes system... 30+5 iterations.

   Postprocessing:

----------------------------------------------------
Exception on MPI process <1> while running postprocessor <
N6aspect11Postprocess14PassiveTracersILi2EEE>: 

----------------------------------------------------
Exception on MPI process <0
--------------------------------------------------------
An error occurred in line <732> of file </opt/aspect/include/aspect/particle/world.h> in function
    void aspect::Particle::World<2, aspect::Particle::BaseParticle<2> >::check_particle_count() [dim = 2, T = aspect::Particle::BaseParticle<2>]
The violated condition was: 
    global_particles==global_num_particles
The name and call sequence of the exception was:
    ExcMessage ("Particle count unexpectedly changed.")
Additional Information: 
Particle count unexpectedly changed.
--------------------------------------------------------
> while running postprocessor <
N6aspect11Postprocess14PassiveTracersILi2EEE>: 
Aborting!

--------------------------------------------------------
An error occurred in line <732> of file </opt/aspect/include/aspect/particle/world.h> in function
    void aspect::Particle::World<2, aspect::Particle::BaseParticle<2> >::check_particle_count() [dim = 2, T = aspect::Particle::BaseParticle<2>]
The violated condition was: 
    global_particles==global_num_particles
The name and call sequence of the exception was:
    ExcMessage ("Particle count unexpectedly changed.")
Additional Information: 
Particle count unexpectedly changed.
--------------------------------------------------------

Aborting!
----------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
----------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[merckx:28810] *** Process received signal ***
[merckx:28810] Signal: Aborted (6)
[merckx:28810] Signal code:  (-6)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[merckx:28811] *** Process received signal ***
[merckx:28811] Signal: Aborted (6)
[merckx:28811] Signal code:  (-6)
[merckx:28810] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36ff0) [0x7f59e7e07ff0]
[merckx:28810] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f59e7e07f79]
[merckx:28810] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f59e7e0b388]
[merckx:28810] [ 3] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_abort_verbose+0) [0x7f59e9149c09]
[merckx:28810] [ 4] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_abort_verbosef+0) [0x7f59e9149c9e]
[merckx:28810] [ 5] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_abort_collective+0) [0x7f59e9149dbb]
[merckx:28810] [ 6] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_abort_verbosev+0) [0x7f59e9149d46]
[merckx:28810] [ 7] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_memory_check+0xc1) [0x7f59e91494e2]
[merckx:28810] [ 8] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_package_unregister+0x3b) [0x7f59e914a0f6]
[merckx:28810] [ 9] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_finalize+0x3c) [0x7f59e914a69a]
[merckx:28810] [10] /usr/local/deal.II-dev/lib/libdeal_II.g.so.8.1.0(_ZN6dealii8internal5p4est12InitFinalize9SingletonD2Ev+0x2c) [0x7f59f0d6d84c]
[merckx:28810] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x3c509) [0x7f59e7e0d509]
[merckx:28810] [12] /lib/x86_64-linux-gnu/libc.so.6(+0x3c555) [0x7f59e7e0d555]
[merckx:28810] [13] /usr/lib/libmpi.so.1(orte_ess_base_app_abort+0x20) [0x7f59e7b0ac00]
[merckx:28810] [14] /usr/lib/libmpi.so.1(+0xba2a9) [0x7f59e7b0a2a9]
[merckx:28810] [15] /usr/lib/libmpi.so.1(ompi_mpi_abort+0x249) [0x7f59e7aa9b69]
[merckx:28810] [16] ./aspect(_ZN6aspect11Postprocess7ManagerILi2EE7executeERN6dealii12TableHandlerE+0x3ac) [0xb1a68c]
[merckx:28810] [17] ./aspect(_ZN6aspect9SimulatorILi2EE11postprocessEv+0xe3) [0xa3e513]
[merckx:28810] [18] ./aspect(_ZN6aspect9SimulatorILi2EE3runEv+0x705) [0xa3a335]
[merckx:28810] [19] ./aspect(main+0x53b) [0xbe0a4b]
[merckx:28810] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f59e7df2ec5]
[merckx:28810] [21] ./aspect() [0x837f26]
[merckx:28810] *** End of error message ***
[merckx:28811] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36ff0) [0x7f4b9de33ff0]
[merckx:28811] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f4b9de33f79]
[merckx:28811] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f4b9de37388]
[merckx:28811] [ 3] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_abort_verbose+0) [0x7f4b9f175c09]
[merckx:28811] [ 4] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_abort_verbosef+0) [0x7f4b9f175c9e]
[merckx:28811] [ 5] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_abort_collective+0) [0x7f4b9f175dbb]
[merckx:28811] [ 6] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_abort_verbosev+0) [0x7f4b9f175d46]
[merckx:28811] [ 7] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_memory_check+0xc1) [0x7f4b9f1754e2]
[merckx:28811] [ 8] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_package_unregister+0x3b) [0x7f4b9f1760f6]
[merckx:28811] [ 9] /opt/p4est-0.3.4.2/DEBUG/lib/libsc.so.0(sc_finalize+0x3c) [0x7f4b9f17669a]
[merckx:28811] [10] /usr/local/deal.II-dev/lib/libdeal_II.g.so.8.1.0(_ZN6dealii8internal5p4est12InitFinalize9SingletonD2Ev+0x2c) [0x7f4ba6d9984c]
[merckx:28811] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x3c509) [0x7f4b9de39509]
[merckx:28811] [12] /lib/x86_64-linux-gnu/libc.so.6(+0x3c555) [0x7f4b9de39555]
[merckx:28811] [13] /usr/lib/libmpi.so.1(orte_ess_base_app_abort+0x20) [0x7f4b9db36c00]
[merckx:28811] [14] /usr/lib/libmpi.so.1(+0xba2a9) [0x7f4b9db362a9]
[merckx:28811] [15] /usr/lib/libmpi.so.1(ompi_mpi_abort+0x249) [0x7f4b9dad5b69]
[merckx:28811] [16] ./aspect(_ZN6aspect11Postprocess7ManagerILi2EE7executeERN6dealii12TableHandlerE+0x3ac) [0xb1a68c]
[merckx:28811] [17] ./aspect(_ZN6aspect9SimulatorILi2EE11postprocessEv+0xe3) [0xa3e513]
[merckx:28811] [18] ./aspect(_ZN6aspect9SimulatorILi2EE3runEv+0x705) [0xa3a335]
[merckx:28811] [19] ./aspect(main+0x53b) [0xbe0a4b]
[merckx:28811] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f4b9de1eec5]
[merckx:28811] [21] ./aspect() [0x837f26]
[merckx:28811] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 28810 on node merckx exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[merckx:28809] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[merckx:28809] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
max@merckx:~/aspect$ 

gassmoeller commented 10 years ago

I have talked to Louise about this, and she will forward the problem to Eric, who has the most experience with the tracer code, so hopefully there will be some improvement in the near future.

ian-r-rose commented 10 years ago

At the very least we may want to remove the assertion that the particle count never changes. I'm not sure it is realistic to expect that we never lose a tracer, especially where velocity gradients are sharp.

gassmoeller commented 8 years ago

This should finally be closed with #411. @maxrudolph do you want to check? I never encountered these problems during my recent tests on more than a thousand cores. In the new version, particles are allowed to get lost, although this almost never happens. The number of particles is tracked in the statistics file.

maxrudolph commented 8 years ago

@gassmoeller Using the development versions of ASPECT and deal.II, it appears that this problem is solved! Thanks for contributing your vastly improved tracer code!