geodynamics / aspect

A parallel, extensible finite element code to simulate convection in both 2D and 3D models.
https://aspect.geodynamics.org/

shell_simple_3d crashes #123

Closed · bangerth closed this issue 8 years ago

bangerth commented 10 years ago

When running shell_simple_3d.prm with 32 processors, it works for 16 time steps but then I get this:

```
*** Timestep 15:  t=7.83037e+06 years
   Solving temperature system... 21 iterations.
   Solving Stokes system... 30+14 iterations.

   Postprocessing:
     RMS, max velocity:                  0.0434 m/year, 0.162 m/year
     Temperature min/avg/max:            973 K, 1409 K, 1973 K
     Heat fluxes through boundary parts: -7.917e+12 W, 3.016e+12 W
     Writing depth average               output/depth_average.vtu

   Number of active cells: 74,324 (on 6 levels)
   Number of degrees of freedom: 3,053,696 (2,213,988+101,712+737,996)

*** Timestep 16:  t=8.35066e+06 years
   Solving temperature system... 19 iterations.
   Rebuilding Stokes preconditioner...
   Solving Stokes system... 30+19 iterations.

   Postprocessing:
     Writing graphical output:           output/solution-00008
     RMS, max velocity:                  0.00439 m/year, 0.0298 m/year
     Temperature min/avg/max:            973 K, 1408 K, 1973 K
     Heat fluxes through boundary parts: -7.68e+12 W, 2.922e+12 W

*** Timestep 17:  t=1.00836e+07 years
   Solving temperature system...

An error occurred in line <828> of file
</u/bangerth/p/deal.II/3/install-bottom/include/deal.II/lac/solver_gmres.h>
in function
    void dealii::SolverGMRES::solve(const MATRIX&, VECTOR&, const VECTOR&,
      const PRECONDITIONER&)
    [with MATRIX = dealii::TrilinosWrappers::SparseMatrix,
          PRECONDITIONER = dealii::TrilinosWrappers::PreconditionILU,
          VECTOR = dealii::TrilinosWrappers::MPI::Vector]
The violated condition was:
    false
The name and call sequence of the exception was:
    SolverControl::NoConvergence (this->control().last_step(),
                                  this->control().last_value())
Additional Information:
Iterative method reported convergence failure in step 737996
with residual 3.30962e+26
```

Does anyone else see this? I'm running with `mpirun -np 32 aspect ../Aspect/cookbooks/shell_simple_3d.prm`

ian-r-rose commented 10 years ago

I notice that the CFL number is rather large (1.0) for this parameter file, and that the crash happens on the second time step after a refinement. Is it possible that the bdf2 scheme is losing stability? That is to say, could the old_solution and the old_old_solution be sufficiently different across refinement levels that the quadratic extrapolation is not well behaved?

Did this cookbook work previously? A test of this theory would be (1) refining after, say, 5 timesteps instead of 15, or (2) turning off bdf2 and just using implicit Euler.
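A minimal sketch of test (1) against the cookbook file, assuming the `Time steps between mesh refinement` and `CFL number` parameters that the released parameter files use (the values are illustrative only); test (2) would, as far as I can tell, require a code change, since the time integrator is not selectable from the parameter file:

```
# Hedged sketch: refine more often so a refinement-related crash shows up
# earlier, and optionally reduce the time step size to test the CFL idea.
subsection Mesh refinement
  set Time steps between mesh refinement = 5
end

set CFL number = 0.5
```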

bangerth commented 10 years ago

I'm quite certain I ran the input file pretty much as is to generate the corresponding section in the manual...

tjhei commented 10 years ago

and I think we should be unconditionally stable so the CFL number should not be the problem.

This could be related to the change of the linear residual that we did.

bangerth commented 10 years ago

On 06/06/2014 01:52 PM, Timo Heister wrote:

> and I think we should be unconditionally stable so the CFL number should not be the problem.
>
> This could be related to the change of the linear residual that we did.

Good point. I've been running this on a branch here but would appreciate if anyone else could try to reproduce on their end. It's going to take a very long time because of the very large number of iterations it does in the last step -- you probably want to limit that manually...

W.



ian-r-rose commented 10 years ago

Wasn't the change to the linear residual only in the Stokes system? It appears to me that the failure is in the solve for the temperature system.

tjhei commented 10 years ago

I can reproduce the problem here (even with 4 CPUs). Changing how often we do AMR doesn't dramatically change the result. I still see the crash with 2 instead of 3 initial adaptive refinement steps (that doesn't make the problem much smaller, though).

tjhei commented 10 years ago

I managed to get the crash much sooner (timestep 6 with 900k DoFs): https://github.com/tjhei/aspect/blob/shell_3d_crash/cookbooks/shell_simple_3d.prm

ian-r-rose commented 10 years ago

I can confirm that Timo's simplified prm file crashes on my workstation with 4 cores.

tjhei commented 10 years ago

Simplified the file some more (still timestep 6, see the link in my last message). I tried removing the `constraints.distribute()` call after refinement, but that is not the cause of the crash. The velocities look rather strange right before the crash:

[image: velocity field right before the crash]

gassmoeller commented 10 years ago

I filled an hour of free time with some tests concerning this problem. While I have no solution, here are some observations and ways to circumvent the crash:

1. The theoretical/analytical solution of shell_simple_3d.prm should be exactly the one we use to calculate the initial residual (v=0, adiabatic pressure). Nonetheless, the Stokes solver appears to converge, although to non-useful velocities (this is not surprising though; it usually just means that numerical oscillations/errors will initiate the usual convection after a while). As soon as one provides even the slightest perturbation to the initial temperature field, everything seems to work fine (for example, use the harmonic perturbation initial condition with a perturbation magnitude of 1 K; see the sketch below).
2. The crash only occurs with adaptive meshing, and always in the advection solver in the time step after a time step with remeshing. More precisely, with Timo's input file the crash seems to occur after the first remeshing that resolves the whole inner thermal boundary layer with resolution 2. For this it is good to know that the solution to this parameter file is somewhat different with a global resolution of 1 than with 2 (vrms around 4 cm/year vs. 0.5 cm/year), and in the adaptive case the solution jumps from the first one to the second one right in the time step before the crash. Notice that the same jump in velocities occurs in Wolfgang's output above. One could argue that this is just a failed Stokes solve, but it looks like the same solution one gets with a static global resolution of 2, and there no crash happens. It therefore seems to be the jump in the solution, combined with the problems with the initial residual calculation for exactly this model setup, that is causing the trouble.
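A minimal sketch of the perturbation workaround mentioned in point 1; the subsection and parameter names below are written from memory and may differ between ASPECT versions:

```
# Hedged sketch: perturb the initial temperature by about 1 K so that the
# initial residual is not evaluated at the exact analytical solution.
subsection Initial conditions
  set Model name = harmonic perturbation

  subsection Harmonic perturbation
    set Magnitude = 1
  end
end
```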

tjhei commented 10 years ago

Interestingly enough, GMRES is unable to reduce the residual at all:

```
initial residual:                2.15188e+28
rhs norm:                        6.20482e+28
target tolerance:                6.20482e+16
residual in it 1:                1.98576e+28
residual after 1000 iterations:  1.98407e+28
```

If I switch from entropy viscosity to first order viscosity, everything is working fine. I am not sure if any of the old solution variables is wrong after mesh refinement, or if we for some reason do not apply enough stabilization.

tjhei commented 10 years ago

The entropy variation jumps up by a factor of 2 in this step, and the `extrapolated_advection_field_range` for the temperature is (973, 2398.36) instead of the (973, 1973) seen in every step before.

gassmoeller commented 10 years ago

Hmm, the velocity changes a lot between the old solution and the old-old solution, and the maximum entropy stabilization and the advection system residual are calculated with the mean velocity of the two. However, the velocity is decreasing, so in theory we should only get over-stabilization, not too little stabilization. With higher stabilization values for beta and cR one can delay the crash to later time steps, but it still occurs.
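For reference, a sketch of the kind of change meant by "higher stabilization values", assuming the entropy-viscosity constants are the ones exposed under the `Discretization` subsection; the numbers are illustrative only, not the values used in these tests:

```
# Hedged sketch: increase the entropy-viscosity stabilization constants.
# In the tests described above this only delays the crash, it does not prevent it.
subsection Discretization
  subsection Stabilization parameters
    set beta = 0.2
    set cR   = 0.5
  end
end
```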

tjhei commented 10 years ago

You know what the problem could be? old_old_solution is interpolated, so it might not be divergence free.

The increase of the max temperature in global_field_range is not the problem...

gassmoeller commented 10 years ago

OK, that jump in the entropy variation is not surprising if the `extrapolated_advection_field_range` changes, because we approximate the average temperature by taking the arithmetic mean of the maximal and minimal temperature, and (973+2398.36)/2 is much farther away from 1473 than (973+1973)/2 is. The interesting question is why the `extrapolated_advection_field_range` changes at all; that would require the old_solution and old_old_solution temperature values to differ substantially just because of the mesh refinement in the step before. By the way, do you know why we use an iterated trapezoidal quadrature rule for the estimation of the `extrapolated_advection_field_range`? I was just confused by the difference to the Gauss quadrature used otherwise in the assembly.
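Spelling out the arithmetic in that comparison (only the numbers already quoted above):

$$
\tfrac{1}{2}(973 + 1973) = 1473\ \mathrm{K},
\qquad
\tfrac{1}{2}(973 + 2398.36) \approx 1685.7\ \mathrm{K},
$$

so the min/max-based estimate of the average temperature jumps by roughly 213 K from one time step to the next, even though the actual temperature field barely changes.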

tjhei commented 10 years ago

This is how the entropy viscosity looks when the solution fails:

[image: entropy viscosity field at the failing time step]

The trapezoidal rule evaluates the solution exactly at the support points instead of finding the min/max of the quadratic solution. Think of a function in 1D with values 0, 1, 1 at x = 0, 1, 2 and a quadratic interpolation of these values. The quadratic function will overshoot and have a much larger maximum. Why we do it this way here, I am not sure...
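Working out that 1D example explicitly (pure interpolation arithmetic, nothing ASPECT-specific): the quadratic through the values 0, 1, 1 at x = 0, 1, 2 is

$$
p(x) = -\tfrac{1}{2}x^2 + \tfrac{3}{2}x,
$$

which attains its maximum p(3/2) = 9/8 = 1.125 between the support points, i.e. 12.5% above the largest nodal value, whereas evaluating only at the support points (as the iterated trapezoidal rule does) gives a maximum of exactly 1.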

gassmoeller commented 10 years ago

I know you wrote further up that the time stepping scheme should be unconditionally stable, but I would like to come back to that point for a moment. As far as I can see, we calculate the entropy stabilization with the old_solution, whereas in the assembly of the equation we use the extrapolated current_linearization_point velocities. We know the velocities have changed a lot between old_old_solution and old_solution, so the current_linearization_point will be quite different from old_solution (even more so because the velocities are much lower now, and the time step length is therefore three times larger than in the time step before). Additionally, if I use the old_solution to assemble the matrix for test purposes, the solver converges well. Maybe calculating the entropy viscosity with the old solution is not accurate enough in this particular case?
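For reference, the extrapolation used for the linearization point is, up to how the first time step is handled, the standard second-order formula built from the two previous solutions (this is an assumption about the exact form, not a quote from the code):

$$
u^{*} \;=\; \left(1 + \frac{\Delta t_n}{\Delta t_{n-1}}\right) u^{n-1} \;-\; \frac{\Delta t_n}{\Delta t_{n-1}}\, u^{n-2},
$$

so with the time step three times larger than before this becomes roughly u* ≈ 4 u^{n-1} − 3 u^{n-2}, and any jump between the two old solutions enters the linearization point amplified by a factor of 3 to 4.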

tjhei commented 9 years ago

Rene, so if you replace the extrapolated Stokes solution by old_solution in the entropy computation, it works?

ian-r-rose commented 9 years ago

Has anybody attempted to do a git bisect for this problem?

On Sun, Sep 14, 2014 at 5:08 PM, Timo Heister wrote:

> Rene, so if you replace the extrapolated Stokes solution by old_solution in the entropy computation, it works?
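For anyone who wants to attempt it, a sketch of such a bisection run; `<known-good-commit>` is a placeholder, since no working revision has been identified yet:

```
# Hypothetical bisection workflow against the reduced shell_simple_3d.prm.
git bisect start
git bisect bad HEAD                   # current revision crashes
git bisect good <known-good-commit>   # a revision believed to work

# At each step: rebuild, rerun the reduced test case, then report
git bisect good    # if the run gets past the problematic time step
git bisect bad     # if the temperature solver fails to converge

git bisect reset   # when finished
```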

gassmoeller commented 9 years ago

What I did so far was to replace the current_velocities in the matrix assembly in assembly.cc:1597 by the old velocities. This is clearly wrong (it changes our time stepping just there but nowhere else), but it is stable. Currently we use the old_velocities for calculating the entropy viscosity, but we might test replacing them with the current velocities. I think we did not do this because it is a kind of nonlinearity (basing the assembly of the matrix on an extrapolated quantity)? It is, however, a bit more work to replace all the calls to scratch.explicit_material_model by the normal material_model, and I am currently a bit short on schedule, so I can only look into it in two weeks or so.

@ian-r-rose: I do not think anybody has done a bisect yet; however, I faintly remember trying to run it with aspect-1.1, and the problem was already there. Might be wrong, though.

ian-r-rose commented 9 years ago

Not a proper bisection, but I get the same broken behavior as far back as the switch to CMake. I didn't go further back because I didn't want to switch build systems.

bangerth commented 9 years ago

I recall that the issue predates the switch to git for sure. If it even predates the switch to CMake, then it's indeed pretty old. I dimly recall having done a bisection at one point in the past and finding that the culprit was a completely innocent-looking patch -- it may have been the one where Timo changed the way we compute the relative tolerance for convergence. But I really can't say for sure any more. I think finding the patch that caused this would really help nail down the problem.

tjhei commented 9 years ago

I went back all the way to 538590559 (May 2013), fought a bit to make it compile, and it fails in the same way. So I don't have a revision where my reduced code is working.

bangerth commented 9 years ago

Is that before or after the section in the manual was added? I know I ran it when I wrote the section...

tjhei commented 9 years ago

Before. The file was added in February 2014:

```
$ git log --reverse -- cookbooks/shell_simple_3d.prm
commit 29c8681c39a2ea4216604892cc89470ca541be4f
Author: Wolfgang Bangerth <bangerth@math.tamu.edu>
Date:   Mon Feb 17 19:16:21 2014 +0000

    Annotate some parameter files. Add a couple of new ones. Move those that hav
```

What do you propose? Maybe it worked back then but only on a finer mesh?

tjhei commented 9 years ago

(I am now rerunning your original shell_simple_3d.prm at the revision where you checked it in; we'll see where that goes.)

tjhei commented 9 years ago

The original shell_simple_3d.prm (at b5241f39) crashes in time step 17 (as Wolfgang wrote in the first message of this thread). So either it never worked at this resolution, or the relevant changes were inside deal.II.

tjhei commented 9 years ago

Good news: it looks like it is working with the recent manifold changes (my file, 42 time steps and running). We need to do some more testing, though. :+1:

bangerth commented 9 years ago

Let's get a release out right now, quick, pronto, before anyone finds something that doesn't work! ;-)

bangerth commented 8 years ago

Confirmed to work now. Not fast, but that's because it's a big model I suppose.