geodynamics / aspect

A parallel, extensible finite element code to simulate convection in both 2D and 3D models.
https://aspect.geodynamics.org/
Other
217 stars 232 forks source link

Different advection system after restart #4088

Open anne-glerum opened 3 years ago

anne-glerum commented 3 years ago

Looking into an issue where after a restart the number of iterations for a Stokes solve differs greatly, I noticed that the rhs and the matrix of the advection solves also differ between a restarted timestep and the corresponding timestep of the original run.

For example, with the test mass_reaction_term.prm at higher resolution and tolerance, the norm of the rhs of the advection system differs 2 OOM after a restart in the first nonlinear iteration. I didn't expect this, and wonder if it can lead to the Stokes solver differences I see in more complex models. Should I look further into this difference?

This test also includes operator splitting, melting, iterated Advection and Stokes.

Image shows the fourth timestep for the original run on the right and the restarted run on the left with a mainline aspect apart from the edit to output the norm of the rhs and matrix. The rhs norm differs 2 OOM in the first nonlinear iteration. In the second iteration, the nr of iterations for the porosity solve also differs.

Screenshot 2021-06-23 at 14 52 19
gassmoeller commented 3 years ago

This looks like it could be a bug. Could you post a full log.txt to document versions and serial vs parallel?

This test also includes operator splitting, melting, iterated Advection and Stokes.

This would be a place to start, check if the differences happen without melting and/or operator splitting. Maybe we forget to serialize something necessary for the melt handler? Are the magnitudes comparable again at a later timestep?

anne-glerum commented 3 years ago
-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
--     . version 2.3.0-pre (master, c268d7d)
--     . using deal.II 9.3.0-rc1
--     .       with 32 bit indices and vectorization level 3 (512 bits)
--     . using Trilinos 12.18.1 
--     . using p4est 2.2.0
--     . running in DEBUG mode
--     . running with 96 MPI processes
-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
--   https://aspect.geodynamics.org/citing.html?ver=2.3.0-pre&melt=1&sha=c268d7d&src=code
-----------------------------------------------------------------------------
*** Resuming from snapshot!

Number of active cells: 40,960 (on 6 levels) 
Number of degrees of freedom: 1,372,681 (332,930+165,153+332,930+42,273+166,465+166,465+166,465)

*** Timestep 4:  t=5.85937e+13 seconds, dt=1.00419e+13 seconds
anne-glerum commented 3 years ago

I'll dig in a little deeper.

anne-glerum commented 3 years ago

Low res models (1 proc):

High res models (96 procs):

I'll check with another test without melt as well.

tjhei commented 3 years ago

Did you test if a very simple problem like convection box is also broken? Otherwise try to simplify as much as you can. I can then try to take a look.

anne-glerum commented 3 years ago

For the convection box, both the matrix and rhs are different after restart, but in the second timestep after restart, they have recovered. For the mass_reaction_term test, only the rhs is different after restart in the first iteration. See details below.

For the convection-box cookbook at Initial global refinement 6 and 1 proc and the T solve:

For the convection cookbook at Initial global refinement 7 and 4 procs:

For test mass_reaction_term and 96 procs for the porosity solve:

gassmoeller commented 3 years ago

@anne-glerum I was trying to reproduce this, but all results are identical for me. I did not check the norm of the rhs an matrix though, because I do not know where and how you computed that (could you post it?).

I attach the lightly modified convection-box cookbook I used, and the log.txt file that I got as a result. I diff'ed the statistics files before and after restart and they are identical for me. Maybe we can figure out the difference between my attached files, and the ones you tested?

convection-box.prm.txt log.txt

tjhei commented 3 years ago

Rene, are the linear solver iterations the same? Anne, is this with 9.2 and 9.3?

gassmoeller commented 3 years ago

Everything incl. linear solver iterations are the same, but I have not tested in parallel (but Anne saw differences in serial as well). Anne tested with 9.3.0-rc1 I tested with 9.3.0.

tjhei commented 3 years ago

Hm, rc1 should not matter. Maybe debug vs release?

anne-glerum commented 3 years ago

I don't see differences in linear iterations for the convection-box model either, only in the norms.

With the test mass_reaction_term on 1 proc in debug, the nonlinear residuals of the first nonlinear iteration differ and the number of linear iterations of the second porosity solve. It would be good if you could confirm that Rene, I've attached the prm. For this test, I used the old restart files instead of the most recent (i.e. copied the .old restart files to a new dir and removed the extension).

This is what I did to output the norms (I checked that all procs have the same output):

diff --git a/source/simulator/assembly.cc b/source/simulator/assembly.cc
index dccc1d5..b124cee 100644
--- a/source/simulator/assembly.cc
+++ b/source/simulator/assembly.cc
@@ -816,6 +816,8 @@ namespace aspect

     system_matrix.compress(VectorOperation::add);
     system_rhs.compress(VectorOperation::add);
+    pcout << "Norm Stokes system matrix " << system_matrix.frobenius_norm() << std::endl;
+    pcout << "Norm Stokes system rhs " << system_rhs.l2_norm() << std::endl;

     // If we change the system_rhs, matrix-free Stokes must update
     if (stokes_matrix_free)
@@ -1241,6 +1243,8 @@ namespace aspect

     system_matrix.compress(VectorOperation::add);
     system_rhs.compress(VectorOperation::add);
+    pcout << "Norm advection system matrix " << system_matrix.frobenius_norm() << std::endl;
+    pcout << "Norm advection system rhs " << system_rhs.l2_norm() << std::endl;
   }
 }

reaction_master_1proc_restart.prm.txt reaction_master_1proc.prm.txt

gassmoeller commented 3 years ago

Ok, I see it too in the model you attached. Will investigate further later today.

gassmoeller commented 3 years ago

4096 should fix this. Take care that there are two independent issues here:

  1. The difference in the norm of the rhs and matrix that you see is expected and not a bug. You compute the norm of the whole matrix and the whole rhs vector (not just the current field), and so before the restart some other blocks of the matrix and rhs may still contain information from the last timesteps that is overwritten later, while after the restart these blocks are empty. I replaced your debug output with the following, and only saw very tiny differences in the porosite field rhs afterwards:
    
    diff --git a/source/simulator/assembly.cc b/source/simulator/assembly.cc
    index 315a78f7b..c2cd5b06a 100644
    --- a/source/simulator/assembly.cc
    +++ b/source/simulator/assembly.cc
    @@ -817,6 +817,10 @@ namespace aspect
     system_matrix.compress(VectorOperation::add);
     system_rhs.compress(VectorOperation::add);
  1. There is a real bug here that causes the difference in the iteration numbers and the residuals in melt models. That should be fixed by #4096, please give that a try.

If this fixes the issue it should become part of 2.3.

anne-glerum commented 3 years ago

Thanks @gassmoeller !

  1. That makes sense, sorry for the side track.
  2. I still see some differences after a restart with a custom melt model.
    1. The nr of iterations for the field molar_Fe_in_melt is 18 instead of 17
    2. The norm of the Stokes matrix (your implementation) is 1.68268e+27 instead of 1.68267e+27
    3. The molar_Fe_in_melt residual is 1.79765e-16 instead of 1.76258e-16
    4. The Stokes residual is 0.0112888 instead of 0.011289.
  3. For the mass_reaction_term test model, it seems the nr of iterations and the norms are the same, but the residuals differ slightly after the second nonlinear iteration, e.g.
    Relative nonlinear residuals (temperature, compositional fields, Stokes system): 2.24496e-16, 5.89794e-12, 1.32699e-12, 6.18568e-11
    Relative nonlinear residuals (temperature, compositional fields, Stokes system): 2.26184e-16, 5.89793e-12, 1.32699e-12, 6.18543e-11
anne-glerum commented 3 years ago

I'll rebase my custom branch to master (and keep the linearization point fix) and try again.

gassmoeller commented 3 years ago

Can you try replacing the fix initialize_current_linearization_point(); with current_linearization_point = solution; and check if that fixes the remaining issues? Otherwise I am at a loss what is happening at the moment. However, the differences seem tiny, so maybe we can investigate further after the release.

anne-glerum commented 3 years ago

Using current_linearization_point = solution; gives identical timesteps before and after restart for both master and the rebased custom branch for the mass_reaction_term.prm. However, the custom melt model & prm with the custom branch lead to Stokes iteration nr differences of 99 vs 34. Apparently, the custom branch contains another issue.

Detailed results:

Current master, mass_reaction_term.prm on 1 proc, second NI after restart:

      Relative nonlinear residuals (temperature, compositional fields, Stokes system): 1.96848e-16, 7.05579e-06, 4.44836e-06, 0.000114977
      Relative nonlinear residuals (temperature, compositional fields, Stokes system): 1.97108e-16, 7.05579e-06, 4.44836e-06, 0.000114977

Custom melt branch rebased to current master, 1 proc: Similar residual differences as above

Master with current_linearization_point = solution; Identical timesteps.

Custom branch with current_linearization_point = solution; Identical timesteps

Custom branch + custom melt model + custom prm: Differences in nr of iterations of molar_Fe_in_melt and Stokes (99 vs 34)

anne-glerum commented 2 years ago

Hi @gassmoeller , I've returned to this issue and confirm that it still exists. If I restart a model after 2 timesteps with my custom melt branch there is a difference in the nr of inner iterations in the first nonlinear iteration. I've done the following:

In summary, the difference goes away for either of the following: 1) The mesh is uniform. 2) No particles are used. 3) When all boundaries are free slip.

I've attached the simplest prm that demonstrates the issue when restarting with the old_restart files, so that timestep 2 is repeated. Here the nr of Stokes iterations is either 20 or 21. It does require the melt branch melt_visco_plastic_PM01 and has 1.6 MDOFs.

I'm continuing with a uniform mesh form now, but I'd appreciate any suggestions to investigate this further.

PM01_SB_11_CP_p2_2timesteps_pp_nsw_nLwr_noAMR_tanlrb.txt