anne-glerum commented 3 years ago

Looking into an issue where after a restart the number of iterations for a Stokes solve differs greatly, I noticed that the rhs and the matrix of the advection solves also differ between a restarted timestep and the corresponding timestep of the original run.

For example, with the test mass_reaction_term.prm at higher resolution and tolerance, the norm of the rhs of the advection system differs 2 OOM after a restart in the first nonlinear iteration. I didn't expect this, and wonder if it can lead to the Stokes solver differences I see in more complex models. Should I look further into this difference?

This test also includes operator splitting, melting, iterated Advection and Stokes.

Image shows the fourth timestep for the original run on the right and the restarted run on the left with a mainline aspect apart from the edit to output the norm of the rhs and matrix. The rhs norm differs 2 OOM in the first nonlinear iteration. In the second iteration, the nr of iterations for the porosity solve also differs.

gassmoeller commented 3 years ago

This looks like it could be a bug. Could you post a full log.txt to document versions and serial vs parallel?

This test also includes operator splitting, melting, iterated Advection and Stokes.

This would be a place to start, check if the differences happen without melting and/or operator splitting. Maybe we forget to serialize something necessary for the melt handler? Are the magnitudes comparable again at a later timestep?

anne-glerum commented 3 years ago

-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
--     . version 2.3.0-pre (master, c268d7d)
--     . using deal.II 9.3.0-rc1
--     .       with 32 bit indices and vectorization level 3 (512 bits)
--     . using Trilinos 12.18.1 
--     . using p4est 2.2.0
--     . running in DEBUG mode
--     . running with 96 MPI processes
-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
--   https://aspect.geodynamics.org/citing.html?ver=2.3.0-pre&melt=1&sha=c268d7d&src=code
-----------------------------------------------------------------------------
*** Resuming from snapshot!

Number of active cells: 40,960 (on 6 levels) 
Number of degrees of freedom: 1,372,681 (332,930+165,153+332,930+42,273+166,465+166,465+166,465)

*** Timestep 4:  t=5.85937e+13 seconds, dt=1.00419e+13 seconds

anne-glerum commented 3 years ago

I'll dig in a little deeper.

anne-glerum commented 3 years ago

Low res models (1 proc):

On one proc with a smaller problem size, the norm of the rhs still differs ~1 OOM.
When melt transport is turned off, both the rhs and the matrix differ, by 17 and 2 OOM, resp.
Similar differences when operator splitting is turned off

High res models (96 procs):

The number of Stokes iterations after restart differs, and then the timestep size differs.
In the second timestep after restart, the norm of matrix and rhs differ only after the comma.

I'll check with another test without melt as well.

tjhei commented 3 years ago

Did you test if a very simple problem like convection box is also broken? Otherwise try to simplify as much as you can. I can then try to take a look.

anne-glerum commented 3 years ago

For the convection box, both the matrix and rhs are different after restart, but in the second timestep after restart, they have recovered. For the mass_reaction_term test, only the rhs is different after restart in the first iteration. See details below.

For the convection-box cookbook at Initial global refinement 6 and 1 proc and the T solve:

original timestep 4: Norm advection system matrix 1269.29 Norm advection system rhs 49.9594
restart at timestep 4: Norm advection system matrix 0.47286 Norm advection system rhs 0.0156265
restart timestep 5: Norm advection system matrix 1269.29 Norm advection system rhs 49.9591

For the convection cookbook at Initial global refinement 7 and 4 procs:

original timestep 19 (&20): Norm advection system matrix 2535.69 Norm advection system rhs 25.0195
restart at timestep 19: Norm advection system matrix 0.2356 Norm advection system rhs 0.00619465
restart timestep 20: Norm advection system matrix 2535.69 Norm advection system rhs 25.0195

For test mass_reaction_term and 96 procs for the porosity solve:

original timestep 40: First nonlinear iteration: Norm advection system matrix 2.63238e+14 Norm advection system rhs 3.58044e+19 Second nonlinear iteration: Norm advection system matrix 1.2622e+43 Norm advection system rhs 3.58043e+19
restart at timestep 40: First NL iteration: Norm advection system matrix 2.63238e+14 Norm advection system rhs 2.59769e+17 Second nonlinear iteration: Norm advection system matrix 1.26165e+43 Norm advection system rhs 3.58043e+19
restart timestep 41, first NL iteration: Norm advection system matrix 3.64846e+14 Norm advection system rhs 3.58047e+19

gassmoeller commented 3 years ago

@anne-glerum I was trying to reproduce this, but all results are identical for me. I did not check the norm of the rhs an matrix though, because I do not know where and how you computed that (could you post it?).

I attach the lightly modified convection-box cookbook I used, and the log.txt file that I got as a result. I diff'ed the statistics files before and after restart and they are identical for me. Maybe we can figure out the difference between my attached files, and the ones you tested?

convection-box.prm.txt log.txt

tjhei commented 3 years ago

Rene, are the linear solver iterations the same? Anne, is this with 9.2 and 9.3?

gassmoeller commented 3 years ago

Everything incl. linear solver iterations are the same, but I have not tested in parallel (but Anne saw differences in serial as well). Anne tested with 9.3.0-rc1 I tested with 9.3.0.

tjhei commented 3 years ago

Hm, rc1 should not matter. Maybe debug vs release?

anne-glerum commented 3 years ago

I don't see differences in linear iterations for the convection-box model either, only in the norms.

With the test mass_reaction_term on 1 proc in debug, the nonlinear residuals of the first nonlinear iteration differ and the number of linear iterations of the second porosity solve. It would be good if you could confirm that Rene, I've attached the prm. For this test, I used the old restart files instead of the most recent (i.e. copied the .old restart files to a new dir and removed the extension).

This is what I did to output the norms (I checked that all procs have the same output):

diff --git a/source/simulator/assembly.cc b/source/simulator/assembly.cc
index dccc1d5..b124cee 100644
--- a/source/simulator/assembly.cc
+++ b/source/simulator/assembly.cc
@@ -816,6 +816,8 @@ namespace aspect

     system_matrix.compress(VectorOperation::add);
     system_rhs.compress(VectorOperation::add);
+    pcout << "Norm Stokes system matrix " << system_matrix.frobenius_norm() << std::endl;
+    pcout << "Norm Stokes system rhs " << system_rhs.l2_norm() << std::endl;

     // If we change the system_rhs, matrix-free Stokes must update
     if (stokes_matrix_free)
@@ -1241,6 +1243,8 @@ namespace aspect

     system_matrix.compress(VectorOperation::add);
     system_rhs.compress(VectorOperation::add);
+    pcout << "Norm advection system matrix " << system_matrix.frobenius_norm() << std::endl;
+    pcout << "Norm advection system rhs " << system_rhs.l2_norm() << std::endl;
   }
 }

reaction_master_1proc_restart.prm.txt reaction_master_1proc.prm.txt

gassmoeller commented 3 years ago

Ok, I see it too in the model you attached. Will investigate further later today.

gassmoeller commented 3 years ago

4096 should fix this. Take care that there are two independent issues here:

The difference in the norm of the rhs and matrix that you see is expected and not a bug. You compute the norm of the whole matrix and the whole rhs vector (not just the current field), and so before the restart some other blocks of the matrix and rhs may still contain information from the last timesteps that is overwritten later, while after the restart these blocks are empty. I replaced your debug output with the following, and only saw very tiny differences in the porosite field rhs afterwards:
```
diff --git a/source/simulator/assembly.cc b/source/simulator/assembly.cc
index 315a78f7b..c2cd5b06a 100644
--- a/source/simulator/assembly.cc
+++ b/source/simulator/assembly.cc
@@ -817,6 +817,10 @@ namespace aspect
 system_matrix.compress(VectorOperation::add);
 system_rhs.compress(VectorOperation::add);
```

const unsigned int block_idx = 0;
pcout << "Norm Stokes system matrix " << system_matrix.block(block_idx,block_idx).frobenius_norm() << std::endl;
pcout << "Norm Stokes system rhs " << system_rhs.block(block_idx).l2_norm() << std::endl;
// If we change the system_rhs, matrix-free Stokes must update if (stokes_matrix_free) { @@ -1241,6 +1245,9 @@ namespace aspect

system_matrix.compress(VectorOperation::add); system_rhs.compress(VectorOperation::add);
pcout << "Norm advection system matrix " << system_matrix.block(block_idx,block_idx).frobenius_norm() << std::endl;
pcout << "Norm advection system rhs " << system_rhs.block(block_idx).l2_norm() << std::endl; } }

There is a real bug here that causes the difference in the iteration numbers and the residuals in melt models. That should be fixed by #4096, please give that a try.

If this fixes the issue it should become part of 2.3.

anne-glerum commented 3 years ago

Thanks @gassmoeller !

That makes sense, sorry for the side track.
I still see some differences after a restart with a custom melt model.
1. The nr of iterations for the field molar_Fe_in_melt is 18 instead of 17
2. The norm of the Stokes matrix (your implementation) is 1.68268e+27 instead of 1.68267e+27
3. The molar_Fe_in_melt residual is 1.79765e-16 instead of 1.76258e-16
4. The Stokes residual is 0.0112888 instead of 0.011289.

For the mass_reaction_term test model, it seems the nr of iterations and the norms are the same, but the residuals differ slightly after the second nonlinear iteration, e.g.

Relative nonlinear residuals (temperature, compositional fields, Stokes system): 2.24496e-16, 5.89794e-12, 1.32699e-12, 6.18568e-11
Relative nonlinear residuals (temperature, compositional fields, Stokes system): 2.26184e-16, 5.89793e-12, 1.32699e-12, 6.18543e-11

anne-glerum commented 3 years ago

I'll rebase my custom branch to master (and keep the linearization point fix) and try again.

gassmoeller commented 3 years ago

Can you try replacing the fix initialize_current_linearization_point(); with current_linearization_point = solution; and check if that fixes the remaining issues? Otherwise I am at a loss what is happening at the moment. However, the differences seem tiny, so maybe we can investigate further after the release.

anne-glerum commented 3 years ago

Using current_linearization_point = solution; gives identical timesteps before and after restart for both master and the rebased custom branch for the mass_reaction_term.prm. However, the custom melt model & prm with the custom branch lead to Stokes iteration nr differences of 99 vs 34. Apparently, the custom branch contains another issue.

Detailed results:

Current master, mass_reaction_term.prm on 1 proc, second NI after restart:

      Relative nonlinear residuals (temperature, compositional fields, Stokes system): 1.96848e-16, 7.05579e-06, 4.44836e-06, 0.000114977
      Relative nonlinear residuals (temperature, compositional fields, Stokes system): 1.97108e-16, 7.05579e-06, 4.44836e-06, 0.000114977

Custom melt branch rebased to current master, 1 proc: Similar residual differences as above

Master with current_linearization_point = solution; Identical timesteps.

Custom branch with current_linearization_point = solution; Identical timesteps

Custom branch + custom melt model + custom prm: Differences in nr of iterations of molar_Fe_in_melt and Stokes (99 vs 34)

anne-glerum commented 2 years ago

Hi @gassmoeller , I've returned to this issue and confirm that it still exists. If I restart a model after 2 timesteps with my custom melt branch there is a difference in the nr of inner iterations in the first nonlinear iteration. I've done the following:

Rebased my custom branch (https://github.com/anne-glerum/aspect/tree/melt_visco_plastic_PM01) to main and installed it based on a recent deal.ii dev version.
Ran boukare_bulk_composition.prm and boukare_latent_heat.prm at higher resolution (on 3 nodes) and restarted at different timesteps: timesteps are identical before and after restart.
Tested my old prm file: the nr of inner iterations for the molar_Fe_in_melt and Stokes solves are still different before and after restart of Timestep 2.
Tested with current_linearization_point = solution; in checkpoint_restart as suggested above. This changes the number of Stokes iterations before the restart, and there is still a difference with the iterations afterwards.
Simplified the prm file: when I remove the strain weakening and the strain that I tracked on particles, the timesteps are identical before and after restart.
When I add in the particles as purely passive, tracking only pressure and temperature, the difference occurs again. This I think is strange. The minimum pressure on the particles also differs.
Adding passive particles to the boukare_bulk_composition.prm does not lead to restart differences. I've tried setting Melt scaling factor threshold to 1e-8, but that only converges for loose tolerances, such that at restart the Stokes iterations are 0.
I've removed the initial temperature and composition plugins that required additional shared libs. Still a difference.
I've removed AMR, and fixed T and ci on outflow boundaries. Timesteps are the same.
With all boundaries tangential (so no free surface and no prescribed velocities) timesteps are the same. The overall velocities drop several orders of magnitude obviously, not sure if that's relevant. I cannot use all free slip boundaries in the original setup, it doesn't converge.
With a free surface, but otherwise free slip boundaries, still difference in nr of Stokes iterations in the first nonlinear iteration.
Using the initial model setup, but with a constant, uniform mesh, the timesteps before and after restart are the same.
I've added passive particles and a free surface to the global_melt.prm cookbook, but it does not show a difference after the first few timesteps.

In summary, the difference goes away for either of the following: 1) The mesh is uniform. 2) No particles are used. 3) When all boundaries are free slip.

I've attached the simplest prm that demonstrates the issue when restarting with the old_restart files, so that timestep 2 is repeated. Here the nr of Stokes iterations is either 20 or 21. It does require the melt branch melt_visco_plastic_PM01 and has 1.6 MDOFs.

I'm continuing with a uniform mesh form now, but I'd appreciate any suggestions to investigate this further.

PM01_SB_11_CP_p2_2timesteps_pp_nsw_nLwr_noAMR_tanlrb.txt

geodynamics / aspect

Different advection system after restart #4088

4096 should fix this. Take care that there are two independent issues here: