Open anne-glerum opened 3 years ago
This looks like it could be a bug. Could you post a full log.txt to document versions and serial vs parallel?
This test also includes operator splitting, melting, iterated Advection and Stokes.
This would be a place to start, check if the differences happen without melting and/or operator splitting. Maybe we forget to serialize something necessary for the melt handler? Are the magnitudes comparable again at a later timestep?
-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
-- . version 2.3.0-pre (master, c268d7d)
-- . using deal.II 9.3.0-rc1
-- . with 32 bit indices and vectorization level 3 (512 bits)
-- . using Trilinos 12.18.1
-- . using p4est 2.2.0
-- . running in DEBUG mode
-- . running with 96 MPI processes
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
-- https://aspect.geodynamics.org/citing.html?ver=2.3.0-pre&melt=1&sha=c268d7d&src=code
-----------------------------------------------------------------------------
*** Resuming from snapshot!
Number of active cells: 40,960 (on 6 levels)
Number of degrees of freedom: 1,372,681 (332,930+165,153+332,930+42,273+166,465+166,465+166,465)
*** Timestep 4: t=5.85937e+13 seconds, dt=1.00419e+13 seconds
I'll dig in a little deeper.
Low res models (1 proc):
High res models (96 procs):
I'll check with another test without melt as well.
Did you test if a very simple problem like convection box is also broken? Otherwise try to simplify as much as you can. I can then try to take a look.
For the convection box, both the matrix and rhs are different after restart, but in the second timestep after restart, they have recovered. For the mass_reaction_term test, only the rhs is different after restart in the first iteration. See details below.
For the convection-box cookbook at Initial global refinement 6 and 1 proc and the T solve:
For the convection cookbook at Initial global refinement 7 and 4 procs:
For test mass_reaction_term and 96 procs for the porosity solve:
@anne-glerum I was trying to reproduce this, but all results are identical for me. I did not check the norm of the rhs an matrix though, because I do not know where and how you computed that (could you post it?).
I attach the lightly modified convection-box cookbook I used, and the log.txt file that I got as a result. I diff'ed the statistics files before and after restart and they are identical for me. Maybe we can figure out the difference between my attached files, and the ones you tested?
Rene, are the linear solver iterations the same? Anne, is this with 9.2 and 9.3?
Everything incl. linear solver iterations are the same, but I have not tested in parallel (but Anne saw differences in serial as well). Anne tested with 9.3.0-rc1 I tested with 9.3.0.
Hm, rc1 should not matter. Maybe debug vs release?
I don't see differences in linear iterations for the convection-box model either, only in the norms.
With the test mass_reaction_term on 1 proc in debug, the nonlinear residuals of the first nonlinear iteration differ and the number of linear iterations of the second porosity solve. It would be good if you could confirm that Rene, I've attached the prm. For this test, I used the old restart files instead of the most recent (i.e. copied the .old restart files to a new dir and removed the extension).
This is what I did to output the norms (I checked that all procs have the same output):
diff --git a/source/simulator/assembly.cc b/source/simulator/assembly.cc
index dccc1d5..b124cee 100644
--- a/source/simulator/assembly.cc
+++ b/source/simulator/assembly.cc
@@ -816,6 +816,8 @@ namespace aspect
system_matrix.compress(VectorOperation::add);
system_rhs.compress(VectorOperation::add);
+ pcout << "Norm Stokes system matrix " << system_matrix.frobenius_norm() << std::endl;
+ pcout << "Norm Stokes system rhs " << system_rhs.l2_norm() << std::endl;
// If we change the system_rhs, matrix-free Stokes must update
if (stokes_matrix_free)
@@ -1241,6 +1243,8 @@ namespace aspect
system_matrix.compress(VectorOperation::add);
system_rhs.compress(VectorOperation::add);
+ pcout << "Norm advection system matrix " << system_matrix.frobenius_norm() << std::endl;
+ pcout << "Norm advection system rhs " << system_rhs.l2_norm() << std::endl;
}
}
reaction_master_1proc_restart.prm.txt reaction_master_1proc.prm.txt
Ok, I see it too in the model you attached. Will investigate further later today.
diff --git a/source/simulator/assembly.cc b/source/simulator/assembly.cc
index 315a78f7b..c2cd5b06a 100644
--- a/source/simulator/assembly.cc
+++ b/source/simulator/assembly.cc
@@ -817,6 +817,10 @@ namespace aspect
system_matrix.compress(VectorOperation::add);
system_rhs.compress(VectorOperation::add);
// If we change the system_rhs, matrix-free Stokes must update if (stokes_matrix_free) { @@ -1241,6 +1245,9 @@ namespace aspect
system_matrix.compress(VectorOperation::add); system_rhs.compress(VectorOperation::add);
If this fixes the issue it should become part of 2.3.
Thanks @gassmoeller !
Relative nonlinear residuals (temperature, compositional fields, Stokes system): 2.24496e-16, 5.89794e-12, 1.32699e-12, 6.18568e-11
Relative nonlinear residuals (temperature, compositional fields, Stokes system): 2.26184e-16, 5.89793e-12, 1.32699e-12, 6.18543e-11
I'll rebase my custom branch to master (and keep the linearization point fix) and try again.
Can you try replacing the fix initialize_current_linearization_point();
with current_linearization_point = solution;
and check if that fixes the remaining issues? Otherwise I am at a loss what is happening at the moment. However, the differences seem tiny, so maybe we can investigate further after the release.
Using current_linearization_point = solution;
gives identical timesteps before and after restart for both master and the rebased custom branch for the mass_reaction_term.prm
. However, the custom melt model & prm with the custom branch lead to Stokes iteration nr differences of 99 vs 34. Apparently, the custom branch contains another issue.
Detailed results:
Current master, mass_reaction_term.prm on 1 proc, second NI after restart:
Relative nonlinear residuals (temperature, compositional fields, Stokes system): 1.96848e-16, 7.05579e-06, 4.44836e-06, 0.000114977
Relative nonlinear residuals (temperature, compositional fields, Stokes system): 1.97108e-16, 7.05579e-06, 4.44836e-06, 0.000114977
Custom melt branch rebased to current master, 1 proc: Similar residual differences as above
Master with current_linearization_point = solution;
Identical timesteps.
Custom branch with current_linearization_point = solution;
Identical timesteps
Custom branch + custom melt model + custom prm: Differences in nr of iterations of molar_Fe_in_melt and Stokes (99 vs 34)
Hi @gassmoeller , I've returned to this issue and confirm that it still exists. If I restart a model after 2 timesteps with my custom melt branch there is a difference in the nr of inner iterations in the first nonlinear iteration. I've done the following:
current_linearization_point = solution;
in checkpoint_restart as suggested above. This changes the number of Stokes iterations before the restart, and there is still a difference with the iterations afterwards. Melt scaling factor threshold
to 1e-8, but that only converges for loose tolerances, such that at restart the Stokes iterations are 0. In summary, the difference goes away for either of the following: 1) The mesh is uniform. 2) No particles are used. 3) When all boundaries are free slip.
I've attached the simplest prm that demonstrates the issue when restarting with the old_restart files, so that timestep 2 is repeated. Here the nr of Stokes iterations is either 20 or 21. It does require the melt branch melt_visco_plastic_PM01 and has 1.6 MDOFs.
I'm continuing with a uniform mesh form now, but I'd appreciate any suggestions to investigate this further.
Looking into an issue where after a restart the number of iterations for a Stokes solve differs greatly, I noticed that the rhs and the matrix of the advection solves also differ between a restarted timestep and the corresponding timestep of the original run.
For example, with the test
mass_reaction_term.prm
at higher resolution and tolerance, the norm of the rhs of the advection system differs 2 OOM after a restart in the first nonlinear iteration. I didn't expect this, and wonder if it can lead to the Stokes solver differences I see in more complex models. Should I look further into this difference?This test also includes operator splitting, melting, iterated Advection and Stokes.
Image shows the fourth timestep for the original run on the right and the restarted run on the left with a mainline aspect apart from the edit to output the norm of the rhs and matrix. The rhs norm differs 2 OOM in the first nonlinear iteration. In the second iteration, the nr of iterations for the porosity solve also differs.