
PETSc did not converge for matrix solve #322

Closed: LeiZ2 closed this issue 3 years ago

LeiZ2 commented 3 years ago

A bit of background: I am a Fluidity user. I use Fluidity to run subduction (geodynamic) models based on Garel et al., 2014. The error occurs when I run in parallel on 30 cores with Fluidity 4.1.15.

fluidity.err message:

    WARNING: Failed to converge.
    PETSc did not converge for matrix solve of: MaterialVolumeFraction
    Reason for non-convergence is undefined: -11
    Number of iterations: 0
    Sending signal to dump and finish
    Dumping matrix equation in file called matrixdump

Diagnosis: I ran "petsc_readnsolve subduction.flml MaterialVolumeFraction -l" and got the following message:

    For multi-material/phase you need to provide the material_phase and the field name, e.g. SolidPhase::Pressure
    ERROR
    Error message: Missing material_phase name

I am sure the material_phase names are already there, and so are the field names. I have attached the list of material_phase and field names defined in the *.flml (screenshot attached).
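For reference, the error above appears to come from petsc_readnsolve itself rather than from the flml: it expects the field argument on the command line to be prefixed with its material_phase name, as in the SolidPhase::Pressure example it prints. A minimal sketch of such an invocation, using a purely hypothetical phase name that should be replaced with one actually defined in the flml:

    petsc_readnsolve subduction.flml SomePhase::MaterialVolumeFraction -l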

Solutions that used to "fix" the problem (I am not sure they actually fixed it; there was just no error message any more...): this is quite a common error message that I have met while running this series of models, and I have some experience handling it when the simulations were less complex. Two workarounds (see the command sketch after this list):
Method 1) Produce a mesh with finer resolution by turning on mesh adaptivity in a serial run first, then use this adapted mesh for the parallel run. This method usually works better than Method 2.
Method 2) Alternatively, reduce the total number of cores used for the parallel run, e.g. from 40 cores down to 10. This usually delays the error message and sometimes "kills" it.
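Roughly how these two workarounds translate into commands. This is only a sketch: the file names and core counts are placeholders, and the flredecomp step should be checked against the manual for the Fluidity version in use.

    # Method 1: serial run with mesh adaptivity to produce a finer starting mesh
    fluidity -v2 -l subduction.flml

    # decompose the resulting setup for a parallel run (usage may differ by version)
    mpirun -n 10 flredecomp -i 1 -o 10 subduction_checkpoint subduction_flredecomp

    # Method 2: rerun on fewer cores, e.g. 10 instead of 40
    mpirun -n 10 fluidity -v2 -l subduction_flredecomp.flml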

But this time, when I increased the complexity of the model, both methods failed. With Method 1, the quality of the adapted mesh was just too poor to make a difference: the mesh produced by the serial run stopped adapting at some point, and there was no error message about it. The first_timestep_mesh is attached (screenshot attached).

Then I tried Method 2, i.e. reducing the number of CPU cores from 40 to 10. It only delayed the error message by a few output timesteps, and the output frequency was not ideal.

My questions are:
- Could someone help me diagnose the error message? Under what circumstances would the material_phase or field names go missing during a parallel run?
- Do Methods 1 and 2 make sense, and are there any other possible solutions?

gnikit commented 3 years ago

Not sure if you have looked into this, but the reason PETSc is failing is the preconditioner you have chosen. Have you tried another PC, or better yet, added -ksp_error_if_not_converged to your PETSc options? See: https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPConvergedReason.html

Reasons why a preconditioner fails according to PETSc are:

It was not possible to build or use the requested preconditioner. This is usually due to a zero pivot in a factorization. It can also result from a failure in a subpreconditioner inside a nested preconditioner such as PCFIELDSPLIT.

see: https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSP_DIVERGED_PC_FAILED.html#KSP_DIVERGED_PC_FAILED

If changing your solver and/or preconditioner does not fix the issue, could you attach the .flml for us to replicate?
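One way to get more detail out of PETSc, alongside the option above, is to add standard PETSc monitoring options to PETSC_OPTIONS; whether Fluidity's per-field solver prefixes pick these up may depend on the build, so treat this as a sketch:

    export PETSC_OPTIONS="-ksp_error_if_not_converged -ksp_converged_reason -ksp_monitor_true_residual"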

gnikit commented 3 years ago

From the looks of it, you are using an older (Python 2) version of Fluidity, which we no longer develop since Python 2 has been deprecated in later Ubuntu releases (see Fluidity release 4.1.16). All that is to say that I don't think I will be able to replicate the issue locally.

That being said, could I ask you to

export PETSC_OPTIONS="-ksp_error_if_not_converged"

and then run petsc_readnsolve for MaterialVolumeFraction? Could I also ask you to set your solver to gmres with mg or ilu as the preconditioner?

Then post the error message that PETSc throws in your .log/.err file (I am not sure where it will be placed).
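Put together, the requested steps might look roughly like the following; the flml name is taken from the earlier comment, and the phase prefix is a placeholder to be replaced with a real material_phase name:

    export PETSC_OPTIONS="-ksp_error_if_not_converged"
    petsc_readnsolve subduction.flml SomePhase::MaterialVolumeFraction -l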

LeiZ2 commented 3 years ago

Hi @gnikit, many thanks for following up on this issue. It seems the option PETSC_OPTIONS="-ksp_error_if_not_converged" is not available on the server where I run fluidity/4.1.15. I have attached a screenshot of the module list and the executables starting with "petsc" for you to diagnose (screenshot attached).

Anyway, I tried export PETSC_OPTIONS="-ksp_error_if_not_converged" and then ran petsc_readnsolve for MaterialVolumeFraction with the solver and each of the two preconditioners you mentioned. The petsc_readnsolve.err messages reported in both cases are still:

    For multi-material/phase you need to provide the material_phase and the field name, e.g. SolidPhase::Pressure
    ERROR
    Error message: Missing material_phase name

You mentioned the Fluidity version. Does that mean the setup in the flml file is significantly different in the later versions of Fluidity? If we unfortunately do not find an effective solution for the error, do you think it is necessary for me to build the latest version of Fluidity and rebuild the flml with Diamond there?