test_cb05 CVODE convergence fails when MPI=OFF and different rates

cguzman95 commented 4 years ago

Hi @mattldawson,

Let me give some context: this error can be easily reproduced on the branch chem_mod_testcb05_monarch. This branch adds different photo_rates to test_cb05 (extracted from a MONARCH experiment) and also includes an extra file, test_cb05_monarch, that runs test_cb05 with all the MONARCH input values (same photo_rates, temperature, pressure, time step, and concentrations).

CMake flags:

cmake -D CMAKE_C_COMPILER=gcc \
-D CMAKE_BUILD_TYPE=debug \
-D CMAKE_C_FLAGS_DEBUG="-g" \
-D CMAKE_Fortran_FLAGS_DEBUG="-g" \
-D CMAKE_Fortran_COMPILER=mpifort \
-D ENABLE_JSON=ON \
-D ENABLE_SUNDIALS=ON \
-D ENABLE_TESTS=OFF \
-D ENABLE_GPU=OFF \
-D ENABLE_DEBUG=OFF \
-D FAILURE_DETAIL=OFF \
-D ENABLE_CXX=OFF \
-D ENABLE_MPI=ON \
..

Then, testing test_cb05 with:

Monarch photo_rates
MPI=ON
i_repeat=1, NUM_TIME_STEPS=1

It converges, with the expected difference with respect to EBI.

But using the same config with MPI=OFF:

I'm not sure whether this is an error in test_cb05 or in CAMP.

It also happens when running the file test_cb05_monarch (which has the complete MONARCH config). As an extra detail (maybe this is caused by another bug or by the same one, I'm not sure), with this config and MPI=ON the first time step takes a long time to converge (~3 seconds):

mattldawson commented 4 years ago

Hi @cguzman95,

For the MPI=OFF tests are you still compiling with mpiifort?

My guess is that there could be a bug in the test. A couple things to try/note:

I'm pretty sure the "corrector convergence failed repeatedly" error is coming from KPP or EBI - that does not look like a CVODE error. My guess is that it's KPP failing given that the KPP rates are NaN
I would plot the results with the gnuplot scripts for each of these scenarios and look at the profiles for the species that are triggering warnings/errors
Are you sure that the new rates/conditions you are using are getting into KPP and EBI correctly? KPP doesn't have any DO_MPI flags in its code; there could be DO_MPI flags in the cb05 test code that maybe are affecting how the new initial conditions you have added are getting to KPP?
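
A minimal sketch of the NaN check suggested above, assuming the rates handed to KPP can be dumped as a plain list of floating-point values. The function name `find_non_finite` and the example values are hypothetical and not part of CAMP, KPP, or EBI.

```python
import math


def find_non_finite(rates):
    """Return (index, value) pairs for rates that are NaN or infinite."""
    return [(i, r) for i, r in enumerate(rates) if not math.isfinite(r)]


# Hypothetical rate values; in practice these would be dumped from the test.
example_rates = [1.2e-3, float("nan"), 0.0, float("inf")]
for index, value in find_non_finite(example_rates):
    print(f"rate {index} is not finite: {value}")
```

Running a check like this right after the initial conditions are set, and again just before the solver call, would help narrow down where a NaN first appears.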

cguzman95 commented 4 years ago

> For the MPI=OFF tests are you still compiling with mpiifort?

Yes.

> Are you sure that the new rates/conditions you are using are getting into KPP and EBI correctly?

In EBI yes; in KPP I'm not sure. But CAMP and KPP should be independent. Could it be an error in KPP that stops the execution of CAMP? It's strange, though, that it only appears when MPI is OFF...

mattldawson commented 4 years ago

I'm not sure what you mean by KPP stopping the execution of CAMP. Is this in the test? It seems from the output that KPP just printed the convergence failure message and let the test continue, but it would be possible for KPP to just exit the whole test (although I don't think it does this).

The fact that there are NaN rates in KPP seems like there must be some problem with the way the initial conditions are being passed to KPP. I would check the test code, particularly for blocks affected by DO_MPI flags.
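
One way to locate the blocks affected by DO_MPI is to scan the test source for preprocessor conditionals that mention the macro. The sketch below is only an illustration: `find_guarded_blocks` and the example source lines are hypothetical, and it assumes directives are written without a space after `#`.

```python
import re


def find_guarded_blocks(source, macro="DO_MPI"):
    """Return 1-based (start, end) line pairs of #if blocks that mention macro."""
    blocks = []
    stack = []  # one entry per open conditional: (start_line, mentions_macro)
    pattern = re.compile(r"\b" + re.escape(macro) + r"\b")
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(("#ifdef", "#ifndef", "#if ")):
            stack.append((lineno, bool(pattern.search(stripped))))
        elif stripped.startswith("#endif") and stack:
            start, mentions = stack.pop()
            if mentions:
                blocks.append((start, lineno))
    return blocks


# Hypothetical source fragment; the subroutine names are made up.
example = """\
call set_rates(rates)
#ifdef DO_MPI
call pack_and_broadcast(rates)
#endif
call run_kpp(rates)
"""
print(find_guarded_blocks(example))
```

Any block reported this way that sits between setting the initial conditions and calling KPP is a candidate for the MPI=OFF difference.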

cguzman95 commented 4 years ago

Thanks. From your deduction and the hints I provided, it seems to be only a KPP problem. But I must add one more clue (the one that brought me here):

When executing test_cb05_monarch (the same cb05 with all the MONARCH input) with MPI ON and MPI OFF, the results differ in the comparison between EBI and CAMP. With MPI ON the test passes successfully:

But with MPI=OFF:

The convergence failure message could well be from KPP, but the problem is that now the test fails with different results in CAMP just by disabling the MPI flag. I think it's related to photo_rates, because the test works fine if you set these rates to zero.

mattldawson commented 4 years ago

ah, ok - yeah seems like it could be a problem. could you somehow output the photolysis rates during the solving? to compare with EBI and between the MPI=ON/OFF?

I would also fix the KPP problem, so you can compare among the three, because EBI includes parameterizations that aren't in KPP or CAMP that can affect the results.
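
A minimal sketch of that comparison, assuming the photolysis rates from each run (MPI=ON and MPI=OFF, or CAMP and EBI) have been printed in the same order and read into lists. The name `compare_rates`, the tolerance, and the example values are hypothetical.

```python
import math


def compare_rates(rates_a, rates_b, rel_tol=1e-9, abs_tol=0.0):
    """Return (index, a, b) for every pair of rates that differs beyond tolerance."""
    if len(rates_a) != len(rates_b):
        raise ValueError("rate lists have different lengths")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(rates_a, rates_b))
        # math.isclose is False when either value is NaN, so NaNs are reported.
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Hypothetical values standing in for rates printed during solving.
rates_mpi_on = [0.0, 0.01, 5.0e-4]
rates_mpi_off = [0.0, 0.01, 5.1e-4]
for index, on, off in compare_rates(rates_mpi_on, rates_mpi_off):
    print(f"rate {index} differs: MPI=ON {on} vs MPI=OFF {off}")
```

An empty result for the rates, combined with differing concentrations, would point away from the photolysis rates themselves and toward how other conditions reach the solver.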

mattldawson commented 4 years ago

it also could be that whatever the problem is with KPP and MPI=OFF is also a problem with getting conditions to EBI or CAMP, but that the problem is showing up as a difference in results rather than a solver failure

cguzman95 commented 4 years ago

> ah, ok - yeah seems like it could be a problem. could you somehow output the photolysis rates during the solving? to compare with EBI and between the MPI=ON/OFF?

Printing BASE_RATE_ in rxn_photolysis_update_env_state shows the same photolysis rates that were set at initialization, both with MPI=ON and MPI=OFF.

This may sound strange, but the error does not happen on mn4, only on p9. The CMake flags for mn4 are:

cmake -D CMAKE_C_COMPILER=$(which mpicc) \
-D CMAKE_Fortran_COMPILER=$(which mpiifort) \
-D CMAKE_BUILD_TYPE=release \
-D CMAKE_C_FLAGS_DEBUG="-std=c99 " \
-D CMAKE_C_FLAGS_RELEASE="-std=c99 -O3 " \
-D CMAKE_Fortran_FLAGS_DEBUG="" \
-D ENABLE_JSON=ON \
-D ENABLE_SUNDIALS=ON \
-D ENABLE_MPI=OFF \
-D ENABLE_GSL=ON \
-D ENABLE_TESTS=OFF \
..

Changing the C and Fortran flags to match the p9 configuration makes no difference. Maybe it is an error in the gcc compiler? Or an error that only shows up with gcc?

cguzman95 commented 4 years ago

Another finding, more related to the MONARCH bug but also related to the photolysis rates:

With MPI=ON and photo_rates=X, I just enabled the FAILURE_DETAIL flag, and in test_cb05 CVODE returns a convergence error: "mxstep steps taken before reaching tout.", even though the final results are very similar to those from EBI.

It seems that when the photolysis rates are not homogeneous (0 or 0.01), CVODE doesn't converge. Oriol and I are checking how the photolysis rates are set; maybe some values are wrong.

cguzman95 commented 4 years ago

More info about the CVODE convergence failure when using the MONARCH photo_rates:

Output:

Output when testing on my GPU branch (old camp version, GPU=OFF):

compdyn / partmc

test_cb05 CVODE convergence fails when MPI=OFF and different rates #143