compdyn / partmc

Particle-resolved stochastic atmospheric aerosol model
http://lagrange.mechse.illinois.edu/partmc/
GNU General Public License v2.0

Accuracy divergence under different compilers #151

Closed cguzman95 closed 3 years ago

cguzman95 commented 3 years ago

This bug was found during the merge of the GPU branch develop-128-monarch-multicells into develop_129_merge_cpugpu (GitHub commit: 03d7efe1d345753d22db582cb953f6308d390b35).

The results obtained with CAMP and EBI differ depending on whether the code is compiled with ICC or GCC. Common configuration:

Configuration for ICC:

Configuration for GCC:

Sample results from file CAMP_EBI_RESULTS.txt:

image

We can see that the results differ for some species, especially NO. However, the values above the configured tolerance level are exactly equal.

The results obtained by the EBI solver also vary between compilers. Below we can see multiple warnings from the EBI solver that appear only in the ICC case.
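The kind of check described above can be sketched as follows. This is only a minimal illustration: the function name `compare_runs`, the species values, and the tolerance are hypothetical and are not taken from CAMP_EBI_RESULTS.txt or from the actual test configuration.

```python
def compare_runs(run_a, run_b, abs_tol):
    """Split the species that differ between two runs into two groups:
    those with at least one value above abs_tol, and those whose values
    are both at or below abs_tol (numerically insignificant)."""
    significant, below_tol = [], []
    for species in sorted(run_a.keys() & run_b.keys()):
        a, b = run_a[species], run_b[species]
        if a == b:
            continue  # bitwise-equal results, nothing to report
        if max(abs(a), abs(b)) <= abs_tol:
            below_tol.append(species)
        else:
            significant.append(species)
    return significant, below_tol


# Illustrative numbers only (hypothetical, not real model output).
icc = {"NO": 3.2e-12, "O3": 0.05, "NO2": 1.0e-3}
gcc = {"NO": 7.9e-12, "O3": 0.05, "NO2": 1.0e-3}

significant, below_tol = compare_runs(icc, gcc, abs_tol=1.0e-9)
print("differ above tolerance:", significant)
print("differ only below tolerance:", below_tol)
```

With a split like this, a compiler-dependent difference in a trace species such as NO can be reported separately from a difference that would actually matter at the configured tolerance.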

image

On simpler tests like mock_monarch_2, with 4 gas species, we saw very similar behaviour overall. Only some small differences appear during the 180 time-steps of the simulation. This suggests that the impact of the compiler also depends on the stiffness of the system being solved.

cguzman95 commented 3 years ago

When testing test_cb05 with the -O0 flag on the GNU compiler, CVODE fails to solve ("failed in an unrecoverable manner"), with a WARNING saying that KPP returns NaN. This is an old issue: according to my notes, I have always had to compile with -O3 since I started the CAMP development.

I'm worried this issue has been hidden for a long time, since the configuration CMAKE_BUILD_TYPE=release compiles with the -O3 flag by default, and this is the default option in the Dockerfile (below I attach a sample line from a compilation with the "release" setting to show the -O3 flag).

[ 14%] Building C object CMakeFiles/partmclib.dir/src/rxns/rxn_photolysis.c.o
/apps/OPENMPI/3.0.0/GCC/bin/mpicc -DFAILURE_DETAIL -DPMC_USE_JSON -DPMC_USE_SUNDIALS -I/apps/NETCDF/4.4.1.1/GCC/OPENMPI/include -I/gpfs/scratch/bsc32/bsc32815/gpupartmc/cvode-3.4-alpha/install/include -I/gpfs/scratch/bsc32/bsc32815/gpupartmc/SuiteSparse/include -I/gpfs/scratch/bsc32/bsc32815/gpupartmc/json-fortran-6.1.0/install/jsonfortran-gnu-6.1.0/lib  -O3 -DNDEBUG   -std=c99 -o CMakeFiles/partmclib.dir/src/rxns/rxn_photolysis.c.o   -c /gpfs/scratch/bsc32/bsc32815/gpupartmc/partmc/src/rxns/rxn_photolysis.c

@mattldawson can you run the test_cb05 case with the -O0 flag? I want to check whether this only happens in my case or whether it is a common problem hidden because we have always compiled with -O3. To enable -O0 you can use these flags: -D CMAKE_C_FLAGS_RELEASE="-O0" -D CMAKE_Fortran_FLAGS_RELEASE="-O0"

mattldawson commented 3 years ago

Hi @cguzman95 -

I usually test with CMAKE_BUILD_TYPE=debug in the docker container because I run some tests with valgrind. I haven't seen this problem before, but I just built an image from the latest commit on develop-137-urban-plume-camp (ebdfc7fa), ran it in a container, set CMAKE_BUILD_TYPE=debug, and added the -O0 flag to CMAKE_C_FLAGS_DEBUG and CMAKE_Fortran_FLAGS_DEBUG, and the tests all passed. This is the output for the cb05 test:

23/85 Test: test_chemistry_cb05cl_ae5
Command: "/build/test_run/chemistry/cb05cl_ae5/test_chemistry_cb05cl_ae5.sh" "serial"
Directory: /build
"test_chemistry_cb05cl_ae5" start time: Dec 09 17:51 UTC
Output:
----------------------------------------------------------
# make sure that the current directory is the one where this script is
cd ${0%/*}
# make the output directory if it doesn't exist
mkdir -p out

((counter = 1))
while [ true ]
do
  echo Attempt $counter

  if ! ../../../test_chemistry_cb05cl_ae5; then
          echo Failure "$counter"
          if [ "$counter" -gt 1 ]
          then
                  echo FAIL
                  exit 1
          fi
          echo retrying...
  else
          echo PASS
          exit 0
  fi
  ((counter++))
done
Attempt 1
 EBI initialization time:    1.7000000000001389E-005  s
 KPP initialization time:    4.4999999999999901E-005  s
 CAMP-chem initialization time:    3.7775999999999997E-002  s
 Comparing rates
 EBI calculation time:   0.26841900000003349       s
 KPP calculation time:    1.4976980000000304       s
 CAMP-chem calculation time:    4.5463229999999779       s
 CB5 mechanism tests - PASS
PASS
<end of output>
Test time =   6.61 sec
----------------------------------------------------------
Test Passed.
"test_chemistry_cb05cl_ae5" end time: Dec 09 17:51 UTC
"test_chemistry_cb05cl_ae5" time elapsed: 00:00:06

Can you try this and see if it works for you?

cguzman95 commented 3 years ago

Okay, it's working fine with your latest commit.

Well, at least I have narrowed the problem down to "only" my code.

cguzman95 commented 3 years ago

Okay... The bug was found in test_cb05. I was doing KPP_PHOTO_RATES(:) = photo_rates(:)/60, which should be the same as the original code KPP_PHOTO_RATES(:) = 0.0001, since photo_rates(:) = 0.0001 * 60.0 in both cases. But for some reason it doesn't assign the values correctly, and that produces the error we have seen...

Fortran things I guess
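One thing that may be worth checking, offered only as an assumption and not as a confirmed cause: "multiply by 60, then divide by 60" is not guaranteed to reproduce the original literal bit for bit in floating point, and in Fortran a default-kind literal such as 0.0001 is single precision even when it is stored into a double-precision array. The Python sketch below mimics that situation with the standard `struct` module; the helper `to_float32` is hypothetical, and the assumption that the literals are default kind while the arrays are double precision is not confirmed by the thread.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Double-precision literal, analogous to 0.0001d0 in Fortran.
double_literal = 0.0001

# Single-precision literal widened to double, analogous to storing a
# default-kind 0.0001 into a double-precision array (assumed, see above).
single_literal = to_float32(0.0001)

# Single-precision product 0.0001 * 60.0, widened to double, then divided by 60.
round_trip = to_float32(single_literal * 60.0) / 60

print("double literal:   ", repr(double_literal))
print("single literal:   ", repr(single_literal))
print("after *60 then /60:", repr(round_trip))
print("relative gap (single vs double):",
      abs(single_literal - double_literal) / double_literal)
```

Differences of this size are tiny and would not by themselves explain a NaN, but they do show why the two assignments are only approximately equivalent, and a stiff solver can amplify small input differences. If the values really are not assigned at all, array shape or allocation would be a more likely place to look.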