routine meshfem3D_models_getatten_val() from Hejun Zhu is unreliable and gives wrong results in SPECFEM3D_GLOBE

mpbl commented 11 years ago

In SPECFEM3D_GLOBE, fix this known 64-bit issue when -mcmodel=medium -shared-intel is used in order to handle very large static memory size per core (the 3D_Cartesian and 2D packages are fine because that compiler option is never used in them):

If you run very large meshes on a relatively small number of processors, the static memory size needed on each processor might become greater than 2 gigabytes, which is the upper limit for 32-bit addressing (dynamic memory allocation is always OK, even beyond the 2 GB limit; only static memory has a problem). In this case, on some compilers you may need to add -mcmodel=medium (if you do not use the Intel ifort / icc compiler) or -mcmodel=medium -shared-intel (if you use the Intel ifort / icc compiler) to the configure options of CFLAGS, FCFLAGS and LDFLAGS otherwise the compiler will display an error message (for instance 'relocation truncated to fit: R_X86_64_PC32 against .bss' or something similar);

BEWARE that using -mcmodel=medium -shared-intel is known for currently leading to incorrect seismograms, at least when flag ATTENUATION is on (and maybe even without), at least in the case of the Intel ifort / icc compiler. This likely comes from intrinsic functions such as size() that return and integer8 instead of an integer4, thus leading to incorrect results when used in function calls is the -i8 flag is not added to the compiler options; however, when adding -i8 the code does not compile because the MPI calls then refuse to compile (they need integer4 as arguments). It seems that this problem is triggered when 3D models are used and mostly (only?) when attenuation is on, but not triggered for 1D Earth model benchmarks even with attenuation (at least not triggered for EXAMPLES/benchmarks/attenuation_benchmark_GJI_2002_versus_normal_modes nor for Anne Sieminski's three 1D benchmarks), which makes it uneasy to locate. Since most current users run the code with less than 2 GB of static memory per core, we have not investigated that problem carefully for now, and thus recommend that you do NOT use these compiler flags until this problem is located and fixed.

(todo_list_please_dont_remove.txt: suggestion 45)

QuLogic commented 10 years ago

This likely comes from intrinsic functions such as size() that return and integer8 instead of an integer4, thus leading to incorrect results when used in function calls is the -i8 flag is not added to the compiler options; however, when adding -i8 the code does not compile because the MPI calls then refuse to compile (they need integer4 as arguments).

This explanation doesn't really make sense to me. The size intrinsic should return an integer of the default kind, and changing the memory model shouldn't affect that?

Is there a test case for this?

komatits commented 10 years ago

This is a bug that is difficult to locate (Daniel and I tried in August but never managed to locate it). To trigger it you can use any example with a 3D Earth model and run it with the Intel ifort compiler with and without -mcmodel=medium -shared-intel, and you will get slightly different seismograms (the difference being significant; a few percent, sometimes more).

All 1D examples we have tested are fine (i.e. give identical seismograms with and without -mcmodel=medium -shared-intel), thus the bug is likely in the 3D routines.

The bug seems to be in the solver rather than in the mesher, because running a mesh created with or without -mcmodel=medium -shared-intel gives the same seismograms when run through the compiled solver; it is only when the solver itself is compiled with and without -mcmodel=medium -shared-intel that we get different seismograms.

The "size()" explanation below is something I found on the Web in August (but I forgot to keep the link); as you say, it seems unlikely.

Fixing this would be really great; please let us know if you can locate the problem.

If I remember correctly, ATTENUATION makes the bug worse (it could even be that there is no bug when ATTENUATION is off; I do not remember).

Thanks, Dimitri.

On 12/18/2013 05:20 AM, Elliott Sales de Andrade wrote:

This likely comes from intrinsic functions such as size() that
return and integer8 instead of an integer4, thus leading
to incorrect results when used in function calls is the -i8 flag is
not added to the compiler options;
however, when adding -i8 the code does not compile because the MPI
calls then refuse to compile (they need integer4 as arguments).
This explanation doesn't really make sense to me. The |size| intrinsic should return an integer of the default kind, and changing the memory model shouldn't affect that?

Is there a test case for this?

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-30815164.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

QuLogic commented 10 years ago

I tried to reproduce without any luck so far, though it was somewhat older code and with a 3D_anisotropic model that was mostly isotropic. When I have a chance to update to the newest code, I will try again.

komatits commented 10 years ago

I need to work on GLOBE this week for other things, thus I will have a look again and will try to create an example (benchmark) for which the problem appears (examples for which it does not appear are already available: for instance all the benchmarks with 1D models and normal-mode reference solutions seem OK. I will keep you posted.

On 01/07/2014 05:36 AM, Elliott Sales de Andrade wrote:

I tried to reproduce without any luck so far, though it was somewhat older code and with a 3D_anisotropic model that was mostly isotropic. When I have a chance to update to the newest code, I will try again.

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-31712858.Web Bug from https://github.com/notifications/beacon/5817129__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcwNDYwMjE5NiwiZGF0YSI6eyJpZCI6MjAxMzExNzJ9fQ==--74f98618f14160210012f6be1d5b5504c465d66d.gif

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

I will try to work on this this week (at least create a test case to reproduce the bug)

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter this turns out to be a major bug: Ebru's current runs are maybe highly affected, the seismograms being wrong (at least on some machines). I have analyzed the problem in detail, see my new directory called "IMPORTANT_BUG_benchmark_Ebru_Par_file" in SPECFEM3D_GLOBE, and in particular the two files README_current_conclusions_from_Dimitri_about_the_bug.txt and HOWTO_plot_the_results.txt in that directory.

We should make fixing this a high priority because for 3D Earth models with attenuation the seismograms are currently wrong (at least on some machines) (and as you will see when plotting the results, the differences are very large).

I tried to locate that bug back in August (unsuccessfully) and checked again today, I see nothing obvious so far.

It seems the problem has been there for at least a year, probably more.

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter Comment on my message above from 4 months ago, which was: "The bug seems to be in the solver rather than in the mesher, because running a mesh created with or without -mcmodel=medium -shared-intel gives the same seismograms when run through the compiled solver; it is only when the solver itself is compiled with and without -mcmodel=medium -shared-intel that we get different seismograms."

--> I now get the opposite behavior (i.e. I think the bug is likely in the mesher), and I made this comment in December but had done the tests in August; thus we can probably ignore this old comment (?); or else the problem is random (?)

--> in the meantime we have also switched to Daniel's new version; however I do not see how this could lead to the opposite behavior

mpbl commented 10 years ago

@komatits In this thread you are saying that -mcmodel=medium -shared-intel with intel compiler sis a problem. and in README_current_conclusions_from_Dimitri_about_the_bug.txt you are saying that mcmodel=medium is also a problem with gnu compilers.

Am I right?

mpbl commented 10 years ago

@komatits Can you provide the exact Par_files for the various examples. In particular for: attenuation_but_without_ATTENUATION_1D_WITH_3D_STORAGE_is_NOT_OK

komatits commented 10 years ago

Hi,

No, sorry about the confusion, here are the conclusions:

for Ebru's Par_file, ifort with -mcmodel=medium -shared-intel and ifort without -mcmodel=medium -shared-intel give very different seismograms
for that same Par_file, GNU gfortran gives the same seismograms with or without -mcmodel=medium, BUT gfortran v4.6 gives very different seismograms from gfortran v4.8 (and in addition they also differ from the seismograms given by Intel ifort)

So basically some memory gets corrupted at some point, and the problem is severe because memory is altered very differently between ifort, gfortran v4.6 and gfortran v4.8 and thus the seismograms change drastically.

The bug is triggered only when a 3D Earth model is used and attenuation is turned on. 3D models without attenuation are OK, and 1D models with attenuation are also OK.

Thanks, Dimitri.

On 10/04/2014 17:16, Matthieu Lefebvre wrote:

@komatits https://github.com/komatits In this thread you are saying that -mcmodel=medium -shared-intel with intel compiler sis a problem. and in README_current_conclusions_from_Dimitri_about_the_bug.txt you are saying that mcmodel=medium is also a problem with gnu compilers.

Am I right?

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40098028.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

PS: that is why the bug went undetected, because all current benchmarks are fine and fit the reference solutions since 1D models with attenuation are OK. It is also hard to debug because we do not know what the right seismograms should be (or are).

On 10/04/2014 19:34, Dimitri Komatitsch wrote:

Hi,

No, sorry about the confusion, here are the conclusions:

for Ebru's Par_file, ifort with -mcmodel=medium -shared-intel and ifort without -mcmodel=medium -shared-intel give very different seismograms

for that same Par_file, GNU gfortran gives the same seismograms with or without -mcmodel=medium, BUT gfortran v4.6 gives very different seismograms from gfortran v4.8 (and in addition they also differ from the seismograms given by Intel ifort)

So basically some memory gets corrupted at some point, and the problem is severe because memory is altered very differently between ifort, gfortran v4.6 and gfortran v4.8 and thus the seismograms change drastically.

The bug is triggered only when a 3D Earth model is used and attenuation is turned on. 3D models without attenuation are OK, and 1D models with attenuation are also OK.

Thanks, Dimitri.

On 10/04/2014 17:16, Matthieu Lefebvre wrote:

@komatits https://github.com/komatits In this thread you are saying that -mcmodel=medium -shared-intel with intel compiler sis a problem. and in README_current_conclusions_from_Dimitri_about_the_bug.txt you are saying that mcmodel=medium is also a problem with gnu compilers.

Am I right?

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40098028.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

QuLogic commented 10 years ago

Looks like I can reproduce it here with:

$ ifort --version
ifort (IFORT) 12.1.3 20120212
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.
$ module list
Currently Loaded Modulefiles:
  1) intel/12.1.3                3) gcc/4.6.1                   5) extras/64_6.4
  2) openmpi/1.4.4-intel-v12.1   4) python/2.7.2                6) gnuplot/4.2.6

The only difference is the addition of -mcmodel=medium -shared-intel to FLAGS_CHECK and CFLAGS.

bug

komatits commented 10 years ago

OK, thanks!

On 10/04/2014 20:53, Elliott Sales de Andrade wrote:

Looks like I can reproduce it here with:

$ ifort --version ifort (IFORT) 12.1.3 20120212 Copyright (C) 1985-2012 Intel Corporation. All rights reserved. $ module list Currently Loaded Modulefiles: 1) intel/12.1.3 3) gcc/4.6.1 5) extras/64_6.4 2) openmpi/1.4.4-intel-v12.1 4) python/2.7.2 6) gnuplot/4.2.6

The only difference is the addition of |-mcmodel=medium -shared-intel| to |FLAGS_CHECK| and |CFLAGS|.

bug https://cloud.githubusercontent.com/assets/302469/2671983/41db6212-c0e1-11e3-874d-e96b24904b40.png

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40123979.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

QuLogic commented 10 years ago

BTW, does anyone have an idea of when this might have been working correctly (if ever)?

komatits commented 10 years ago

hard to tell, because there is no reference solution. I am pretty sure last August the code was already wrong, but I am not sure regarding previous years. We could try to rerun very old versions from say 2007, 2008, 2009, 2010, 2011, 2012 and see if (and when) seismograms start to vary

komatits commented 10 years ago

My feeling is that the problem could be related to attenuation parameters that do not get broadcast correctly to the MPI slices when a 3D model is used (i.e. they are probably erased or corrupted), and for some reason this does not happen when a 1D model is used. Just guessing.

komatits commented 10 years ago

Another thing that could help would be to run the code through "valgrind"; however when I tried that last August I got tons of unrelated messages coming from the MPI library (that is a known issue at least with MPICH; I am not sure regarding OpenMPI)

komatits commented 10 years ago

Very good idea. Done.

On 10/04/2014 17:48, Matthieu Lefebvre wrote:

@komatits https://github.com/komatits Can you provide the exact Par_files for the various examples. In particular for: attenuation_but_without_ATTENUATION_1D_WITH_3D_STORAGE_is_NOT_OK

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40101991.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter good news, the bug seems to be specific to model s362ani. When switching to s40rts in Ebru's Par_file everything works fine. Thus the bug is likely somewhere in the s362ani routines (or in the way they MPI_Bcast their parameters, for instance the attenuation parameters). s40rts_with_attenuation_is_ok

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter more precisely good news and bad news, because s362ani is the default model... thus as soon as this bug is fixed we should commit that change to the "master" branch and inform users

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter good news, s40rts with attenuation works for ifort, gfortran, pgf90, and also for mcmodel=medium; thus, the problem really seems to be specific to s362ani good_news_s40rts_with_attenuation_works_for_ifort_gfortran_pgf90_and_also_mcmodel_medium

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter it is probably something in model_s362ani.f90 that corrupts the memory at some point, thus erasing at least in part some of the attenuation arrays (?). The Fortran in model_s362ani.f90 is so poor that it is difficult to understand and very difficult to debug.

QuLogic commented 10 years ago

I have checked out a fairly old version:

commit 76e77ba90ee0e1154b7f9851e093d35f4517f220
Author: Dimitri Komatitsch <komatitsch@lma.cnrs-mrs.fr>
Date:   Sat Jan 7 00:45:36 2012 +0000

    suppressed DATA/Par_file.in and DATA/SOURCE.in handled by "configure", since it was both useless and dangerous (modifications made by the user wou

which appears to work correctly.

bug

I will set up a bisection to see if I can narrow down the problem. How long this will take depends on how busy the cluster is, though.

komatits commented 10 years ago

Great! On my side I am going to download the original Fortran77 version of s362ani again, convert it to F90 and see if that version works fine; however it is so ugly that it is difficult to analyze

On 12/04/2014 04:02, Elliott Sales de Andrade wrote:

I have checked out a fairly old version:

|commit 76e77ba90ee0e1154b7f9851e093d35f4517f220 Author: Dimitri Komatitsch komatitsch@lma.cnrs-mrs.fr Date: Sat Jan 7 00:45:36 2012 +0000
 suppressed DATA/Par_file.in and DATA/SOURCE.in handled by "configure", since it was both useless and dangerous (modifications made by the user wou
which appears to work correctly.

bug https://cloud.githubusercontent.com/assets/302469/2686112/6b5f9eae-c1e6-11e3-9822-f14535760826.png

I will set up a bisection to see if I can narrow down the problem. How long this will take depends on how busy the cluster is, though.

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40269055.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

@QuLogic could you try the old Jan 2012 version you mention with GNU gfortran? because I guess it will NOT work (I have detected some problems in the original Fortran77 version of the s362ani routines from Columbia and Harvard (see below) thus I do not see how any later version could work fine, since we did not really change it (apart from converting from F77 to F90)

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter very bad news regarding s362ani, I took the original F77 routines from the Columbia web site http://www.ldeo.columbia.edu/~ekstrom/Projects/3D/BK/models.html and called them directly instead of using our current model_s362ani.f90 routine (see my new source code in https://www.lma.cnrs-mrs.fr/Filez/rwtiyp ); I then get four (very) different sets of seismograms when compiling that with: Intel ifort, Intel ifort with -mcmodel=medium, GNU gfortran, and Portland pgf90. Thus it seems that there is no bug in SPECFEM and that it is the original Columbia routine that is highly unreliable. Please confirm! Thanks.

To compile that new version, in flags.guess you will need to get rid of "implicit none" compiler options as well as any type of syntax checking because their F77 routine uses implicit declarations as well as many obsolete statements.

I made a couple of (very safe) minor changes in their original routine, look for "DK DK" in src/meshfem3D/model_s362ani.f90).

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter of course we cannot completely rule out problems in our attenuation routines, since s362ani without attenuation seems to give the same seismograms with the above compiler options; however, PREM + attenuation fits a reference benchmark perfectly, and 3D model s40rts + attenuation is also OK. Considering how ugly the Fortran used in model_s362ani.f90 is, my guess is that that routine has problems, but that these problems are triggered when attenuation is on just because memory is mapped differently in that case (since we use static memory allocation). To be completely sure of that, overnight I am going to run another 3D model with attenuation, in addition to s40rts. I will then stop testing, in part by lack of time but also because s362ani.f90 is so poorly written, with no comments, that it is impossible to check carefully; if it is broken, Columbia should fix it.

QuLogic commented 10 years ago

@komatits Are the changes to src/auxiliaries, src/shared and src/specfem3d really necessary?

Should we also be wary of any of the other models related to s362ani (1dref, s29ea, etc.)?

komatits commented 10 years ago

No, only in src/meshfem3D.

I have not tested the other s362 models, but we probably should because they call the same routines.

On 15/04/2014 00:46, Elliott Sales de Andrade wrote:

@komatits https://github.com/komatits Are the changes to |src/auxiliaries|, |src/shared| and |src/specfem3d| really necessary?

Should we also be wary of any of the other models related to s362ani (1dref, s29ea, etc.)?

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40427420.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter The four new runs with s20rts + attenuation (using Intel ifort, Intel ifort with -mcmodel=medium, GNU gfortran, and Portland pgf90) give the exact same seismograms. Thus: s20rts is OK, s40rts is OK, s362ani is not (when attenuation is on).

QuLogic commented 10 years ago

@komatits Odd, I have not been able to reproduce the problem with gfortran on the Jan 2012 code, and I tried several versions. Also, using the code you posted, I got identical seismograms with ifort, ifort+mcmodel=medium, and gfortran.

Trying to bisect is nearly impossible so far as almost all the 2013 code either does not compile, or does not run. I have narrowed it down from ~800 to ~80 commits though. There are only 3 or 4 commits that change model_s362ani.f90, but given that I can't reproduce with it, I'm not sure whether it's the only factor.

PS, now would be a good time to mention that distributed nature of git. Instead of passing around tarballs, you can commit to a branch, and other developers can fetch from your fork. For example, I have taken your changes and committed them on my fork in the s362ani-bug branch. To use it on an existing clone, one can:

git remote add QuLogic https://github.com/QuLogic/specfem3d_globe/tree/s362ani-bug
git remote fetch QuLogic
git checkout -b s362ani-bug QuLogic/s362ani-bug

mpbl commented 10 years ago

@komatits "Another thing that could help would be to run the code through "valgrind"; however when I tried that last August I got tons of unrelated messages coming from the MPI library (that is a known issue at least with MPICH; I am not sure regarding OpenMPI)"

--> ddt gives only a couple of false positive errors (I can send you the log if you want). However, -Wconversion (gfortran) to compile model_s362ani.f90 returns a few r8 to r4 implicit conversion errors. I do not know how this could have an impact on the memory model.

QuLogic commented 10 years ago

@mpbl If the log's not too big and text (I don't recall if it is), you can post it as a Gist.

komatits commented 10 years ago

Hi,

This is good news I think. I then do not understand why I get different results on my side, with the same code, unless there is a random bug somewhere. What is weird is that on your side you do get the same behavior I get (i.e. different seismograms) using the Aug 2013 version. Strange...

On 15/04/2014 21:31, Elliott Sales de Andrade wrote:

@komatits https://github.com/komatits Odd, I have not been able to reproduce the problem with gfortran on the Jan 2012 code, and I tried several versions. Also, using the code you posted, I got identical seismograms with ifort, ifort+mcmodel=medium, and gfortran.

Trying to bisect is nearly impossible so far as almost all the 2013 code either does not compile, or does not run. I have narrowed it down from ~800 to ~80 commits though. There are only 3 or 4 commits that change |model_s362ani.f90|, but given that I can't reproduce with it, I'm not sure whether it's the only factor.

PS, now would be a good time to mention that distributed nature of git. Instead of passing around tarballs, you can commit to a branch, and other developers can fetch from your fork. For example, I have taken your changes and committed them on my fork in the s362ani-bug branch https://github.com/QuLogic/specfem3d_globe/tree/s362ani-bug. To use it on an existing clone, one can:

git remote add QuLogic https://github.com/QuLogic/specfem3d_globe/tree/s362ani-bug git remote fetch QuLogic git checkout -b s362ani-bug QuLogic/s362ani-bug

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40523912.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

Thanks! This is very good news.

On 15/04/2014 21:33, Matthieu Lefebvre wrote:

@komatits https://github.com/komatits "Another thing that could help would be to run the code through "valgrind"; however when I tried that last August I got tons of unrelated messages coming from the MPI library (that is a known issue at least with MPICH; I am not sure regarding OpenMPI)"

--> ddt gives only a couple of false positive errors (I can send you the log if you want). However, -Wconversion (gfortran) to compile model_s362ani.f90 returns a few r8 to r4 implicit conversion errors. I do not know how this could have an impact on the memory model.

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40524164.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

QuLogic commented 10 years ago

So I'm still trying to get through this bisect, but I have reached a point where code either does not compile or does not run, so it is difficult.

@komatits I think I have figured out why you were confused about whether the bug was in the solver or the mesher. I believe there were actually two bugs. I bisected the original trunk (i.e., before merging the sunflower branch), and it pointed to geodynamics/specfem3d_globe@dfbb99f5cd652766d07aebe2036e904f74cc03b0. This change wasn't exactly the problem, but rather there was an extra argument in the call to initialize_simulation that caused the solver to misbehave. That was likely the original bug you saw and thought was in the solver. However, the code now puts all these variables in modules instead of large unwieldy argument lists, so it is not the current bug.

QuLogic commented 10 years ago

The current bug appears to have been introduced some time during the ADIOS branch or the merge of the sunflower/ADIOS branch back into trunk; and this time appears to be in the mesher. Unfortunately, it is next to impossible to narrow it down further since half the commits are semi-complete merges, and the remaining half cannot be compiled or run.

All I can say is that geodynamics/specfem3d_globe@b04f025d1505898e8e286257a4cab4f973b6ab38 (one of the first commits on the ADIOS branch) worked correctly, and geodynamics/specfem3d_globe@73690a718f4b489142e07aaf943092b86f8f5e43 (some time after the merge) is the earliest commit I can find that compiles, runs, and shows the problem. Everything in between either doesn't compile, or doesn't run. Unfortunately, that leaves around 100 commits in which the problem might have been introduced. The bisect tells me there are 56 commits left to test, though I doubt I will find one that works.

komatits commented 10 years ago

OK, great. This is good news, it is consistent with the tests I had done back then that showed that the solver was also broken.

On 16/04/2014 10:15, Elliott Sales de Andrade wrote:

So I'm still trying to get through this bisect, but I have reached a point where code either does not compile or does not run, so it is difficult.

@komatits https://github.com/komatits I think I have figured out why you were confused about whether the bug was in the solver or the mesher. I believe there were actually /two/ bugs. I bisected the original trunk (i.e., before merging the sunflower branch), and it pointed to dfbb99f https://github.com/geodynamics/specfem3d/commit/dfbb99f5cd652766d07aebe2036e904f74cc03b0. This change wasn't exactly the problem, but rather there was an extra argument in the call to |initialize_simulation| that caused the solver to misbehave. That was likely the /original/ bug you saw and thought was in the solver. However, the code now puts all these variables in modules instead of large unwieldy argument lists, so it is not the /current/ bug.

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40574056.Web Bug from https://github.com/notifications/beacon/5817129__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcxMzI1NTMzOSwiZGF0YSI6eyJpZCI6MjAxMzExNzJ9fQ==--cb45210fd379fd8f94bb80a120fffde580d3994c.gif

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter

Princeton and Toronto both confirm the major bug: @mpbl and @QuLogic see very different seismograms between gfortran and Intel ifort for Ebru's Par_file (I see the same weird behavior on http://www-hpc.cea.fr/fr/complexe/tgcc-curie.htm in Europe) princeton_tiger_gfortran_vs_intel_ifort_for_ebru_s_par_file

QuLogic commented 10 years ago

The results of my analysis can be found here. I have tested three compilers: ifort, ifort with mcmodel=medium, and gfortran; and I have tested 4 commits:

s362ani-bug: the HEAD of the s362ani-bug branch, which is devel + patch from Dimitri to use original s362ani code.
Jan 2012: an old revision I found which appears to work.
Last Good: the last known good commit found during the bisect.
First Bad: the first known bad commit found during the bisect.

In date order, it appears that:

Jan 2012 matches well (though not perfectly); the difference between ifort and gfortran is ~2e-7.
Last Good also matches overall, but the difference between ifort and ifort+mcmodel=medium is now larger than against gfortran. Not sure why.
First Bad clearly shows ifort different from the other two.
s362ani-bug also shows ifort different from the other two.

We can also look at how things play out over time:

For ifort,
- Jan 2012 -> Last Good appears to have no change,
- Last Good -> First Bad has a small but noticeable change,
- First Bad -> s362ani-bug has an even larger change.
For ifort + mcmodel=medium,
- Jan 2012 -> Last Good appears to have no change,
- Last Good -> First Bad has a small but noticeable change,
- First Bad -> s362ani-bug appears to return to normal somehow?
For gfortran,
- Jan 2012 -> Last Good appears to have no change,
- Last Good -> First Bad has a small but noticeable change,
- First Bad -> s362ani-bug appears to return to normal again?

@mpbl also sent the results from his side (which are only for s362ani-bug), and things are a little different. Instead of ifort, gfortran is the odd man out this time. There's also a comparison between my results and Matthieu's results for ifort, ifort+mcmodel=medium, and gfortran. It appears ifort and gfortran are swapped for him?

QuLogic commented 10 years ago

In summary, it looks like from Jan 2012 up to Last Good is okay, though the increased differences could indicate a problem. The First Bad commit clearly causes trouble in all instances. The patch from @komatits in s362ani-bug seems to fix some options, but unfortunately not all of them.

Also, @komatits has tested with gfortran 4.8 which apparently gives another different result, so there's a possibility that my known good commits don't work with it. I'm hoping that's not the case.

Now having said all that, unfortunately, there are many commits between Last Good and First Bad, and I have been unable to compile any of them. So I cannot narrow things down any further, and the diff between these two commits is possibly too large to find any problems.

komatits commented 10 years ago

Hi Elliott,

Thanks a lot. Extremely useful. Tomorrow I will check your "Last Good" with gfortran4.8, and I will also carefully check the differences between Last Good and First Bad using "meld" or similar, http://meldmerge.org/

Thanks! Dimitri.

On 18/04/2014 02:18, Elliott Sales de Andrade wrote:

In summary, it looks like from Jan 2012 https://github.com/geodynamics/specfem3d_globe/commit/76e77ba9 up to Last Good https://github.com/geodynamics/specfem3d_globe/commit/b04f025d is okay, though the increased differences could indicate a problem. The First Bad https://github.com/geodynamics/specfem3d_globe/commit/d1de5ba7 commit clearly causes trouble in all instances. The patch from @komatits https://github.com/komatits in s362ani-bug https://github.com/QuLogic/specfem3d_globe/tree/s362ani-bug seems to fix some options, but unfortunately not all of them.

Also, @komatits https://github.com/komatits has tested with gfortran 4.8 which apparently gives another different result, so there's a possibility that my known good commits don't work with it. I'm hoping that's not the case.

Now having said all that, unfortunately, there are /many/ commits between Last Good and First Bad, and I have been unable to compile any of them. So I cannot narrow things down any further, and the diff between these two commits is possibly too large to find any problems.

— Reply to this email directly or view it on GitHub https://github.com/geodynamics/specfem3d/issues/8#issuecomment-40776694.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter
I confirm that the s362anibug version (i.e. the version with my Fortran77 patch) works fine in the case of model s362ani without ATTENUATION

s362anibug_version_with_model_s362ani_and_without_attenuation_works_ok

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter gfortran 4.8 fails for Last_Good. However, all the other compilers (including gfortran 4.6) agree. Since gfortran4.8 is NOT the default GNU compiler on the machine I use, we can probably safely assume that it simply does not work, and ignore it (it gave me weird warnings at compile time for other codes, which made no sense, while gfortran4.6 worked fine with no warnings).

However, to get independent confirmation, we have one more option: running that case on Titan at ORNL using the Cray compiler instead of the default, which is Portland there. @mpbl could you please run it and send us the seismogram? to switch from Portland to Cray at ORNL you can refer to https://www.olcf.ornl.gov/kb_articles/controlling-the-programming-environment/

With gfortran 4.8:

gfortran_4_8_fails

All the others:

but_the_others_all_agree

komatits commented 10 years ago

Hi Matthieu,

OK, then it means the bug is really random and the code can switch kind of randomly between three possible sets of results (or maybe only two, if we can exclude gfortran4.8).

And that random switch seems to occur before the time loop, because once the code is in one of these two modes it remains there for the whole run.

QuLogic commented 10 years ago

I tried out with gfortran 4.8. As you say, it doesn't appear to work; though, as with 4.6, it didn't change too much over time.

But I was unable to run it at all on the Jan 2012 revision; unlike everything else, it just blew up in the solver. So I'm not sure if there's a different bug or it's just unreliable.

komatits commented 10 years ago

Hi Elliott, @QuLogic

Thanks for the gfortran4.8 test. Very useful. What is really strange is that on my side when I use gfortran4.8 for model s20RTS with attenuation, everything works fine and fits the rest (see the picture below). When I use it for Last Good with s362ani + attenuation then I get seismograms that differ from the rest. This seems to show that the problem comes from s362ani, but also (unfortunately) that Last Good also has a problem (?).

gfortran_4_8_works_fine_for_s20rts_with_attenuation

komatits commented 10 years ago

@mpbl @wjlei1990 @QuLogic @jas11 @danielpeter Based on my gfortran4.8 run for s20RTS instead of s362ani and based on the run for s362ani on Titan with the Cray compiler, it is now confirmed that the code can (kind of randomly) switch between at least three types of seismograms, because the Titan Cray result for the "Last Good" version fits my gfortran4.8 result perfectly (see below).

Thus:

the "Last Good" version is NOT good
the bug is very old (before Jan 2012 for sure, probably since the day someone added the s362ani model; the bug is also in the original s362ani routine from Columbia, because downloading it again and using it directly in SPECFEM does not fix the problem
the random switch occurs before the time loop, because once the code is in one of the three modes it remains there for the whole run; thus the switch probably occurs in the mesher rather than in the solver
as a conclusion, ALL forward runs for s362ani are very unreliable and likely wrong (since Jan 2012 at least, probably since s362ani support was added); inverse problems are likely OK after a few iterations, but the first iterations make the iterative problem more difficult because the initial model / models are significantly wrong

Last Good is not good:

lastgood_is_not_good

Last Good Titan Cray fits my Last Good gfortran 4.8 (which differs from my Last Good gfortran 4.6, not shown here):

lastgood_gfortran_4_8_paris_machine_fits_lastgood_cray_compiler_titan

komatits commented 10 years ago

From @QuLogic :

v5.1.5 also does not work. See first plot here: http://nbviewer.ipython.org/github/QuLogic/s362ani-bug-analysis/blob/master/Analysis.ipynb#Toronto-Results

As with Jan 2012, the ifort and ifort+mcmodel=medium are similar, with gfortran 4.6 a bit farther off. But gfortran 4.8 is way off. Though this time the difference between ifort and ifort+mcmodel=medium is a bit stronger.

komatits commented 10 years ago

Hi all,

It is Qmu_store() that differs.

(then tau_e_store() also differs, but that is normal because it is computed based on Qmu_store()).

Dimitri.

On 21/04/2014 00:19, Dimitri Komatitsch wrote:

Hi all,

I have just found it.

I dumped all the arrays we get at the end of the mesher, from two runs that differ: ifort, and ifort + mcmodel=medium. Initially I dumped all the density and velocity arrays to disk and used "diff" to compare them, and found no difference. But in two new runs I have just dumped:

write(937,_) ispec_istiso write(937,) c11store,c12store,c13store,c14store,c15store,c16store,c22store, &

c23store,c24store,c25store,c26store,c33store,c34store,c35store, & c36store,c44store,c45store,c46store,c55store,c56store,c66store write(937,_) Qmustore write(937,) tau_e_store

and I see some differences.

SPECFEM / specfem3d

routine meshfem3D_models_getatten_val() from Hejun Zhu is unreliable and gives wrong results in SPECFEM3D_GLOBE #8