SPECFEM / specfem3d_globe

SPECFEM3D_GLOBE simulates global and regional (continental-scale) seismic wave propagation.
GNU General Public License v3.0

Inconsistent AK135 results with different compilers #202

Closed: QuLogic closed this issue 9 years ago

QuLogic commented 10 years ago

Upon switching systems (and compilers), I've noticed that the resulting seismograms do not seem to match any more. The change was from a system using xlf to one using ifort.

I originally suspected the -qnostrict option to xlf was the cause of the discrepancy, but further testing shows that while this option makes a difference, it is not the consistent cause of the problem. The results appear to vary depending on the NEX chosen.

I've run tests against four compiler/option combinations.

You can view the analysis and some additional plots here: NEX=96, NEX=144, NEX=192, NEX=240 or at this repository. I did not upload all the waveforms since they take up about 3.5 GB.

QuLogic commented 10 years ago

Just to be clear, I ran these tests with 2a2b3303e28ec46064fd33efa5d055aef2f5f442, which should be the latest commit on the devel branch, but I also saw the differences with previous commits. You can also obtain the Par_file, CMTSOLUTION, and STATIONS from the mentioned repository.

QuLogic commented 10 years ago

Tried decreasing DT to 25% of default, but no luck there...

QuLogic commented 10 years ago

Tried swapping databases between ifort and gfortran. No dice. Also wrote some scripts to compare the databases and don't see anything major. So the mesher appears to be working the same in both cases.

Any other ideas, @komatits?

QuLogic commented 10 years ago

I made one last try with all the options enabled (OCEANS, ELLIPTICITY, TOPOGRAPHY, GRAVITY, ROTATION, ATTENUATION), but this does not correct things.

komatits commented 10 years ago

Did you try running your example (in particular the mesher) with ifort -check all -debug -g -O0 -fp-stack-check -traceback -ftrapuv ?
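
A debug rebuild with those flags might look like the following (the configure variable names here, FC/MPIFC/FLAGS_CHECK, are assumptions about the build system; adjust to however your build passes Fortran flags):

```shell
# Reconfigure and rebuild with full runtime checking enabled.
# FLAGS_CHECK is an assumed configure variable name, shown for illustration:
./configure FC=ifort MPIFC=mpif90 \
    FLAGS_CHECK="-check all -debug -g -O0 -fp-stack-check -traceback -ftrapuv"
make clean
make all
```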

Dimitri.

Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr

komatits commented 10 years ago

And with xlf you can try this:

-g -O0 -C -qddim -qfullpath -qflttrap=overflow:zerodivide:invalid:enable -qfloat=nans -qinitauto=7FBFFFFF

You can also try -qsave and -qnosave.

Also make sure you have

ulimit -S -s unlimited

in your .bashrc

Dimitri.

QuLogic commented 10 years ago

I think I may have tried that with ifort, but I've set it up again to see the results.

With xlf, the mesher runs without issue. The solver crashes at this point:

   0:#
   0:+++ID Node 0 Process 299146 Thread 1
   0:***FAULT "SIGTRAP - Trace trap"
   0:+++STACK
   0:compute_forces_crust_mantle_dev : 338 # in file <.../src/specfem3D/compute_forces_crust_mantle_Dev.F90>
   0:compute_forces_viscoelastic : 84 # in file <.../src/specfem3D/compute_forces_viscoelastic_calling_routine.F90>
   0:iterate_time : 165 # in file <.../src/specfem3D/iterate_time.F90>
   0:xspecfem3d : 473 # in file <.../src/specfem3D/specfem3D.F90>
   0:---STACK
   0:---ID Node 0 Process 299146 Thread 1
   0:#
QuLogic commented 10 years ago

Hmm, I notice that the block in which xlf fails uses forced vectorization. I am going to try once more with vectorization disabled and see if that corrects matters.

komatits commented 10 years ago

It will. (To use range checking for arrays, i.e. the debugging options I sent you, you need to turn force_vectorization off, because force_vectorization purposely goes beyond individual array bounds in order to use one large 1D loop instead of three smaller nested loops.)

QuLogic commented 10 years ago

Running without vectorization and with the debug options seems to show the same differences. There were no errors when the debug options were added. Will try some save options next...

QuLogic commented 10 years ago

Using various save options (-fno-automatic on gfortran, -save on ifort, -qsave on xlf) produces the same inconsistency.

QuLogic commented 10 years ago

Tried with v5.1.5. In that case, it looks like everything matches except ifort. That's a pretty ancient version though, so it may take a while to narrow down the problem.

komatits commented 10 years ago

Are you using version 13.0 or 14.0? If so, it is probably a compiler bug; that compiler often has bugs in dot zero versions, thus we only use dot one or above. I have seen that problem many times in the past.

komatits commented 10 years ago

You can also try your current version but with -O0 or -O1 instead of -O3 or -fast; if that fixes the problem then switch back to an older (dot one or above) release of the compiler and you should be fine.

QuLogic commented 10 years ago

ifort 12.1.3.293 Build 20120212, but I'll see if 13 or 14 show any difference.

QuLogic commented 9 years ago

I "finished" a bisect, but it unfortunately ended up in the un-compilable sunflower branch. (I wrote a script to test-compile each revision this time, with almost no success.)

So the issue seems to have arisen between c97fe608732836f956043f41d3e8085c474d4ffe (v5.1.3 + 1 commit) and d1de5ba7ee2da8b5bdcfaca1f561755e8a1f6be6 (very close to sunflower's merge). I guess I'll have to go through the diff by hand now. It's probably some sort of uninitialized variable similar to the problem with s362ani.
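
For anyone repeating this, a `git bisect run` helper can skip the un-compilable revisions automatically by returning exit code 125. A sketch with placeholder build/compare commands:

```shell
#!/bin/sh
# Sketch of a `git bisect run` helper for this kind of search.  The
# build and comparison commands are placeholders; the exit-code
# convention is git's: 0 = good, 1-124 = bad, 125 = skip this revision
# (e.g. one on the un-compilable branch).
bisect_step() {
    build_cmd=$1
    test_cmd=$2
    if ! $build_cmd; then
        return 125    # cannot compile this revision: tell bisect to skip it
    fi
    if $test_cmd; then
        return 0      # seismograms match the reference: good
    else
        return 1      # mismatch reproduced: bad
    fi
}

# Demonstration with stand-in commands:
bisect_step true  true;  echo "good -> $?"
bisect_step true  false; echo "bad  -> $?"
bisect_step false true;  echo "skip -> $?"
```

With real build and seismogram-comparison commands substituted, this would be invoked as `git bisect run ./bisect-step.sh`.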

QuLogic commented 9 years ago

While I was unable to pinpoint the cause with the bisect, I thought it might be interesting to look at the effect on run time over the tested period. Note that the xlf and xlf_strict results were obtained on a different machine, so you shouldn't compare them directly with the others. I used the default flags for everything except xlf_strict, where I changed -qnostrict to -qstrict. Also, these should be taken as informational only, since I didn't do any scientific testing (like repeating the runs to rule out random node issues).

Here are the run times (seconds per step) for NEX=96 (plot: runtime-96). I'm not sure what's up with that first result, but things seem consistent over the years. gfortran seems a bit worse at the end there, and maybe ifort a little, too; that could be noise, though.

Here are the run times for NEX=144 (plot: runtime-144). Again pretty consistent; no obvious slowdowns.

Here are the run times for NEX=192 and 240 (plots: runtime-192, runtime-240). These two took longer, so once I knew whether or not they were working, I cancelled them. Incomplete jobs are marked with an x instead of an o. I also didn't bother to run as many of the revisions once I had the smaller results, so there are fewer points on these graphs. Things are still pretty consistent, though (except for that one odd xlf result).

Again, these were not really scientific tests, but at least to a first approximation, it looks like run times are fairly consistent over the development period.

komatits commented 9 years ago

Hi Elliott,

You could do two things:

1/ compute the seismograms with Mineos or DSM for ak135 to see which of the seismograms you show at https://github.com/geodynamics/specfem3d_globe/issues/202 are correct and which ones are wrong

2/ send me the Par_file, CMTSOLUTION and STATIONS for NEX = 96 and for NEX = 240, I will run them on my machines (with ifort v13 and with gfortran) and will send you the seismograms

PS: why do you compare CPU times in the curves below? I guess only the seismograms matter; the CPU time taken by the mesher seems irrelevant for tracking this bug, considering that the variations you get are very small.

Thanks, Dimitri.

QuLogic commented 9 years ago

> 1/ compute the seismograms with Mineos or DSM for ak135 to see which of the seismograms you show at https://github.com/geodynamics/specfem3d_globe/issues/202 are correct and which ones are wrong

Yes, I might try this at some point. Mineos needs a bit of cleanup first, though.

> 2/ send me the Par_file, CMTSOLUTION and STATIONS for NEX = 96 and for NEX = 240, I will run them on my machines (with ifort v13 and with gfortran) and will send you the seismograms

They are available on the mentioned repository.

> PS: why do you compare CPU times in the curves below; I guess only seismograms matter, the CPU time taken by the mesher seems irrelevant to track this bug (?), considering that the variations you get are very small

Just general interest, since I tested out 2+ years of work. Of course, it doesn't have any bearing on the results.

komatits commented 9 years ago

Hi Elliott,

OK, thanks. I am going to run the two cases on my clusters and send you the gfortran, pgf90 and ifort seismograms tomorrow.

If Mineos is not easy to use, you can use GEMINI (I can send you the source code; it is very clean and easy to use). DSM from Prof. Takeuchi's web page is also clean and easy to use; however, if I remember correctly it requires the input model to be given as polynomials, and (if I also remember correctly) ak135 is not, it is defined pointwise. However, a colleague of mine has created an interpolated polynomial version of ak135 that I have somewhere and could send you if you want.

Dimitri.

komatits commented 9 years ago

Fixed by Elliott @QuLogic in https://github.com/geodynamics/specfem3d_globe/pull/273