Closed: QuLogic closed this issue 9 years ago.
Just to be clear, I ran these tests with 2a2b3303e28ec46064fd33efa5d055aef2f5f442, which should be the latest commit on the devel branch, but I also saw the differences with previous commits. You can also obtain the Par_file, CMTSOLUTION, and STATIONS from the mentioned repository.
Tried decreasing DT to 25% of default, but no luck there...
Tried swapping databases between ifort and gfortran. No dice. Also wrote some scripts to compare the databases and don't see anything major. So the mesher appears to be working the same in both cases.
Any other ideas, @komatits?
I made one last try with all the options enabled (OCEANS, ELLIPTICITY, TOPOGRAPHY, GRAVITY, ROTATION, ATTENUATION; see http://nbviewer.ipython.org/github/QuLogic/ak135-test/blob/master/96%20NEX%20Analysis%20-%20Full%20Options.ipynb), but this does not correct things.
Did you try running your example (in particular the mesher) with ifort -check all -debug -g -O0 -fp-stack-check -traceback -ftrapuv?
Dimitri.
Dimitri Komatitsch CNRS Research Director (DR CNRS), Laboratory of Mechanics and Acoustics, UPR 7051, Marseille, France http://komatitsch.free.fr
And with xlf you can try this:
-g -O0 -C -qddim -qfullpath -qflttrap=overflow:zerodivide:invalid:enable -qfloat=nans -qinitauto=7FBFFFFF
You can also try -qsave and -qnosave.
Also make sure you have
ulimit -S -s unlimited
in your .bashrc
Dimitri.
I think I may have tried that with ifort, but I've set it up again to see the results.
With xlf, the mesher runs without issue. The solver crashes at this point:
0:#
0:+++ID Node 0 Process 299146 Thread 1
0:***FAULT "SIGTRAP - Trace trap"
0:+++STACK
0:compute_forces_crust_mantle_dev : 338 # in file <.../src/specfem3D/compute_forces_crust_mantle_Dev.F90>
0:compute_forces_viscoelastic : 84 # in file <.../src/specfem3D/compute_forces_viscoelastic_calling_routine.F90>
0:iterate_time : 165 # in file <.../src/specfem3D/iterate_time.F90>
0:xspecfem3d : 473 # in file <.../src/specfem3D/specfem3D.F90>
0:---STACK
0:---ID Node 0 Process 299146 Thread 1
0:#
Hmm, I notice that the block in which xlf fails uses forced vectorization. I am going to try once more with vectorization disabled and see if that corrects matters.
It will. (To use range checking for arrays, i.e. to use the debugging options I sent you, you need to turn force_vectorization off, because force_vectorization purposely goes beyond array bounds in order to use large 1D loops instead of three smaller nested loops.)
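As an illustration of why that idiom trips a bounds checker (a NumPy sketch, not the actual SPECFEM code): a single flat loop touches the same contiguous storage as the three nested loops, but addressing all of it through the first dimension, the way the flattened Fortran loops do with a(ijk,1,1), exceeds that dimension's declared bound.

```python
import numpy as np

NGLL = 5  # points per element dimension (illustrative value)
a = np.arange(NGLL**3, dtype=float).reshape(NGLL, NGLL, NGLL)

# Three nested loops: every index stays within its own dimension's bound.
total_nested = 0.0
for k in range(NGLL):
    for j in range(NGLL):
        for i in range(NGLL):
            total_nested += a[i, j, k]

# One large 1D loop over the same contiguous storage gives the same result.
total_flat = sum(a.flat[ijk] for ijk in range(NGLL**3))
assert total_flat == total_nested

# But pushing the flattened index through the first dimension alone
# (the Fortran idiom a(ijk,1,1) with ijk running up to NGLL**3) exceeds
# that dimension's declared bound, which is what -check all / -C trap:
try:
    a[NGLL, 0, 0]  # first index out of range for a dimension of size NGLL
except IndexError:
    print("per-dimension bounds check fired")
```

In Fortran (without bounds checking) the out-of-range first index still lands on valid memory because the array is contiguous in column-major order, which is exactly what force_vectorization exploits.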
Running without vectorization and with debug options seems to show the same differences. There were no errors when the debug options were added. Will try with some save options...
Using various save options (-fno-automatic on gfortran, -save on ifort, -qsave on xlf) produces the same inconsistency.
Tried with v5.1.5 (http://nbviewer.ipython.org/github/QuLogic/ak135-test/blob/master/240%20NEX%20Analysis%20-%20v5.1.5.ipynb). In that case, it looks like everything matches except ifort. That's a pretty ancient version though, so it may take a while to narrow down the problem.
Are you using version 13.0 or 14.0? If so, it is probably a compiler bug; that compiler often has bugs in dot zero versions, thus we only use dot one or above. I have seen that problem many times in the past.
You can also try your current version but with -O0 or -O1 instead of -O3 or -fast; if that fixes the problem then switch back to an older (dot one or above) release of the compiler and you should be fine.
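One reason lowering optimization can change (and sometimes fix) the results: at -O3/-fast, and under xlf's -qnostrict, the compiler is allowed to reassociate floating-point operations, and IEEE-754 addition is not associative. A minimal sketch:

```python
# IEEE-754 addition is not associative, so a compiler that reassociates
# sums at -O3 / -qnostrict can legitimately produce different values
# than the source-order evaluation you get at -O0.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left - right)    # a discrepancy on the order of one ulp
```

Per-operation differences this small can then be amplified over the many time steps of the solver.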
ifort 12.1.3.293 Build 20120212, but I'll see if 13 or 14 show any difference.
I "finished" a bisect, but it unfortunately ended up in the un-compilable sunflower branch. (I wrote a script to test compiling the revisions this time, with nearly no success.)
So the issue seems to have arisen between c97fe608732836f956043f41d3e8085c474d4ffe (v5.1.3 + 1 commit) and d1de5ba7ee2da8b5bdcfaca1f561755e8a1f6be6 (very close to sunflower's merge). I guess I'll have to go through the diff by hand now. It's probably some sort of uninitialized variable similar to the problem with s362ani.
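For what it's worth, this kind of bisect can be automated with git bisect run, which treats exit code 125 as "cannot test this revision, skip it", handy for the un-compilable sunflower commits. A minimal Python sketch; the make invocation and the compare_seismograms.sh comparison script are hypothetical placeholders, not part of the repository:

```python
import subprocess

SKIP = 125  # git bisect run: exit code 125 means "cannot test this revision"

def bisect_exit_code(build_ok: bool, result_ok: bool) -> int:
    """Map build/test outcomes to the exit codes git bisect run expects."""
    if not build_ok:
        return SKIP               # e.g. revisions that do not compile
    return 0 if result_ok else 1  # 0 = good revision, 1 = bad revision

def run_one_revision() -> int:
    # Hypothetical commands: adapt to the real build, plus a script that
    # runs the solver and compares seismograms against a reference set.
    build_ok = subprocess.run(["make", "-j4"]).returncode == 0
    result_ok = build_ok and subprocess.run(["./compare_seismograms.sh"]).returncode == 0
    return bisect_exit_code(build_ok, result_ok)

# Driver (commented out so this file can be imported safely):
#   import sys; sys.exit(run_one_revision())
# then: git bisect start <bad> <good> && git bisect run python3 bisect_test.py
```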
While I was unable to pinpoint the cause with the bisect, I thought it might be interesting to look at the effect on run time over the tested period. Note, the xlf and xlf_strict results were performed on a different machine, so you shouldn't compare them directly with the others. I used the default flags for everything except xlf_strict, where I changed -qnostrict to -qstrict. Also, these should be taken as informational only, since I didn't do any scientific testing (like repeating the tests to make sure there weren't any random node issues).
Here are the run times (seconds per step) for NEX=96: https://cloud.githubusercontent.com/assets/302469/4316206/ad2798d8-3f03-11e4-96d6-363134a53c3e.png
I'm not sure what's up with that first result, but things seem consistent over the years. gfortran seems a bit worse at the end there. Maybe ifort a little bit, too. Could be noise on that one, though.
Here are run times for NEX=144: https://cloud.githubusercontent.com/assets/302469/4316213/bf39052a-3f03-11e4-9ae7-d2cd77d358a1.png
Again pretty consistent; no obvious slowdowns.
Here are run times for NEX=192 and 240: https://cloud.githubusercontent.com/assets/302469/4316214/c3a7be58-3f03-11e4-864e-3cdf5ee13df1.png https://cloud.githubusercontent.com/assets/302469/4316215/c71020ee-3f03-11e4-85af-37d3f1483ab3.png
For these two, they took longer, so once I knew they were or weren't working, I cancelled them. The incomplete jobs are marked with an x instead of an o. I also didn't bother to run as many of the revisions once I had the smaller results, so there are fewer points on these graphs. Things are still pretty consistent though (except that one odd xlf result).
Again, these were not really scientific tests, but at least to a first approximation, it looks like run times are fairly consistent over the development period.
Hi Elliott,
You could try two things:
1/ compute the seismograms with Mineos or DSM for ak135 to see which of the seismograms you show at https://github.com/geodynamics/specfem3d_globe/issues/202 are correct and which ones are wrong
2/ send me the Par_file, CMTSOLUTION and STATIONS for NEX = 96 and for NEX = 240, I will run them on my machines (with ifort v13 and with gfortran) and will send you the seismograms
PS: why do you compare CPU times in the curves below; I guess only seismograms matter, the CPU time taken by the mesher seems irrelevant to track this bug (?), considering that the variations you get are very small
Thanks, Dimitri.
1/ compute the seismograms with Mineos or DSM for ak135 to see which of the seismograms you show at https://github.com/geodynamics/specfem3d_globe/issues/202 are correct and which ones are wrong
Yes, I might try this at some point. Mineos needs a bit of cleanup first, though.
2/ send me the Par_file, CMTSOLUTION and STATIONS for NEX = 96 and for NEX = 240, I will run them on my machines (with ifort v13 and with gfortran) and will send you the seismograms
They are available on the mentioned repository: https://github.com/QuLogic/ak135-test.
PS: why do you compare CPU times in the curves below; I guess only seismograms matter, the CPU time taken by the mesher seems irrelevant to track this bug (?), considering that the variations you get are very small
Just general interest, since I tested out 2+ years of work. Of course, it doesn't have any bearing on the results.
Hi Elliott,
OK, thanks. I am going to run the two cases on my clusters and send you the gfortran, pgf90 and ifort seismograms tomorrow.
If Mineos is not easy to use, you can use GEMINI (I can send you the source code; it is very clean and easy to use). DSM from Prof. Takeuchi's web page is also clean and easy to use; however, if I remember correctly, it requires an input model given as polynomials, and (if I also remember correctly) ak135 is not; it is pointwise. However, a colleague of mine has created an interpolated polynomial version of ak135 that I have somewhere and could send you if you want.
Dimitri.
Fixed by Elliott @QuLogic in https://github.com/geodynamics/specfem3d_globe/pull/273
Upon switching systems (and compilers), I've noticed that the resulting seismograms do not seem to match any more. The change was from a system using xlf to one using ifort.
I originally suspected the -qnostrict option to xlf was the cause of the discrepancy, but further testing shows that while this option makes a difference, it is not the consistent cause of the problem. The results appear to vary depending on the NEX chosen.
I've run tests against four compiler (options):
- ifort 12.1.3 20120212
- gfortran 4.6.1
- xlf V12.1 12.01.0000.0009 with -qnostrict (the default in flags.guess)
- xlf with -qstrict
The results are as follows:
- For NEX=96, at 101°, on the North component, both ifort and xlf -qnostrict match (middle plot), but gfortran and xlf -qstrict are the negative of the first two (top and bottom plots).
- For NEX=144, there were no immediately visible differences.
- For NEX=192, at 101°, on the North component, only xlf -qnostrict (middle plot) shows any difference, but it is very small. You can see it most clearly at the extrema of the waveform.
- For NEX=240, at 90°, the results are much worse. xlf -qnostrict (middle plot) differs, but the rest appear the same; ifort is the negative of all the other compilers tested.
is the negative of all the other compilers tested:You can view the analysis and some additional plots here: NEX=96, NEX=144, NEX=192, NEX=240 or at this repository. I did not upload all the waveforms since they take up about 3.5 GB.