Issues with Intel Compiler

bjoo commented 11 years ago

I have had several issue reports involving the Intel compiler, probably related to QDP++ under Chroma (perhaps I should move or cross-list this issue to a QDP++ tracker if we ever get one):

Brendan Fahy reported this in March:

I found a very strange bug. The code in lib/actions/gauge/gaugeacts/plaq_gaugeact.cc, when compiled with the Intel compiler, gives completely wrong results for the backwards staple. I was getting completely wrong results when running a build made with the Intel compiler and was able to track it down to the staple function in the Wilson action. I was very shocked, but rewriting the multiplication using an extra temp variable seems to sort out whatever the Intel compiler was doing wrong.

This does not yet implicate the high optimization level (I have sent a query to Brendan), but Jie also had a similar issue, which we did track to using -O3. This made me think that Brendan's issue may also have been due to -O3 vs -O2.

Finally, Will Detmold reported incorrect solver convergence (and in fact nonconvergence) on Edison at NERSC, using a configuration created on Intrepid (Argonne BG/P). He was trying to continue the run. He tried a variety of optimization combinations:

(op qmp) + (op qdp++) + (op chroma + sseBICGkernels) = BAD
(unop qmp) + (unop qdp++) + (unop chroma) = GOOD
(op qmp) + (op qdp++) + (unop chroma) = BAD
(op qmp) + (unop qdp++) + (unop chroma) = GOOD
(op qmp) + (unop qdp++) + (op chroma + sseBICGkernels) = GOOD
(op qmp) + (partop qdp++) + (op chroma + sseBICGkernels) = GOOD

op means with -O3 (and -ax=avx for chroma and qdp++)

partop qdp++ means with sse but without -O3

So the issue seems to be in QDP++ when -O3 is used.

I would add that I suspect the issue is in a .cc file in QDP++: if it were in a .h file, that code would also be compiled as part of Chroma, so the (op chroma) builds would likely also be bad.

However, Will's test hopefully used a double prec build. I don't know if Jie and Brendan saw this issue in double prec or not.
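The inference from Will's table (that only the QDP++ optimization level tracks the failures) can be checked mechanically. Below is a minimal sketch of that check; the type and function names (Opt, Trial, reported_trials, implicated_components) are made up for illustration and are not part of QMP, QDP++ or Chroma, and the sseBICGkernels detail is ignored.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Optimization level of one component, as used in the report above.
enum class Opt { Unop, Partop, Op };

// One reported combination: levels for QMP, QDP++ and Chroma, plus outcome.
struct Trial {
    std::array<Opt, 3> level;  // index 0 = QMP, 1 = QDP++, 2 = Chroma
    bool bad;                  // true if the run was reported as BAD
};

// The six combinations from the report, in the order they were listed.
inline std::vector<Trial> reported_trials() {
    return {
        {{Opt::Op,   Opt::Op,     Opt::Op},   true},
        {{Opt::Unop, Opt::Unop,   Opt::Unop}, false},
        {{Opt::Op,   Opt::Op,     Opt::Unop}, true},
        {{Opt::Op,   Opt::Unop,   Opt::Unop}, false},
        {{Opt::Op,   Opt::Unop,   Opt::Op},   false},
        {{Opt::Op,   Opt::Partop, Opt::Op},   false},
    };
}

// Return the components for which "built with -O3" coincides exactly
// with "BAD" in every trial.
inline std::vector<std::string> implicated_components(
        const std::vector<Trial>& trials) {
    const std::array<std::string, 3> names = {"QMP", "QDP++", "Chroma"};
    std::vector<std::string> result;
    for (std::size_t c = 0; c < names.size(); ++c) {
        bool consistent = true;
        for (const Trial& t : trials) {
            if ((t.level[c] == Opt::Op) != t.bad) {
                consistent = false;
                break;
            }
        }
        if (consistent) {
            result.push_back(names[c]);
        }
    }
    return result;
}
```

Calling implicated_components(reported_trials()) singles out one component. This only shows a correlation within six runs; it does not say which file or expression inside that component is miscompiled.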

bjoo commented 11 years ago

NB: I was trying this on our cluster with the Intel compiler, using a double precision build with --enable-sse2 --enable-sse3 in QDP++ and in Chroma Dslash, and several regressions go wrong. I am investigating by enabling/disabling sse and OpenMP in QDP++ and Chroma to track it down.

t_leapfrog FAIL t_leapfrog.prec_1flav_clover.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_clover_stout-rel-cg-multiprec.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_clover_stout-cg-lf-clover.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_clover_stout-richardson-multiprec.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_clover_stout-rel-bicgstab-multiprec.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_clover_stout-rel-ibicgstab-multiprec.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_clover_stout-ibicgstab.candidate.xml
t_leapfrog FAIL t_leapfrog.sts_min_norm_2_dtau.candidate.xml
t_leapfrog FAIL t_leapfrog.tst_min_norm_2_dtau.candidate.xml
t_leapfrog FAIL t_leapfrog.unprec_clover.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_2flav_clover.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_2flav_clover.sfnonpt.candidate.xml
t_leapfrog FAIL t_leapfrog.lw.sfnonpt.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_2flav_clover.ee_oo_candidate.xml
t_leapfrog FAIL t_leapfrog.rect_gaugeact.candidate.xml
t_leapfrog FAIL t_leapfrog.rect_gaugeact_1.candidate.xml
t_leapfrog FAIL t_leapfrog.rect_gaugeact_c1t0.candidate.xml
t_leapfrog FAIL t_leapfrog.rect_gaugeact_omit2linkT.candidate.xml
t_leapfrog FAIL t_leapfrog.rect_gaugeact_aniso.candidate.xml
t_leapfrog FAIL t_leapfrog.two_plaq_spatial_gaugeact.candidate.xml
t_leapfrog FAIL t_leapfrog.aniso_spectrum.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_clover_stout.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_slrc.candidate.xml
t_leapfrog FAIL t_leapfrog.prec_slrc.sfnonpt.candidate.xml
t_leapfrog FAIL t_leapfrog.aniso_sym_spatial_plus_temporal.log.xml
t_leapfrog FAIL t_leapfrog.aniso_sym_spatial.candidate.xml
t_leapfrog FAIL t_leapfrog.aniso_sym_temporal.candidate.xml
purgaug FAIL purgaug.candidate.xml
purgaug FAIL purgaug.sfnonpt.candidate.xml
purgaug FAIL purgaug.2+2.candidate.xml
purgaug FAIL purgaug.2+2.1loop.candidate.xml

bjoo commented 11 years ago

NB: The various propagators for the smearing combinations all PASS, so the issue is likely in the Force term / Gauge term (similar to what Brendan Fahy spotted/reported).

bjoo commented 11 years ago

OK. Disabling sse2 and sse3 in QDP++ only (leaving them on in Chroma) seems good: all those tests now pass.

bjoo commented 11 years ago

OK, I've fixed some of this in QDP++ commit dfb24b4fe525bbe2b4e71dd71ba01bd93dfa66ac. Now parscalar-single-intel and parscalar-single-intel double regressions pass.

The issue had to do with an initialization of the __m128d type using a union in this fashion:

typedef union {
  __m128d v;
  double d[2];
} VD;

One could then do

VD x = { a, b };

However, in the CCMUL and CCMADD macros something went awry. I changed the initialization instead to just

__m128d x = _mm_set_pd( b, a );

and of course changed instances of 'x.v' to just 'x'
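For reference, here is a minimal sketch of the lane ordering involved. _mm_set_pd takes its arguments as (high, low), which is why the replacement for { a, b } is _mm_set_pd( b, a ). The helper names below (pack_low_high, unpack, cmul_sse2) are hypothetical, and cmul_sse2 is only an illustrative SSE2 complex multiply with (re, im) stored as (low, high); it is not the actual CCMUL macro from QDP++.

```cpp
#include <emmintrin.h>  // SSE2: __m128d, _mm_set_pd, _mm_storeu_pd, ...

// Pack two doubles so that a lands in the low lane and b in the high lane.
inline __m128d pack_low_high(double a, double b) {
    return _mm_set_pd(b, a);  // argument order is (high, low)
}

// Copy both lanes to memory: out[0] = low lane, out[1] = high lane.
inline void unpack(__m128d x, double out[2]) {
    _mm_storeu_pd(out, x);
}

// Illustrative complex multiply, x = (a, b) meaning a + ib, y = (c, d).
// Result is (ac - bd, ad + bc). Swapped lanes would silently give
// a wrong product, which is why the packing order matters.
inline __m128d cmul_sse2(__m128d x, __m128d y) {
    const __m128d cc = _mm_unpacklo_pd(y, y);       // (c, c)
    const __m128d dd = _mm_unpackhi_pd(y, y);       // (d, d)
    const __m128d x_swapped = _mm_shuffle_pd(x, x, 1);  // (b, a)
    const __m128d t1 = _mm_mul_pd(x, cc);           // (ac, bc)
    const __m128d t2 = _mm_mul_pd(x_swapped, dd);   // (bd, ad)
    const __m128d sign = _mm_set_pd(1.0, -1.0);     // low = -1, high = +1
    return _mm_add_pd(t1, _mm_mul_pd(t2, sign));    // (ac - bd, bc + ad)
}
```

Going through _mm_storeu_pd to read the lanes back avoids relying on reading a union member other than the one last written, so the check does not depend on the construct that went wrong here.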

I am not sure why this generated the wrong code originally, since many of the tests still use this kind of initialization. NB: This has the potential to hurt expressions of the form

adj(m1)*adj(m2)

as seen by Brendan Fahy, so this fix may solve his issues too.
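One way to cross-check an expression of this form independently of the optimized kernels is to use the identity adj(m1)*adj(m2) = adj(m2*m1) and compare the direct product against the version that goes through an explicit temporary. The sketch below does this for plain 2x2 complex matrices. Mat2, mul, adj and adj_product_mismatch are stand-ins written for this example, not QDP++ types or functions.

```cpp
#include <algorithm>
#include <array>
#include <complex>

using Cplx = std::complex<double>;
using Mat2 = std::array<std::array<Cplx, 2>, 2>;

// Plain matrix product c = a * b.
inline Mat2 mul(const Mat2& a, const Mat2& b) {
    Mat2 c{};  // value-initialized: all entries are zero
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Hermitian conjugate (transpose plus complex conjugation).
inline Mat2 adj(const Mat2& a) {
    Mat2 c{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            c[i][j] = std::conj(a[j][i]);
    return c;
}

// Largest absolute difference between adj(m1)*adj(m2) computed directly
// and adj(m2*m1) computed through an explicit temporary.
inline double adj_product_mismatch(const Mat2& m1, const Mat2& m2) {
    const Mat2 direct = mul(adj(m1), adj(m2));
    const Mat2 tmp = mul(m2, m1);  // the "extra temp variable" route
    const Mat2 via_temp = adj(tmp);
    double worst = 0.0;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            worst = std::max(worst, std::abs(direct[i][j] - via_temp[i][j]));
    return worst;
}
```

In a correct build the mismatch should be at the level of rounding error; a large value would point at the kind of miscompilation described above.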

JeffersonLab / chroma

Issues with Intel Compiler #1