etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Problems compiling with SSE intrinsics #393

Open Irubataru opened 6 years ago

Irubataru commented 6 years ago

I'm having problems compiling tmLQCD with SSE intrinsics on my computer. I run configure with the following arguments:

${dir}/srcs/tmLQCD/configure \
  --prefix=$HOME/.usr \
  --enable-mpi \
  --enable-sse2 \
  --enable-sse3 \
  --with-mpidimension=4 \
  --enable-gaugecopy \
  --with-limedir=$HOME/.usr \
  --with-lapack="-L/opt/intel/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" \
  CC=mpicc CXX=mpicxx

The configure script runs fine, but when I run make I get a lot of error messages along the lines of

In file included from ${dir}/srcs/tmLQCD/global.h:61:0,
                 from ${dir}/srcs/tmLQCD/operator/tm_operators.c:32:
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c: In function ‘mul_one_pm_imu_sub_mul’:
${dir}/srcs/tmLQCD/sse.h:276:1: error: memory input 0 is not directly addressable
 __asm__ __volatile__ ("movapd %0, %%xmm0 \n\t" \
 ^
${dir}/srcs/tmLQCD/su3.h:176:1: note: in expansion of macro ‘_sse_load’
 _sse_load(s1); \
 ^
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c:38:5: note: in expansion of macro ‘_vector_sub’
     _vector_sub(t->s0, phi1, s->s0);
     ^
${dir}/srcs/tmLQCD/sse.h:276:1: error: memory input 1 is not directly addressable
 __asm__ __volatile__ ("movapd %0, %%xmm0 \n\t" \
 ^
${dir}/srcs/tmLQCD/su3.h:176:1: note: in expansion of macro ‘_sse_load’
 _sse_load(s1); \
 ^
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c:38:5: note: in expansion of macro ‘_vector_sub’
     _vector_sub(t->s0, phi1, s->s0);
     ^
${dir}/srcs/tmLQCD/sse.h:276:1: error: memory input 2 is not directly addressable
 __asm__ __volatile__ ("movapd %0, %%xmm0 \n\t" \
 ^
${dir}/srcs/tmLQCD/su3.h:176:1: note: in expansion of macro ‘_sse_load’
 _sse_load(s1); \
 ^
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c:38:5: note: in expansion of macro ‘_vector_sub’
     _vector_sub(t->s0, phi1, s->s0);
     ^
${dir}/srcs/tmLQCD/sse.h:276:1: error: memory input 0 is not directly addressable
 __asm__ __volatile__ ("movapd %0, %%xmm0 \n\t" \
 ^

I am compiling this on a machine with an Intel Core i7-6700 using gcc version 5.4.0, and I am not sure what is going wrong. If I compile it without any SSE instructions, it compiles without any problems. I got the exact same error messages compiling it on an Intel Xeon E5-2690 with gcc 5.3.0 with SSE. I am compiling git rev 946cfdf.

I have the full logs available here.

martin-ueding commented 6 years ago

The Intel chips have SSE2, SSSE3 and SSE4 but no SSE3 which seems to be exclusive to AMD CPUs. Perhaps that is the issue?

Does lscpu | grep -i sse3 list anything besides ssse3?
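
(For reference: on many Linux systems SSE3 is reported as "pni" (Prescott New Instructions) rather than "sse3" in the CPU flag list, so grepping for sse3 will only ever match ssse3 even on CPUs that do support SSE3. A minimal check against /proc/cpuinfo, assuming a standard Linux setup:)

# SSE3 usually appears as "pni" in the flags line; list the SIMD-related flags
grep -m1 -wo -E 'pni|sse2|ssse3|sse4_1|sse4_2|avx2?' /proc/cpuinfo | sort -u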

Irubataru commented 6 years ago

No, it does not, so yes that seems to be a slight issue. I already tried compiling with SSE2 only, and I get the same error messages. I also tried now with SSE2 and explicitly disabling SSE3, and it is identical.

Irubataru commented 6 years ago

Also, I should have mentioned this in the original issue, but there are also compile errors about undeclared variables; they are probably another hint that the macros aren't set up properly:

mpicc -DHAVE_CONFIG_H -I${HOME}/include/ -I. -I${dir}/builds/tmLQCD/sse2/  -I${dir}/srcs/tmLQCD/ -I${HOME}/.usr/include/ -I/include/ -g -O2 -fopenmp -pedantic -Wall -mfpmath=387 -O -c ${dir}/srcs/tmLQCD/operator/tm_operators.c
In file included from ${dir}/srcs/tmLQCD/global.h:44:0,
                 from ${dir}/srcs/tmLQCD/operator/tm_operators.c:32:
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c: In function ‘mul_one_pm_imu_sub_mul’:
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c:33:27: error: ‘phi1’ undeclared (first use in this function)
     _complex_times_vector(phi1, z, r->s0);
                           ^
${dir}/srcs/tmLQCD/su3.h:679:4: note: in definition of macro ‘_complex_times_vector’
    x.c0 = (c) * (y).c0;   \
    ^
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c:33:27: note: each undeclared identifier is reported only once for each function it appears in
     _complex_times_vector(phi1, z, r->s0);
                           ^
${dir}/srcs/tmLQCD/su3.h:679:4: note: in definition of macro ‘_complex_times_vector’
    x.c0 = (c) * (y).c0;   \
    ^
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c:34:27: error: ‘phi2’ undeclared (first use in this function)
     _complex_times_vector(phi2, z, r->s1);
                           ^
${dir}/srcs/tmLQCD/su3.h:679:4: note: in definition of macro ‘_complex_times_vector’
    x.c0 = (c) * (y).c0;   \
    ^
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c:35:27: error: ‘phi3’ undeclared (first use in this function)
     _complex_times_vector(phi3, w, r->s2);
                           ^
${dir}/srcs/tmLQCD/su3.h:679:4: note: in definition of macro ‘_complex_times_vector’
    x.c0 = (c) * (y).c0;   \
    ^
${dir}/srcs/tmLQCD/operator/mul_one_pm_imu_sub_mul_body.c:36:27: error: ‘phi4’ undeclared (first use in this function)
     _complex_times_vector(phi4, w, r->s3);
                           ^
${dir}/srcs/tmLQCD/su3.h:679:4: note: in definition of macro ‘_complex_times_vector’
    x.c0 = (c) * (y).c0;   \
    ^
urbach commented 6 years ago

I can confirm the problem, I'll see how to fix it.

kostrzewa commented 6 years ago

In the meantime, if you have access to an Intel compiler, just disable the SSE flags and use the Intel compiler. Performance is a bit better anyway.

urbach commented 6 years ago

Uhh, no one is using SSE intrinsics anymore, it seems. The bug is not present in 763a1fd31fb1f429677f5d1bf496590c4804d9bd

urbach commented 6 years ago

and also not here 74f93b7f470d3f1c6f07edb14d61f0a64346f756

urbach commented 6 years ago

Sorry, trying to locate the problem, it's also not yet here ea2b7ab0a064512ecec2468f7fc4b71a22a8f91a

urbach commented 6 years ago

and not here 9e46ed9c6a6213baedf0bd0f6c427e13ddc13423

urbach commented 6 years ago

It seems 439db5281e5bea176ae8b318ea14341fc4081608 is the first commit showing the problem

kostrzewa commented 6 years ago

@urbach if/when you fix this, could you please do so against https://github.com/kostrzewa/tmLQCD/tree/qphix_devel_hmc ? It might be that the interface changes are transparent to this issue, but it might be that they are not.

Irubataru commented 6 years ago

Hmm, I see. In the meantime I compiled it with the Intel compiler for the server I'm running it on. I was assuming that enabling SSE would help a bit with performance, as it was fairly abysmal without. The Intel compiler seems to have sped it up by a factor of two, which is good. However, now one of my runs gets stuck in its initial configuration, with no trajectories accepted. This did not happen before, and I have to admit it leaves me a bit worried. Oh well, I'll try to see if I have done something stupid.

kostrzewa commented 6 years ago

If you post an input file, we can run it if you'd like. FYI: there are big changes coming for running on Intel machines with full support for QPhiX solvers, which speed things up by about a factor of 2 on AVX2 machines and about a factor of 2-4 (depending on parallelisation) on KNL. Also, since you're doing HMC, you should certainly consider using the DDalphaAMG solver to get rid of critical slowing down.

If you want to share, I would be very interested in your plans. (It is a pretty rare occurrence to see new "faces" around here, so: welcome!)

kostrzewa commented 6 years ago

As for your acceptance problem, did you create a new build directory? There are some issues with make clean not cleaning up properly, even though a fresh run of configure was done.
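
(One way to rule this out completely is to build from scratch in an empty, out-of-tree directory instead of relying on make clean; a minimal sketch, with the directory names purely hypothetical:)

# hypothetical paths; the point is just to configure from scratch in an empty directory
rm -rf ${dir}/builds/tmLQCD/intel
mkdir -p ${dir}/builds/tmLQCD/intel
cd ${dir}/builds/tmLQCD/intel
${dir}/srcs/tmLQCD/configure ...   # same options as before
make -j4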

kostrzewa commented 6 years ago

PS: you might also want to consider running with --enable-halfspinor if you plan to go over multiple nodes.

Irubataru commented 6 years ago

Sure, I can tell you what I'm up to. Basically I am investigating an apparent discrepancy between chroma and openqcd which I have been struggling with for a while now. So I wanted to try a completely independent code to see which one is "wrong", as I haven't been able to settle that by testing the two source codes.

So my runs are generally not very big or complicated, and the parameters aren't physical. The only thing I want is to run a lattice code with clover and gauge rectangulars either on or off and measure the average plaquette.

The input file I am testing is test13.input, which is a fairly simple adaptation of one of the sample ones you provide. Performance is not critical to me, but I have quite a few of these I need to run, and it is a lot easier for my workflow if a test takes half a day (which it does with the two other codes) instead of 1.2 days (which is what it takes for me with a naive compilation with GNU and no SSE).

The plaquette in this case should be around 0.094, but when I run the one I compiled with intel it starts at 0.126 and gets stuck there (the GNU compiled one starts at the right value straight away and has its first step accepted).

Also, yes I made a new build directory. And, no, I run the simulation on a single node. They are only 8x8x8x8 and 20k trajectories.

urbach commented 6 years ago

for the logs, the last commit which is fine is 5f49bd4f15d1babaf75ced121712a1945d118f24
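
(Given a known good and a known bad commit, this kind of search can also be automated with git bisect; a minimal sketch, assuming an out-of-tree build directory ${builddir} that reproduces the compile error, using the two commits quoted in this thread; depending on the revision, configure may need to be re-run in ${builddir} between steps:)

cd ${dir}/srcs/tmLQCD
git bisect start
git bisect bad  439db5281e5bea176ae8b318ea14341fc4081608   # first commit reported broken above
git bisect good 5f49bd4f15d1babaf75ced121712a1945d118f24   # last commit reported fine above
# rebuild at each step; a non-zero exit marks the revision as bad
git bisect run sh -c "cd ${builddir} && make clean && make"
git bisect reset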

martin-ueding commented 6 years ago

One source of discrepancy between the codes might be some factor that is taken out of β. I had this issue with Chroma, where I wanted to simulate something at β = 3.3 from a European group. Eventually I figured out that for the Lüscher-Weisz tree-level improved action I would need to supply β = 5.5 in order to get the correct result (factor 5/3).

Irubataru commented 6 years ago

The discrepancy currently seems to be in the clover term, but thank you for the suggestion. I have spent quite a while reading and testing components of the source codes now, and it seems to be quite subtle. E.g. when I compare the clover 6x6 matrices, I get the same numbers (to working precision). But I am hoping that running tmLQCD might help clear things up a bit.

But yeah, I was also caught a bit off guard the first time I read through their gauge action and realised they had hard-coded in a specific improvement scheme instead of providing tunable parameters.

urbach commented 6 years ago

Our plaquette equals 1 if all gauge fields are unity.

urbach commented 6 years ago

For a fix see #394. Please confirm, thanks

Irubataru commented 6 years ago

It does indeed compile and run now, thank you.

Would it be possible for any of you to run the input file I gave with code compiled using the Intel compiler? You should see almost immediately whether you have the same problems as I do. With the code compiled with the GNU compiler, the first 10 steps for me are

00000000 0.094211403059 0.153055847917 8.580818e-01 95 2820 1 9.943145e+00
00000001 0.092449960395 -0.067434718919 1.069760e+00 40 1274 1 4.797375e+00
00000002 0.092963576018 0.030466057113 9.699934e-01 40 1278 1 4.857091e+00
00000003 0.095645303817 -0.034477439300 1.035079e+00 40 1274 1 4.793442e+00
00000004 0.092947999626 -0.020744363799 1.020961e+00 40 1282 1 4.887091e+00
00000005 0.094965405998 0.039477254498 9.612918e-01 40 1279 1 4.869354e+00
00000006 0.095624417794 -0.016252125987 1.016385e+00 40 1282 1 4.837924e+00
00000007 0.095189793667 0.001982775192 9.980192e-01 40 1277 1 4.846736e+00
00000008 0.093657263717 0.015312612675 9.848040e-01 39 1277 1 4.841713e+00
00000009 0.091946300053 -0.008411047394 1.008447e+00 40 1281 1 4.843856e+00

while for the same code compiled with the Intel compiler and the same input file (and same cluster) I get

00000000 0.126122261074 3684.012770289443 0.000000e+00 96 2571 0 6.552157e+00
00000001 0.126122261074 3348.000831292767 0.000000e+00 38 1265 0 4.307279e+00
00000002 0.126122261074 3436.998691494939 0.000000e+00 39 1272 0 4.311206e+00
00000003 0.126122261074 3453.831385205559 0.000000e+00 38 1249 0 4.300966e+00
00000004 0.126122261074 3611.606860164020 0.000000e+00 39 1237 0 4.297565e+00
00000005 0.126122261074 3036.252916574260 0.000000e+00 39 1258 0 4.320671e+00
00000006 0.126122261074 3295.793037785319 0.000000e+00 39 1235 0 4.297047e+00
00000007 0.126122261074 3430.824307324229 0.000000e+00 39 1245 0 4.301193e+00
00000008 0.126122261074 3192.636978319953 0.000000e+00 38 1230 0 4.292311e+00
00000009 0.126122261074 3339.322717444231 0.000000e+00 38 1234 0 4.295579e+00

I'm just wondering whether I'm doing something stupid or if the problem might be my installation or compilation. This obviously does not relate to the issue itself, so if you'd rather close it then go ahead.
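
(For a quick side-by-side comparison of two such runs: the second column of this output is the plaquette and the seventh appears to be the accept flag. A small sketch using bash process substitution, with gnu.log and intel.log as hypothetical names for the two outputs pasted above:)

# print trajectory, plaquette and accept flag from both runs next to each other
paste <(awk '{print $1, $2, $7}' gnu.log) <(awk '{print $2, $7}' intel.log)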

kostrzewa commented 6 years ago

So, I used the Wilson plaquette gauge monomial (instead of the user one) and I get:

00000000 0.096336763057 0.142530017384 8.671615e-01 113 3454 1 2.674270e+01
00000001 0.094102486918 -0.020055026184 1.020257e+00 48 1617 1 1.397440e+01
00000002 0.092439508985 0.092536157590 9.116162e-01 48 1617 1 1.434853e+01
00000003 0.091307252448 -0.010217005896 1.010269e+00 48 1617 1 1.431104e+01
00000004 0.093083556246 -0.003247045978 1.003252e+00 48 1617 1 1.437012e+01
00000005 0.093743322748 0.001130274253 9.988704e-01 48 1617 1 1.358467e+01
00000006 0.095275520940 -0.049523409176 1.050770e+00 48 1617 1 1.421133e+01
00000007 0.092500801963 0.032870415802 9.676639e-01 48 1617 1 1.471422e+01
00000008 0.095852287637 -0.048931282463 1.050148e+00 48 1617 1 1.432282e+01
00000009 0.096112735558 -0.082923961554 1.086459e+00 48 1617 1 1.444882e+01
00000010 0.097327086177 -0.033161894626 1.033718e+00 48 1617 1 1.415402e+01
00000011 0.093038902208 -0.019058199742 1.019241e+00 48 1617 1 1.438139e+01
00000012 0.093132644870 0.035045752303 9.655612e-01 48 1617 1 1.489692e+01
00000013 0.095258635237 0.004465281692 9.955447e-01 48 1617 1 1.686278e+01
00000014 0.095430818433 -0.009752198029 1.009800e+00 48 1617 1 1.450057e+01
00000015 0.098066080771 -0.019231901421 1.019418e+00 48 1617 1 1.533546e+01

This was using QPhiX mixed-precision solvers with the code compiled without SSE using a GNU compiler, just to have a cross-check. Depending on your configure flags, you might get issues with alignment.

@urbach: are you doing the Intel compiler cross-check?

Irubataru commented 6 years ago

Btw, just as a reference. I have also run this input using the code compiled with the Intel compiler, and there I have no issues, so it might not be alignment. The difference between the two are CSW and RectangleCoefficient.

urbach commented 6 years ago

Btw, just as a reference. I have also run this input using the code compiled with the Intel compiler, and there I have no issues, so it might not be alignment. The difference between the two are CSW and RectangleCoefficient.

Sorry, what do you mean by "The difference between the two are CSW and RectangleCoefficient." ?


urbach commented 6 years ago

Did you see the comments I made to your first input file?

There are gauge actions tailored for Wilson plaquette and TiSym. You don't need this 'user' type, even though I don't think that it will make a difference.

And you need to set

2kappamu = 0.

if you want clover without twisted mass.

Irubataru commented 6 years ago

Sorry, I did not see the comments until now. I will add the 2kappamu argument where it is missing, thanks for the heads up.

None of this should change the strange behaviour I have with the Intel compiled code, right? (I checked and it didn't change anything for me)

What I meant by "The difference between the two are CSW and RectangleCoefficient." is that I have run two different configs, test12.input and test13.input, and that the only difference between the two is that test12 has no clover term, while test13 has no rectangulars in the gauge action. I have run both inputs with both the GNU compiled and the Intel compiled codes. For test12 they agree, while for test13 I get the results I pasted above.

Sorry if the inputs look a bit strange to you; as I mentioned earlier, I am not trying to do a lattice study of any physical system, it was simply a random selection of parameters I have been using for a couple of tests.

urbach commented 6 years ago

With the intel compiler and test13.input, I obtain

00000000 0.096336763031 0.142530632316 8.671610e-01 96 2810 1 2.317842e+01
00000001 0.094102486893 -0.020055737842 1.020258e+00 40 1287 1 1.111337e+01
00000002 0.092439508965 0.092536500380 9.116159e-01 40 1274 1 1.105958e+01
00000003 0.091307252445 -0.010215360031 1.010268e+00 40 1276 1 1.109741e+01
00000004 0.093083556090 -0.003245009629 1.003250e+00 40 1274 1 1.115518e+01
00000005 0.093743322706 0.001130569216 9.988701e-01 40 1282 1 1.116119e+01
00000006 0.095275520797 -0.049523151264 1.050770e+00 40 1279 1 1.102487e+01
00000007 0.092500801852 0.032869215065 9.676651e-01 40 1274 1 1.100140e+01
00000008 0.095852287830 -0.048930342837 1.050147e+00 40 1274 1 1.097376e+01
00000009 0.096112735075 -0.082924068887 1.086459e+00 40 1274 1 1.098632e+01

urbach commented 6 years ago

With test12.input

00000000 0.154697014287 -0.070011844567 1.072521e+00 76 2300 1 2.123997e+01 1.844078e-02
00000001 0.160085060706 0.006687546100 9.933348e-01 31 1029 1 1.015817e+01 1.889369e-02
00000002 0.159087213411 0.010274955803 9.897777e-01 31 1029 1 1.014312e+01 1.816977e-02
00000003 0.158633089718 0.005527105281 9.944881e-01 31 1029 1 1.018453e+01 1.916364e-02

urbach commented 6 years ago

Looks all okay to me...

Irubataru commented 6 years ago

Must be something wrong with my compilation then, thank you. I will keep digging.

kostrzewa commented 6 years ago

When compiling with the Intel compiler on an AVX2 machine (and using -xCORE-AVX2 as an optimisation target), I'm guessing that the flag --enable-alignment=32 should be passed, otherwise correctness cannot be guaranteed.

Here's a typical configure line for a hybrid OpenMP/MPI build and statically linked MKL:

configure --with-limedir=${LIMEDIR} --with-mpidimension=4 \
--enable-omp --enable-mpi --disable-sse2 --disable-sse3 \
--with-lapack="-Wl,--start-group /usr/local/software/jureca/Stage3/software/Toolchain/ipsmpi/2015.07/imkl/11.2.3.187/mkl/lib/intel64/libmkl_intel_lp64.a /usr/local/software/jureca/Stage3/software/Toolchain/ipsmpi/2015.07/imkl/11.2.3.187/mkl/lib/intel64/libmkl_core.a /usr/local/software/jureca/Stage3/software/Toolchain/ipsmpi/2015.07/imkl/11.2.3.187/mkl/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm" \
--enable-halfspinor --enable-gaugecopy \
CC=mpicc CFLAGS="-O3 -xCORE-AVX2 -std=c99 -qopenmp"
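
(Putting that together with the alignment remark above: a hedged sketch of what such a configure call might look like for an AVX2 machine with the Intel toolchain; the compiler wrapper name, ${LIMEDIR} and ${MKL_LINK_LINE} are placeholders that will differ per system:)

# ${MKL_LINK_LINE} stands for an MKL link line like the one shown above
${dir}/srcs/tmLQCD/configure --with-limedir=${LIMEDIR} --with-mpidimension=4 \
  --enable-omp --enable-mpi --disable-sse2 --disable-sse3 \
  --enable-halfspinor --enable-gaugecopy \
  --enable-alignment=32 \
  --with-lapack="${MKL_LINK_LINE}" \
  CC=mpiicc CFLAGS="-O3 -xCORE-AVX2 -std=c99 -qopenmp"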

urbach commented 6 years ago

Do you have any clue what is going on with the Intel compiler?

Irubataru commented 6 years ago

I still haven't tried adding the alignment flag to the compilation; I can do it when I have the time. I did eventually get the results I needed, so I haven't spent any more time troubleshooting. But regardless, the main issue has been solved, as the SSE flags work now.