etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

struct-based data layout seems to break due to padding #390

Open kostrzewa opened 6 years ago

kostrzewa commented 6 years ago

@urbach @martin-ueding

There are indications that, when compiling with -xMIC-AVX512, parts of tmLQCD produce wrong results. These could be due to the communicators, the gauge copy, the halfspinor kernel, or any number of other things which treat the structs as flat arrays.

The worst thing is that tmLQCD doesn't notice that it's doing something wrong:

Inversion on KNL with -xMIC-AVX512:

# Using even/odd preconditioning!
# Using Mixed Precision CG!
#RG_Mixed CG: Using relative precision! eps_sq: 1e-14 target_eps_sq: 3.88378e-15 
#RG_Mixed CG: N_outer: 10 
# RG_mixed CG: iter_out: 4 iter_in_sp: 35 iter_in_dp: 0
#RG_mixed CG: iter: 39 eps_sq: 1.0000e-14 t/s: 2.4459e+00
# FIXME: note the following flop counts are wrong! Consider only the time to solution!
#RG_mixed CG: flopcount (for e/o tmWilson only): t/s: 2.4459e+00 mflops_local: 441.9 mflops: 7071.2
# Constructing LEMON writer for file source.0600.00.00.inverted for append = 0
# Time spent writing 12.6 Mb was 87.3 ms.
# Writing speed: 144 Mb/s (9.01 Mb/s per MPI process).
# Inversion done in 39 iterations, squared residue = 1.800069e-15!

Identical inversion on KNL with -xCORE-AVX2

# Using even/odd preconditioning!
# Using Mixed Precision CG!
#RG_Mixed CG: Using relative precision! eps_sq: 1e-14 target_eps_sq: 3.88378e-15 
#RG_Mixed CG: N_outer: 8 
# RG_mixed CG: iter_out: 3 iter_in_sp: 284 iter_in_dp: 0
#RG_mixed CG: iter: 287 eps_sq: 1.0000e-14 t/s: 3.9596e+00
# FIXME: note the following flop counts are wrong! Consider only the time to solution!
#RG_mixed CG: flopcount (for e/o tmWilson only): t/s: 3.9596e+00 mflops_local: 1966.2 mflops: 31459.1
# Constructing LEMON writer for file source.0600.00.00.inverted for append = 0
# Time spent writing 12.6 Mb was 84.0 ms.
# Writing speed: 150 Mb/s (9.36 Mb/s per MPI process).
# Inversion done in 287 iterations, squared residue = 2.537561e-15!

Identical inversion on my laptop

# Using even/odd preconditioning!
# Using Mixed Precision CG!
#RG_Mixed CG: Using relative precision! eps_sq: 1e-14 target_eps_sq: 3.88378e-15 
#RG_Mixed CG: N_outer: 10 
# RG_mixed CG: iter_out: 4 iter_in_sp: 278 iter_in_dp: 0
#RG_mixed CG: iter: 282 eps_sq: 1.0000e-14 t/s: 1.3018e+02
# FIXME: note the following flop counts are wrong! Consider only the time to solution!
#RG_mixed CG: flopcount (for e/o tmWilson only): t/s: 1.3018e+02 mflops_local: 940.2 mflops: 940.2
# Constructing LIME writer for file source.0600.00.00.inverted for append = 0
# Time spent writing 12.6 Mb was 82.2 ms.
# Writing speed: 153 Mb/s (153 Mb/s per MPI process).
# Inversion done in 282 iterations, squared residue = 2.924034e-15!

Conclusion

The fact that I get the same result (up to rounding differences) on my laptop and the AVX2 KNL indicates that the AVX512 KNL inversion is not correct...

Either I manage to remove the padding that is added due to alignment, or we have to nuke everything to hell and back...

kostrzewa commented 6 years ago

I also checked the propagators and the one from my laptop and AVX2 on KNL agree...

kostrzewa commented 6 years ago

It should be noted that both of those were AVX2 builds, so it would probably be prudent to check without any vectorisation...

kostrzewa commented 6 years ago

Checking without any optimisation, the AVX2 propagators agree down to 1.4e-26 while the AVX512 propagator disagrees.

kostrzewa commented 6 years ago

I honestly can't see an easy way of solving this...

kostrzewa commented 6 years ago

Just to make things clear: this is without QPhiX, just tmLQCD operators...

kostrzewa commented 6 years ago

Okay, so I've narrowed it down to halfspinor: compiling without it makes things work, including the QPhiX Dslash test for all our single-flavour operators (I don't have a test for the two-flavour operators, so we'll have to live with this).

The negative effect on performance in the HMC should be minimal, so I think this is an okay workaround with the caveat in mind that this will SURELY break again in the future...

kostrzewa commented 6 years ago

I'm still checking if the IO works correctly, but I guess if the fullspinor operator works, then so will IO.

kostrzewa commented 6 years ago

Okay, so IO seems to work fine too.

sunpho84 commented 6 years ago

Which structs are you talking about? Do you mean "su3", "su32", etc.? Can we check for possible padding of the internal data? For example, by explicitly printing

su3 u;
/* member offsets in units of one colour component;
   should print 1 and 2 if no padding is present */
printf("%d\n", (int)(((char*)&u.c01 - (char*)&u.c00) / sizeof(u.c00)));
printf("%d\n", (int)(((char*)&u.c02 - (char*)&u.c00) / sizeof(u.c00)));

for the structures you are suspicious of? Then, if padding is actually present, one could change it with an appropriate local directive to the compiler... http://www.catb.org/esr/structure-packing/

kostrzewa commented 6 years ago

I tried the directives of course, as well as disabling padding globally. Neither helped, probably a consequence of MIC-AVX512.

The problem is likely the halfspinor data-type, which contains only two su3 vectors, each of which is only six doubles. I haven't gotten around to checking the addresses yet.

sunpho84 commented 6 years ago

Can you give me the configure string? I will make some test myself

sunpho84 commented 6 years ago

I've tried compiling with the following configure string:

../configure \
    '--disable-omp' \
    '--enable-mpi' \
    '--with-mpidimension=4' \
    '--enable-alignment=32' \
    '--with-lapack=-L/cineca/prod/opt/compilers/intel/pe-xe-2016/binary/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_sequential -lmkl_intel_lp64' \
    '--with-limedir=/marconi/home/userexternal/fsanfili' \
    '--with-lemondir=/marconi/home/userexternal/fsanfili' \
    '--disable-sse2' \
    '--disable-sse3' \
    '--enable-gaugecopy' \
    '--enable-halfspinor' \
    '--host=x86_64-pc-linux-gnu' \
    'CC=mpiicc' \
    'CFLAGS=-xMIC-AVX512 -O3 -std=c99' \
    'F77=ifort' \
    'CXX=icpc'

and I added to the very beginning of hmc_tm.c this piece of code:

halfspinor32 test[2];
  if(g_proc_id==0)
    {
      printf("%p expected %p\n",&test[0].s0.c0,&test[0].s0.c0+0);
      printf("%p expected %p\n",&test[0].s0.c1,&test[0].s0.c0+1);
      printf("%p expected %p\n",&test[0].s0.c2,&test[0].s0.c0+2);
      printf("%p expected %p\n",&test[0].s1.c0,&test[0].s0.c0+3);
      printf("%p expected %p\n",&test[0].s1.c1,&test[0].s0.c0+4);
      printf("%p expected %p\n",&test[0].s1.c2,&test[0].s0.c0+5);
      printf("%p expected %p\n",&test[1].s0.c0,&test[0].s0.c0+6);
      printf("%p expected %p\n",&test[1].s0.c1,&test[0].s0.c0+7);
      printf("%p expected %p\n",&test[1].s0.c2,&test[0].s0.c0+8);
      printf("%p expected %p\n",&test[1].s1.c0,&test[0].s0.c0+9);
      printf("%p expected %p\n",&test[1].s1.c1,&test[0].s0.c0+10);
      printf("%p expected %p\n",&test[1].s1.c2,&test[0].s0.c0+11);
    }

I get the following output:

0x7ffe3cdfd544 expected 0x7ffe3cdfd544
0x7ffe3cdfd54c expected 0x7ffe3cdfd54c
0x7ffe3cdfd554 expected 0x7ffe3cdfd554
0x7ffe3cdfd55c expected 0x7ffe3cdfd55c
0x7ffe3cdfd564 expected 0x7ffe3cdfd564
0x7ffe3cdfd56c expected 0x7ffe3cdfd56c
0x7ffe3cdfd574 expected 0x7ffe3cdfd574
0x7ffe3cdfd57c expected 0x7ffe3cdfd57c
0x7ffe3cdfd584 expected 0x7ffe3cdfd584
0x7ffe3cdfd58c expected 0x7ffe3cdfd58c
0x7ffe3cdfd594 expected 0x7ffe3cdfd594
0x7ffe3cdfd59c expected 0x7ffe3cdfd59c

I am ready to run the same test with your configure string

kostrzewa commented 6 years ago

Thanks for testing this! So I've done the same test and arrive at the same conclusion. Thus, it is probably unlikely that we're dealing with a padding problem as I assumed initially.

I compile with

/marconi/home/userexternal/bkostrze/code/tmLQCD.qphix_devel_hmc_2f_tmclover/configure \
  --host=x86_64-linux-gnu \
  --with-limedir=$HOME/local/libs/lime \
  --with-lemondir=$HOME/local/libs/lemon \
  --with-mpidimension=4 --enable-omp --enable-mpi \
  --disable-sse2 --disable-sse3 \
  --with-lapack="-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -ldl" \
  --enable-halfspinor --enable-gaugecopy \
  --enable-alignment=64 \
  --enable-qphix-soalen=4 \
  --with-DDalphaAMG=${HOME}/code/DDalphaAMG.nd \
  --with-qphixdir=$HOME/local/knl/qphix.jinja.2flavour \
  --with-qmpdir=$HOME/local/libs/qmp \
  CC=mpiicc CXX=mpiicpc F77=ifort \
  CFLAGS="-O3 -std=c99 -qopenmp -xMIC-AVX512 -fma -debug full" \
  CXXFLAGS="-O3 -std=c++11 -qopenmp -xMIC-AVX512 -fma -g -debug full" \
  LDFLAGS="-qopenmp"

and tested like this:

  if(g_proc_id==0){
    ALIGN halfspinor test[3];
    ALIGN halfspinor32 test32[3];

    halfspinor* test_d = (halfspinor*) aligned_malloc(3*sizeof(halfspinor));
    halfspinor32* test32_d = (halfspinor32*) aligned_malloc(3*sizeof(halfspinor32));

    double* testp = (double*)&(test[0].s0.c0);
    float* test32p = (float*)&(test32[0].s0.c0);

    double* testp_d = (double*)&(test_d[0].s0.c0);
    float* test32p_d = (float*)&(test32_d[0].s0.c0);

    printf("struct %p ? %p flat pointer\n", &(test[0].s0.c0), testp +  0);
    printf("struct %p ? %p flat pointer\n", &(test[0].s0.c1), testp +  2);
    printf("struct %p ? %p flat pointer\n", &(test[0].s0.c2), testp +  4);
    printf("struct %p ? %p flat pointer\n", &(test[0].s1.c0), testp +  6);
    printf("struct %p ? %p flat pointer\n", &(test[0].s1.c1), testp +  8);
    printf("struct %p ? %p flat pointer\n", &(test[0].s1.c2), testp + 10);
    printf("struct %p ? %p flat pointer\n", &(test[1].s0.c0), testp + 12);
    printf("struct %p ? %p flat pointer\n", &(test[1].s0.c1), testp + 14);
    printf("struct %p ? %p flat pointer\n", &(test[1].s0.c2), testp + 16);
    printf("struct %p ? %p flat pointer\n", &(test[1].s1.c0), testp + 18);
    printf("struct %p ? %p flat pointer\n", &(test[1].s1.c1), testp + 20);
    printf("struct %p ? %p flat pointer\n", &(test[1].s1.c2), testp + 22);

    printf("struct %p ? %p flat pointer\n", &(test32[0].s0.c0), test32p +  0);
    printf("struct %p ? %p flat pointer\n", &(test32[0].s0.c1), test32p +  2);
    printf("struct %p ? %p flat pointer\n", &(test32[0].s0.c2), test32p +  4);
    printf("struct %p ? %p flat pointer\n", &(test32[0].s1.c0), test32p +  6);
    printf("struct %p ? %p flat pointer\n", &(test32[0].s1.c1), test32p +  8);
    printf("struct %p ? %p flat pointer\n", &(test32[0].s1.c2), test32p + 10);
    printf("struct %p ? %p flat pointer\n", &(test32[1].s0.c0), test32p + 12);
    printf("struct %p ? %p flat pointer\n", &(test32[1].s0.c1), test32p + 14);
    printf("struct %p ? %p flat pointer\n", &(test32[1].s0.c2), test32p + 16);
    printf("struct %p ? %p flat pointer\n", &(test32[1].s1.c0), test32p + 18);
    printf("struct %p ? %p flat pointer\n", &(test32[1].s1.c1), test32p + 20);
    printf("struct %p ? %p flat pointer\n", &(test32[1].s1.c2), test32p + 22);

    printf("struct %p ? %p flat pointer\n", &(test_d[0].s0.c0), testp_d +  0);
    printf("struct %p ? %p flat pointer\n", &(test_d[0].s0.c1), testp_d +  2);
    printf("struct %p ? %p flat pointer\n", &(test_d[0].s0.c2), testp_d +  4);
    printf("struct %p ? %p flat pointer\n", &(test_d[0].s1.c0), testp_d +  6);
    printf("struct %p ? %p flat pointer\n", &(test_d[0].s1.c1), testp_d +  8);
    printf("struct %p ? %p flat pointer\n", &(test_d[0].s1.c2), testp_d + 10);
    printf("struct %p ? %p flat pointer\n", &(test_d[1].s0.c0), testp_d + 12);
    printf("struct %p ? %p flat pointer\n", &(test_d[1].s0.c1), testp_d + 14);
    printf("struct %p ? %p flat pointer\n", &(test_d[1].s0.c2), testp_d + 16);
    printf("struct %p ? %p flat pointer\n", &(test_d[1].s1.c0), testp_d + 18);
    printf("struct %p ? %p flat pointer\n", &(test_d[1].s1.c1), testp_d + 20);
    printf("struct %p ? %p flat pointer\n", &(test_d[1].s1.c2), testp_d + 22);

    printf("struct %p ? %p flat pointer\n", &(test32_d[0].s0.c0), test32p_d +  0);
    printf("struct %p ? %p flat pointer\n", &(test32_d[0].s0.c1), test32p_d +  2);
    printf("struct %p ? %p flat pointer\n", &(test32_d[0].s0.c2), test32p_d +  4);
    printf("struct %p ? %p flat pointer\n", &(test32_d[0].s1.c0), test32p_d +  6);
    printf("struct %p ? %p flat pointer\n", &(test32_d[0].s1.c1), test32p_d +  8);
    printf("struct %p ? %p flat pointer\n", &(test32_d[0].s1.c2), test32p_d + 10);
    printf("struct %p ? %p flat pointer\n", &(test32_d[1].s0.c0), test32p_d + 12);
    printf("struct %p ? %p flat pointer\n", &(test32_d[1].s0.c1), test32p_d + 14);
    printf("struct %p ? %p flat pointer\n", &(test32_d[1].s0.c2), test32p_d + 16);
    printf("struct %p ? %p flat pointer\n", &(test32_d[1].s1.c0), test32p_d + 18);
    printf("struct %p ? %p flat pointer\n", &(test32_d[1].s1.c1), test32p_d + 20);
    printf("struct %p ? %p flat pointer\n", &(test32_d[1].s1.c2), test32p_d + 22);
    aligned_free(test_d);
    aligned_free(test32_d);

  }

with the same results as you got.

The next candidate for the origin of this problem is alignment. The code is unfortunately very patchy when it comes to properly doing alignment everywhere, so maybe the next thing that should be tried is to use aligned_malloc consistently everywhere (or, at least everywhere relevant to this issue for now).

sunpho84 commented 6 years ago

The simplest explanation I can imagine concerning alignment is that the compiler is tricked into using aligned reads when it should not, but that should raise an exception. Is no part of the code using AVX-512 intrinsics explicitly? If not, the compiler should take care of all the needed checks before using aligned reads, so even using malloc in place of aligned_malloc should have no impact (besides performance).

kostrzewa commented 6 years ago

No, we're not using AVX512 intrinsics anywhere. From what I can see, the problem occurs in the halfspinor communication because results vary with the number of processes. On the QPhiX side I also found a crasher which seems to originate in the MPI library, but I need to investigate further. The two might be related, but I can't tell right now. The crasher occurs ONLY in the HMC and only when there's a clover term...

kostrzewa commented 6 years ago

So this crasher is a weird one indeed. When I compile with halfspinor enabled, there is NO crash in the QPhiX interface but the tmLQCD operator does the wrong thing. When I compile with it disabled, the tmLQCD operators work but the QPhiX interface crashes in the HMC.

Do you happen to have an AVX512-capable box that I could run some tests on? This is truly infuriating...

kostrzewa commented 6 years ago

Okay, I think I've resolved the crasher at least. QMP needs to be compiled with -xMIC-AVX512; otherwise, wrong assumptions about alignment seem to be made somewhere and one gets negative addresses for next pointers during MPI cleanup (deep inside the Intel MPI library...). Luckily I got TotalView running late this afternoon and recompiled QMP on a hunch.

Another issue I had was that I did not allocate halos for fields which I used for checking the residual, something which didn't cause problems with AVX2, either on Marconi or on my laptop, likely because things still ended up working correctly by accident. When we redesign tmLQCD, it would be great if field halos were separate memory regions within some kind of field object, such that an operation on a field can simply allocate a halo if it requires one, keeping in mind all the subtleties involving overlapping computation and comms.

Unfortunately, the halfspinor problem remains and could be caused by any number of things, but at least I have a working version of tmLQCD which seems able to take advantage of AVX512 where it matters. This means we can't currently combine the tmLQCD mixed solver and the QPhiX solver (there are no 32-bit fullspinor operators, and those would be too slow to be practical anyway), which might have been beneficial for very heavy monomials, where the packing/unpacking overheads can be significant and cancel out the advantage of better vectorisation. However, it seems we don't really lose anything by not being able to do this. Also, I believe the clover and gauge packing routines have substantial potential for improvement.

sunpho84 commented 6 years ago

Another issue I had was that I did not allocate halos for fields which I used for checking the residual, something which didn't cause problems with AVX2, either on Marconi or on my laptop, likely because things still ended up working correctly by accident. When we redesign tmLQCD, it would be great if field halos were separate memory regions within some kind of field object, such that an operation on a field can simply allocate a halo if it requires one, keeping in mind all the subtleties involving overlapping computation and comms.

Allocation (and consistency) of halo can be traced in the vector itself, or could be stored in a lookup table to be consulted by the halo updating routine.

Which redesign do you have in mind?

Unfortunately, the halfspinor problem remains and could be caused by any number of things, but at least I have a working version of tmLQCD which seems to be able to take advantage of AVX512 where it matters.

If I understand, the crash you solved was unrelated, right?

kostrzewa commented 6 years ago

Which redesign do you have in mind?

https://github.com/etmc/tmLQCD/issues/365

If I understand, the crash you solved was unrelated, right?

Well, it is related insofar as having to do with apparent AVX512 consistency issues that we seem to be facing... Unfortunately, however, the fix for the crasher has not fixed the halfspinor issue, which also originates in the communication as far as I can tell on account of it being dependent on the parallelisation.

kostrzewa commented 5 years ago

@sbacchio @Finkenrath Okay, it's time to get back to this issue because it is of relevance to upcoming simulations on SuperMUC-NG.

It is essential that somebody tests the halfspinor Dirac operator with AVX512 to see if the issue persists. If it does, we probably need to resolve it in order to have good performance for the physical point runs since using the halfspinor operator via CGMMSND is currently the most efficient way to produce initial guesses for the heavy doublet rational approximation when run with DDalphaAMG as a solver for the smallest shifts.

Note that if this issue is still a problem, it will not be possible to use the halfspinor operator on SuperMUC-NG when the code is compiled for AVX512, which might be a substantial performance hit. Just keep this on your radar please.