JeffersonLab / qphix-codegen

Code Generator for the QPhiX library, Wilson Fermions
http://jeffersonlab.github.io/qphix-codegen/

Replace KNC instructions with proper KNL instructions #5

Closed: martin-ueding closed this issue 7 years ago

martin-ueding commented 7 years ago

The Travis CI testing infrastructure provides Ubuntu with GCC only; there is no Intel Compiler. Also, it only has AVX, so I use -march=haswell and -march=knl in order to at least compile the code for the other ISAs that we target. However, the AVX512 code does not compile because _mm512_mask_permute4f128_ps is not available in GCC.

This can be seen easily by compiling the following test program with icpc and g++:

#include <immintrin.h>

// Minimal reproducer: icpc accepts the KNC intrinsic, g++ rejects it.
int main(int argc, char **argv) {
    __m512 src;
    __mmask16 k;
    __m512 a;
    _MM_PERM_ENUM imm8;
    _mm512_mask_permute4f128_ps(src, k, a, imm8);
}

If some alternative to this instruction could be found that GCC understands, the AVX512 code could be compiled automatically on Travis CI.

It might turn out that there are more instructions missing; perhaps the AVX512 support is not complete in GCC 6.x? In that case, it does not make much sense to change anything; waiting for GCC to catch up would be the better plan.

ddkalamk commented 7 years ago

The _mm512_mask_permute4f128_ps intrinsic should not be used in AVX512 code; it is a KNC intrinsic. On KNL we use _mm512_mask_shuffle_f32x4 instead, and that should be supported by GCC.
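
For illustration, a minimal sketch of the mapping (the wrapper name and the immediate value below are mine, not from QPhiX): _mm512_mask_shuffle_f32x4 selects 128-bit lanes from two sources, so passing the same vector twice reproduces the single-source lane permute of the KNC intrinsic.

#include <immintrin.h>

// Sketch only: the 2-bit-per-lane immediate encoding matches the KNC
// _MM_PERM_ENUM values, e.g. 0x4E (== _MM_PERM_BADC) swaps the two
// 256-bit halves. On KNC this would be
//   _mm512_mask_permute4f128_ps(src, k, a, _MM_PERM_BADC);
__m512 permute_lanes_ps(__m512 src, __mmask16 k, __m512 a) {
    return _mm512_mask_shuffle_f32x4(src, k, a, a, 0x4E);
}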

Could you provide more information where the compilation fails?

ddkalamk commented 7 years ago

Sorry, I found that I had made changes in my local copy that were never pushed to GitHub. I will create a patch to fix this.

ddkalamk commented 7 years ago

Can you please take a look at the changes made to inst_sp_vec16.cc in commit 865b7be? https://github.com/JeffersonLab/qphix-codegen/commit/865b7be34a0473b1de766be856eb631f6e1ea36f

Dhiraj

martin-ueding commented 7 years ago

It looks like new shuffle/permute instruction wrappers were added, but it does not look like anything was deleted there?

The commits are quite old, but they seem to merge with devel cleanly, at least. Shall I merge them into devel? Or is that just some untested experiment that should better not be merged?

ddkalamk commented 7 years ago

Please do not merge it into devel. Most of the commits add a new experimental layout (I call it the xyzt layout, as opposed to the current SOA layout). Though the code is tested for correctness using a microbenchmark, it requires a lot of rewriting in the QPhiX library to use this new layout. So it is better to keep it in a separate branch until we add support for the new layout in QPhiX.

kostrzewa commented 7 years ago

@ddkalamk thanks for the commit. We are following the strategy of extracting the modifications to the transpose and the new shuffle instructions for the AVX512 target architecture. These will be included with the modifications that we have been making to get qphix working for twisted mass and twisted clover quarks.

Since the devel branch as it is right now was still using _mm512_mask_permute4f128_ps, does this mean that it doesn't work correctly on KNL for certain edge cases?

martin-ueding commented 7 years ago

@kostrzewa and I have theorized that perhaps the Intel compiler knows how to translate that particular KNC instruction to a KNL instruction and does so without giving an error. If there was a warning, it was drowned out by all the #warning directives that are used to show which #ifdef branches are taken.

I have re-generated the kernels, and it seems that the KNC instruction has been properly replaced by the KNL one. It's still building; let's see how it goes.

martin-ueding commented 7 years ago

The changes so far seem to have reduced the occurrences, but the instruction is still there. The float kernels do not have it any more (apart from MIC), but the double kernels still do:

$ grep -rl _mm512_mask_permute4f128_ps generated | sort
generated/avx512/generated/clov_double_dslash_achimbdpsi_minus_body_double_double_v8_s4_12
generated/avx512/generated/clov_double_dslash_achimbdpsi_minus_body_double_double_v8_s4_18
generated/avx512/generated/clov_double_dslash_achimbdpsi_plus_body_double_double_v8_s4_12
generated/avx512/generated/clov_double_dslash_achimbdpsi_plus_body_double_double_v8_s4_18
generated/avx512/generated/clov_double_dslash_minus_body_double_double_v8_s4_12
generated/avx512/generated/clov_double_dslash_minus_body_double_double_v8_s4_18
generated/avx512/generated/clov_double_dslash_plus_body_double_double_v8_s4_12
generated/avx512/generated/clov_double_dslash_plus_body_double_double_v8_s4_18
generated/avx512/generated/dslash_achimbdpsi_minus_body_double_double_v8_s4_12
generated/avx512/generated/dslash_achimbdpsi_minus_body_double_double_v8_s4_18
generated/avx512/generated/dslash_achimbdpsi_plus_body_double_double_v8_s4_12
generated/avx512/generated/dslash_achimbdpsi_plus_body_double_double_v8_s4_18
generated/avx512/generated/dslash_minus_body_double_double_v8_s4_12
generated/avx512/generated/dslash_minus_body_double_double_v8_s4_18
generated/avx512/generated/dslash_plus_body_double_double_v8_s4_12
generated/avx512/generated/dslash_plus_body_double_double_v8_s4_18
generated/avx512/generated/tm_clov_double_dslash_achimbdpsi_minus_body_double_double_v8_s4_12
generated/avx512/generated/tm_clov_double_dslash_achimbdpsi_minus_body_double_double_v8_s4_18
generated/avx512/generated/tm_clov_double_dslash_achimbdpsi_plus_body_double_double_v8_s4_12
generated/avx512/generated/tm_clov_double_dslash_achimbdpsi_plus_body_double_double_v8_s4_18
generated/avx512/generated/tm_clov_double_dslash_minus_body_double_double_v8_s4_12
generated/avx512/generated/tm_clov_double_dslash_minus_body_double_double_v8_s4_18
generated/avx512/generated/tm_clov_double_dslash_plus_body_double_double_v8_s4_12
generated/avx512/generated/tm_clov_double_dslash_plus_body_double_double_v8_s4_18
generated/avx512/generated/tm_dslash_achimbdpsi_minus_body_double_double_v8_s4_12
generated/avx512/generated/tm_dslash_achimbdpsi_minus_body_double_double_v8_s4_18
generated/avx512/generated/tm_dslash_achimbdpsi_plus_body_double_double_v8_s4_12
generated/avx512/generated/tm_dslash_achimbdpsi_plus_body_double_double_v8_s4_18
generated/avx512/generated/tm_dslash_minus_body_double_double_v8_s4_12
generated/avx512/generated/tm_dslash_minus_body_double_double_v8_s4_18
generated/avx512/generated/tm_dslash_plus_body_double_double_v8_s4_12
generated/avx512/generated/tm_dslash_plus_body_double_double_v8_s4_18
generated/mic/generated/clov_double_dslash_achimbdpsi_minus_body_double_double_v8_s4_12
generated/mic/generated/clov_double_dslash_achimbdpsi_minus_body_double_double_v8_s4_18
generated/mic/generated/clov_double_dslash_achimbdpsi_plus_body_double_double_v8_s4_12
generated/mic/generated/clov_double_dslash_achimbdpsi_plus_body_double_double_v8_s4_18
generated/mic/generated/clov_double_dslash_minus_body_double_double_v8_s4_12
generated/mic/generated/clov_double_dslash_minus_body_double_double_v8_s4_18
generated/mic/generated/clov_double_dslash_plus_body_double_double_v8_s4_12
generated/mic/generated/clov_double_dslash_plus_body_double_double_v8_s4_18
generated/mic/generated/clov_float_dslash_achimbdpsi_minus_body_float_float_v16_s4_12
generated/mic/generated/clov_float_dslash_achimbdpsi_minus_body_float_float_v16_s4_18
generated/mic/generated/clov_float_dslash_achimbdpsi_minus_body_float_float_v16_s8_12
generated/mic/generated/clov_float_dslash_achimbdpsi_minus_body_float_float_v16_s8_18
generated/mic/generated/clov_float_dslash_achimbdpsi_plus_body_float_float_v16_s4_12
generated/mic/generated/clov_float_dslash_achimbdpsi_plus_body_float_float_v16_s4_18
generated/mic/generated/clov_float_dslash_achimbdpsi_plus_body_float_float_v16_s8_12
generated/mic/generated/clov_float_dslash_achimbdpsi_plus_body_float_float_v16_s8_18
generated/mic/generated/clov_float_dslash_minus_body_float_float_v16_s4_12
generated/mic/generated/clov_float_dslash_minus_body_float_float_v16_s4_18
generated/mic/generated/clov_float_dslash_minus_body_float_float_v16_s8_12
generated/mic/generated/clov_float_dslash_minus_body_float_float_v16_s8_18
generated/mic/generated/clov_float_dslash_plus_body_float_float_v16_s4_12
generated/mic/generated/clov_float_dslash_plus_body_float_float_v16_s4_18
generated/mic/generated/clov_float_dslash_plus_body_float_float_v16_s8_12
generated/mic/generated/clov_float_dslash_plus_body_float_float_v16_s8_18
generated/mic/generated/clov_half_dslash_achimbdpsi_minus_body_half_half_v16_s4_12
generated/mic/generated/clov_half_dslash_achimbdpsi_minus_body_half_half_v16_s4_18
generated/mic/generated/clov_half_dslash_achimbdpsi_minus_body_half_half_v16_s8_12
generated/mic/generated/clov_half_dslash_achimbdpsi_minus_body_half_half_v16_s8_18
generated/mic/generated/clov_half_dslash_achimbdpsi_plus_body_half_half_v16_s4_12
generated/mic/generated/clov_half_dslash_achimbdpsi_plus_body_half_half_v16_s4_18
generated/mic/generated/clov_half_dslash_achimbdpsi_plus_body_half_half_v16_s8_12
generated/mic/generated/clov_half_dslash_achimbdpsi_plus_body_half_half_v16_s8_18
generated/mic/generated/clov_half_dslash_minus_body_half_half_v16_s4_12
generated/mic/generated/clov_half_dslash_minus_body_half_half_v16_s4_18
generated/mic/generated/clov_half_dslash_minus_body_half_half_v16_s8_12
generated/mic/generated/clov_half_dslash_minus_body_half_half_v16_s8_18
generated/mic/generated/clov_half_dslash_plus_body_half_half_v16_s4_12
generated/mic/generated/clov_half_dslash_plus_body_half_half_v16_s4_18
generated/mic/generated/clov_half_dslash_plus_body_half_half_v16_s8_12
generated/mic/generated/clov_half_dslash_plus_body_half_half_v16_s8_18
generated/mic/generated/dslash_achimbdpsi_minus_body_double_double_v8_s4_12
generated/mic/generated/dslash_achimbdpsi_minus_body_double_double_v8_s4_18
generated/mic/generated/dslash_achimbdpsi_minus_body_float_float_v16_s4_12
generated/mic/generated/dslash_achimbdpsi_minus_body_float_float_v16_s4_18
generated/mic/generated/dslash_achimbdpsi_minus_body_float_float_v16_s8_12
generated/mic/generated/dslash_achimbdpsi_minus_body_float_float_v16_s8_18
generated/mic/generated/dslash_achimbdpsi_minus_body_half_half_v16_s4_12
generated/mic/generated/dslash_achimbdpsi_minus_body_half_half_v16_s4_18
generated/mic/generated/dslash_achimbdpsi_minus_body_half_half_v16_s8_12
generated/mic/generated/dslash_achimbdpsi_minus_body_half_half_v16_s8_18
generated/mic/generated/dslash_achimbdpsi_plus_body_double_double_v8_s4_12
generated/mic/generated/dslash_achimbdpsi_plus_body_double_double_v8_s4_18
generated/mic/generated/dslash_achimbdpsi_plus_body_float_float_v16_s4_12
generated/mic/generated/dslash_achimbdpsi_plus_body_float_float_v16_s4_18
generated/mic/generated/dslash_achimbdpsi_plus_body_float_float_v16_s8_12
generated/mic/generated/dslash_achimbdpsi_plus_body_float_float_v16_s8_18
generated/mic/generated/dslash_achimbdpsi_plus_body_half_half_v16_s4_12
generated/mic/generated/dslash_achimbdpsi_plus_body_half_half_v16_s4_18
generated/mic/generated/dslash_achimbdpsi_plus_body_half_half_v16_s8_12
generated/mic/generated/dslash_achimbdpsi_plus_body_half_half_v16_s8_18
generated/mic/generated/dslash_minus_body_double_double_v8_s4_12
generated/mic/generated/dslash_minus_body_double_double_v8_s4_18
generated/mic/generated/dslash_minus_body_float_float_v16_s4_12
generated/mic/generated/dslash_minus_body_float_float_v16_s4_18
generated/mic/generated/dslash_minus_body_float_float_v16_s8_12
generated/mic/generated/dslash_minus_body_float_float_v16_s8_18
generated/mic/generated/dslash_minus_body_half_half_v16_s4_12
generated/mic/generated/dslash_minus_body_half_half_v16_s4_18
generated/mic/generated/dslash_minus_body_half_half_v16_s8_12
generated/mic/generated/dslash_minus_body_half_half_v16_s8_18
generated/mic/generated/dslash_plus_body_double_double_v8_s4_12
generated/mic/generated/dslash_plus_body_double_double_v8_s4_18
generated/mic/generated/dslash_plus_body_float_float_v16_s4_12
generated/mic/generated/dslash_plus_body_float_float_v16_s4_18
generated/mic/generated/dslash_plus_body_float_float_v16_s8_12
generated/mic/generated/dslash_plus_body_float_float_v16_s8_18
generated/mic/generated/dslash_plus_body_half_half_v16_s4_12
generated/mic/generated/dslash_plus_body_half_half_v16_s4_18
generated/mic/generated/dslash_plus_body_half_half_v16_s8_12
generated/mic/generated/dslash_plus_body_half_half_v16_s8_18
generated/mic/generated/tm_clov_double_dslash_achimbdpsi_minus_body_double_double_v8_s4_12
generated/mic/generated/tm_clov_double_dslash_achimbdpsi_minus_body_double_double_v8_s4_18
generated/mic/generated/tm_clov_double_dslash_achimbdpsi_plus_body_double_double_v8_s4_12
generated/mic/generated/tm_clov_double_dslash_achimbdpsi_plus_body_double_double_v8_s4_18
generated/mic/generated/tm_clov_double_dslash_minus_body_double_double_v8_s4_12
generated/mic/generated/tm_clov_double_dslash_minus_body_double_double_v8_s4_18
generated/mic/generated/tm_clov_double_dslash_plus_body_double_double_v8_s4_12
generated/mic/generated/tm_clov_double_dslash_plus_body_double_double_v8_s4_18
generated/mic/generated/tm_clov_float_dslash_achimbdpsi_minus_body_float_float_v16_s4_12
generated/mic/generated/tm_clov_float_dslash_achimbdpsi_minus_body_float_float_v16_s4_18
generated/mic/generated/tm_clov_float_dslash_achimbdpsi_minus_body_float_float_v16_s8_12
generated/mic/generated/tm_clov_float_dslash_achimbdpsi_minus_body_float_float_v16_s8_18
generated/mic/generated/tm_clov_float_dslash_achimbdpsi_plus_body_float_float_v16_s4_12
generated/mic/generated/tm_clov_float_dslash_achimbdpsi_plus_body_float_float_v16_s4_18
generated/mic/generated/tm_clov_float_dslash_achimbdpsi_plus_body_float_float_v16_s8_12
generated/mic/generated/tm_clov_float_dslash_achimbdpsi_plus_body_float_float_v16_s8_18
generated/mic/generated/tm_clov_float_dslash_minus_body_float_float_v16_s4_12
generated/mic/generated/tm_clov_float_dslash_minus_body_float_float_v16_s4_18
generated/mic/generated/tm_clov_float_dslash_minus_body_float_float_v16_s8_12
generated/mic/generated/tm_clov_float_dslash_minus_body_float_float_v16_s8_18
generated/mic/generated/tm_clov_float_dslash_plus_body_float_float_v16_s4_12
generated/mic/generated/tm_clov_float_dslash_plus_body_float_float_v16_s4_18
generated/mic/generated/tm_clov_float_dslash_plus_body_float_float_v16_s8_12
generated/mic/generated/tm_clov_float_dslash_plus_body_float_float_v16_s8_18
generated/mic/generated/tm_clov_half_dslash_achimbdpsi_minus_body_half_half_v16_s4_12
generated/mic/generated/tm_clov_half_dslash_achimbdpsi_minus_body_half_half_v16_s4_18
generated/mic/generated/tm_clov_half_dslash_achimbdpsi_minus_body_half_half_v16_s8_12
generated/mic/generated/tm_clov_half_dslash_achimbdpsi_minus_body_half_half_v16_s8_18
generated/mic/generated/tm_clov_half_dslash_achimbdpsi_plus_body_half_half_v16_s4_12
generated/mic/generated/tm_clov_half_dslash_achimbdpsi_plus_body_half_half_v16_s4_18
generated/mic/generated/tm_clov_half_dslash_achimbdpsi_plus_body_half_half_v16_s8_12
generated/mic/generated/tm_clov_half_dslash_achimbdpsi_plus_body_half_half_v16_s8_18
generated/mic/generated/tm_clov_half_dslash_minus_body_half_half_v16_s4_12
generated/mic/generated/tm_clov_half_dslash_minus_body_half_half_v16_s4_18
generated/mic/generated/tm_clov_half_dslash_minus_body_half_half_v16_s8_12
generated/mic/generated/tm_clov_half_dslash_minus_body_half_half_v16_s8_18
generated/mic/generated/tm_clov_half_dslash_plus_body_half_half_v16_s4_12
generated/mic/generated/tm_clov_half_dslash_plus_body_half_half_v16_s4_18
generated/mic/generated/tm_clov_half_dslash_plus_body_half_half_v16_s8_12
generated/mic/generated/tm_clov_half_dslash_plus_body_half_half_v16_s8_18
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_double_double_v8_s4_12
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_double_double_v8_s4_18
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_float_float_v16_s4_12
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_float_float_v16_s4_18
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_float_float_v16_s8_12
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_float_float_v16_s8_18
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_half_half_v16_s4_12
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_half_half_v16_s4_18
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_half_half_v16_s8_12
generated/mic/generated/tm_dslash_achimbdpsi_minus_body_half_half_v16_s8_18
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_double_double_v8_s4_12
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_double_double_v8_s4_18
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_float_float_v16_s4_12
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_float_float_v16_s4_18
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_float_float_v16_s8_12
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_float_float_v16_s8_18
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_half_half_v16_s4_12
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_half_half_v16_s4_18
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_half_half_v16_s8_12
generated/mic/generated/tm_dslash_achimbdpsi_plus_body_half_half_v16_s8_18
generated/mic/generated/tm_dslash_minus_body_double_double_v8_s4_12
generated/mic/generated/tm_dslash_minus_body_double_double_v8_s4_18
generated/mic/generated/tm_dslash_minus_body_float_float_v16_s4_12
generated/mic/generated/tm_dslash_minus_body_float_float_v16_s4_18
generated/mic/generated/tm_dslash_minus_body_float_float_v16_s8_12
generated/mic/generated/tm_dslash_minus_body_float_float_v16_s8_18
generated/mic/generated/tm_dslash_minus_body_half_half_v16_s4_12
generated/mic/generated/tm_dslash_minus_body_half_half_v16_s4_18
generated/mic/generated/tm_dslash_minus_body_half_half_v16_s8_12
generated/mic/generated/tm_dslash_minus_body_half_half_v16_s8_18
generated/mic/generated/tm_dslash_plus_body_double_double_v8_s4_12
generated/mic/generated/tm_dslash_plus_body_double_double_v8_s4_18
generated/mic/generated/tm_dslash_plus_body_float_float_v16_s4_12
generated/mic/generated/tm_dslash_plus_body_float_float_v16_s4_18
generated/mic/generated/tm_dslash_plus_body_float_float_v16_s8_12
generated/mic/generated/tm_dslash_plus_body_float_float_v16_s8_18
generated/mic/generated/tm_dslash_plus_body_half_half_v16_s4_12
generated/mic/generated/tm_dslash_plus_body_half_half_v16_s4_18
generated/mic/generated/tm_dslash_plus_body_half_half_v16_s8_12
generated/mic/generated/tm_dslash_plus_body_half_half_v16_s8_18

I need to see whether there are more transposition functions that need to be changed.

kostrzewa commented 7 years ago

@ddkalamk @bjoo Since we have found so many more KNC intrinsics being used for AVX512 (rather than MIC), it is rather urgent for us to understand where to go from here. Is qphix/devel with kernels generated using qphix-codegen/devel currently broken on KNL as a result of these?

bjoo commented 7 years ago

Hi Bartosz, things seem to work fine on KNL with the Intel toolchain, though I agree we should work through and fix these.

Best, B

ddkalamk commented 7 years ago

This is the result of porting KNC code to KNL and fixing compiler issues as we see them. We never tried using GCC, so we didn't even realize there was any issue. If you get a list of the intrinsics that are used but not supported by GCC, I can try to find replacements for them. Or we could have a wrapper that translates KNC intrinsics to KNL within the QPhiX library.
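
A minimal sketch of what such a wrapper could look like (hypothetical; the qphix_ name is made up here, and __MIC__ is assumed to be the KNC predefine):

#include <immintrin.h>

// Hypothetical compatibility shim: a macro keeps the immediate a
// compile-time constant, which the shuffle intrinsic requires.
#ifdef __MIC__ /* KNC */
#define qphix_mask_permute4f128_ps(src, k, a, imm) \
    _mm512_mask_permute4f128_ps((src), (k), (a), (imm))
#else /* AVX-512F: same immediate encoding, second source duplicated */
#define qphix_mask_permute4f128_ps(src, k, a, imm) \
    _mm512_mask_shuffle_f32x4((src), (k), (a), (a), (imm))
#endif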

azrael417 commented 7 years ago

In my opinion (Balint and Martin might see it differently), we should just rip the KNC stuff out completely and discontinue support.

martin-ueding commented 7 years ago

The systems that my group has quota on are either Haswell, Broadwell, KNL, or Blue Gene/Q. Also, I have heard from my colleagues that KNC is essentially dead. So from my limited view, KNC support could be dropped.

I have implemented @ddkalamk's changes for float, and the current issues are with _mm512_mask_permute4f128_ps. I have tried to just use _mm512_mask_shuffle_f64x2 analogously to the float case, but the mask is different. So it would be great if you, @ddkalamk, could send me a pull request for that particular code to my fork of qphix-codegen.
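
For reference, my reading of why the mask differs (a sketch under that assumption, not the actual patch): the double variant writes 8 lanes, so its write mask has one bit per double, while the float masks have one bit per float.

#include <immintrin.h>

// Sketch only: lane permute for double. The immediate uses the same
// 2-bit-per-128-bit-lane encoding as the float version, but the write
// mask is __mmask8 (8 doubles) instead of __mmask16 (16 floats), so
// the float masks cannot be reused verbatim.
__m512d permute_lanes_pd(__m512d src, __mmask8 k, __m512d a) {
    return _mm512_mask_shuffle_f64x2(src, k, a, a, 0x4E);
}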

ddkalamk commented 7 years ago

Can you quickly try this attached patch and see if it works? I didn’t get a chance to try it out.

kostrzewa commented 7 years ago

@ddkalamk your attachment did not make it into the discussion (it was sent via e-mail). Would you perhaps be able to just prepare a branch here on GitHub, based on the devel branch, with the necessary changes?

martin-ueding commented 7 years ago

There is a pull request on my fork; I'm in the process of testing it.

https://github.com/martin-ueding/qphix-codegen/pull/1

kostrzewa commented 7 years ago

I see, cool!

martin-ueding commented 7 years ago

Yesterday I just copied the float version over to double, so the 2×2 transpose was already fixed. Interestingly, the 4×4 transpose is only used with veclen = 8 and soalen = 2, which is not generated; there are no kernels which actually use that instruction.

The problem that I face now is that the parameter types of this function differ. The Intel Intrinsics Guide gives the following, with a void *:

void _mm512_stream_pd (void* mem_addr, __m512d a)

On my Fedora 25 system, it is defined with a double * by GCC:

extern __inline void
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_stream_pd (double *__P, __m512d __A)
{
  __builtin_ia32_movntpd512 (__P, (__v8df) __A);
}

Clang has the same signature:

static __inline__ void __DEFAULT_FN_ATTRS
_mm512_stream_pd (double *__P, __m512d __A)
{
  __builtin_nontemporal_store((__v8df)__A, (__v8df*)__P);
}

I compile with the following, which includes -march=knl:

g++ -DHAVE_CONFIG_H -I. -I/home/mu/Projekte/qphix/tests -I../include/qphix    -I/home/mu/Projekte/qphix/include -I../include -O2 --std=c++11 -fopenmp -fPIC -I/usr/include/libxml2 -O2 --std=c++11 -fopenmp -g -fPIC -march=knl -MT timeDslashNoQDP.o -MD -MP -MF .deps/timeDslashNoQDP.Tpo -c -o timeDslashNoQDP.o /home/mu/Projekte/qphix/tests/timeDslashNoQDP.cc

And then it will fail because the conversion is not allowed without -fpermissive:

In file included from /home/mu/Projekte/qphix/include/qphix/avx512/dslash_avx512_complete_specialization.h:334:0,
                 from /home/mu/Projekte/qphix/include/qphix/dslash_generated.h:21,
                 from /home/mu/Projekte/qphix/include/qphix/dslash_body.h:13,
                 from /home/mu/Projekte/qphix/include/qphix/dslash_def.h:175,
                 from /home/mu/Projekte/qphix/include/qphix/wilson.h:5,
                 from /home/mu/Projekte/qphix/tests/timeDslashNoQDP.cc:3:
/home/mu/Projekte/qphix/include/qphix/avx512/generated/dslash_plus_body_double_double_v8_s4_12: In function 'void QPhiX::dslash_plus_vec(const typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, const typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, const typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, const typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, const typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, const typename QPhiX::Geometry<FT, veclen, soalen, compress>::SU3MatrixBlock*, const int*, const int*, const int*, const int*, const int*, const int*, int, int, int, int, int, const int*, const typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, const typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, const typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock*, const unsigned int*, FT, FT, FT) [with FT = double; int veclen = 8; int soalen = 4; bool compress12 = true; typename QPhiX::Geometry<FT, veclen, soalen, compress>::FourSpinorBlock = double [3][4][2][4]; typename QPhiX::Geometry<FT, veclen, soalen, compress>::SU3MatrixBlock = double [8][2][3][2][8]]':
/home/mu/Projekte/qphix/include/qphix/avx512/generated/dslash_plus_body_double_double_v8_s4_12:3523:18: error: invalid conversion from 'void*' to 'double*' [-fpermissive]
 _mm512_stream_pd((void*)(((*oBase)[0][0][0] + offs[0])+0),tmp_1_re);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/6.3.1/include/immintrin.h:45:0,
                 from /home/mu/Projekte/qphix/include/qphix/dslash_utils.h:22,
                 from /home/mu/Projekte/qphix/include/qphix/geometry.h:4,
                 from /home/mu/Projekte/qphix/include/qphix/linearOp.h:5,
                 from /home/mu/Projekte/qphix/include/qphix/wilson.h:4,
                 from /home/mu/Projekte/qphix/tests/timeDslashNoQDP.cc:3:
/usr/lib/gcc/x86_64-redhat-linux/6.3.1/include/avx512fintrin.h:8024:1: note:   initializing argument 1 of 'void _mm512_stream_pd(double*, __m512d)'
 _mm512_stream_pd (double *__P, __m512d __A)
 ^~~~~~~~~~~~~~~~

With a quick test I noticed that the conversion from double * to void * is allowed, but the conversion from void * to double * is not:

void f_double(double *x) {}
void f_void(void *x) {}

int main() {
    void *void_ptr;
    double *double_ptr;

    f_void(double_ptr); // Works
    f_double(void_ptr); // Fails
}
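
Concretely, the fix for the generated kernels is to drop the cast; the line from the error message above would then read:

_mm512_stream_pd(((*oBase)[0][0][0] + offs[0]) + 0, tmp_1_re);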

So I have removed all the (void *) casts in the code generator. Then we are off to the next error:

error: '_MM_UPCONV_PS_NONE' was not declared in this scope
beta_vec = _mm512_extload_ps((&coeff_s), _MM_UPCONV_PS_NONE, _MM_BROADCAST_1X16, _MM_HINT_NONE);

Looking at the Intrinsics Guide for _mm512_extload_ps, that is another KNC instruction! Can we please find a replacement for this as well?
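
For what it is worth, a hedged guess at a replacement (my function names, not verified against the generator): this extload call only broadcasts a single float to all 16 lanes without up-conversion, which plain AVX-512F can express directly.

#include <immintrin.h>

// Possible AVX-512F replacements (sketch): _mm512_extload_ps with
// _MM_UPCONV_PS_NONE and _MM_BROADCAST_1X16 just broadcasts one float.
__m512 broadcast_coeff(const float *coeff_s) {
    return _mm512_set1_ps(*coeff_s);
}

// Masked variant, in case _mm512_mask_extload_ps is used as well:
__m512 broadcast_coeff_masked(__m512 src, __mmask16 k, const float *coeff_s) {
    return _mm512_mask_broadcastss_ps(src, k, _mm_load_ss(coeff_s));
}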

martin-ueding commented 7 years ago

Assuming that all intrinsic functions start with _mm512_, I have extracted every intrinsic used in the currently generated AVX512 kernels. Then I looked through the Intrinsics Guide and found that the masked version, too, is only supported on KNC. So there are two instructions that need fixing:

- `_mm512_extload_ps`
- `_mm512_mask_extload_ps`

And in case you want to try it out yourself, this is the script:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Copyright © 2017 Martin Ueding <dev@martin-ueding.de>
# Licensed under the MIT license.

import argparse
import re

def main():
    options = _parse_args()

    pattern = re.compile(r'(_mm512_[^(]+)')

    matches = set()

    for filename in options.filename:
        print('Processing', filename)
        with open(filename) as f:
            for line in f:
                m = pattern.findall(line)
                for elem in m:
                    matches.add(elem)

    for elem in sorted(matches):
        print('- `{}`'.format(elem))

def _parse_args():
    '''
    Parses the command line arguments.

    :return: Namespace with arguments.
    :rtype: Namespace
    '''
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('filename', nargs='+')
    options = parser.parse_args()

    return options

if __name__ == '__main__':
    main()

ddkalamk commented 7 years ago

I attempted to fix this. Take a look at https://github.com/ddkalamk/qphix-codegen/commit/269d958b6dd04f7a20cc1c99b3c76683b96f3111

martin-ueding commented 7 years ago

Thanks for the patch, I have cherry-picked that as well. It looks like an odd bug that you fixed there; it seems it had always generated two instructions?

Now the first compile error is this:

error: '__mmask' was not declared in this scope
 __mmask accMask;
 ^~~~~~~

The only information that I could find is an article where the example lists __mmask16. What are those mask types?

ddkalamk commented 7 years ago

Apparently __mmask is deprecated even in icc; one needs to use __mmask16 instead. I fixed this in https://github.com/ddkalamk/qphix-codegen/commit/79016fcc75e46488601002105c13121ca276d2c3
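
For reference (my understanding, not specific to QPhiX): the AVX-512 mask types are plain integer typedefs with one bit per vector lane, for example:

#include <immintrin.h>

// __mmask16 masks 16 float lanes, __mmask8 masks 8 double lanes;
// bit i set means lane i is written, otherwise src is kept.
__m512 blend_upper_half(__m512 src, __m512 a) {
    const __mmask16 k = 0xFF00;           // take lanes 8..15 from a
    return _mm512_mask_mov_ps(src, k, a); // lanes 0..7 keep src
}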

martin-ueding commented 7 years ago

That seems to have fixed that error; we are now at the next one. Take the following program:

#include <immintrin.h>

int main(int argc, char **argv) {
    __m512 a = _mm512_undefined();
}

For some reason, that does not compile:

g++ -Wall -Wpedantic --std=c++11 -march=knl mm512_undefined.cpp 
mm512_undefined.cpp: In function 'int main(int, char**)':
mm512_undefined.cpp:7:33: error: '_mm512_undefined' was not declared in this scope
     __m512 a = _mm512_undefined();
                                 ^

According to the intrinsics guide, there are four variants of this function:

__m512 _mm512_undefined (void)
__m512i _mm512_undefined_epi32 ()
__m512d _mm512_undefined_pd ()
__m512 _mm512_undefined_ps ()

Looking through /usr/lib/gcc/x86_64-redhat-linux/6.3.1/include/ with grep, I could only find the latter three variants, not the first one. In avx512fintrin.h they are defined like this:

extern __inline __m512
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_undefined_ps (void)
{
  __m512 __Y = __Y;
  return __Y;
}

extern __inline __m512d
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_undefined_pd (void)
{
  __m512d __Y = __Y;
  return __Y;
}

extern __inline __m512i
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_undefined_epi32 (void)
{
  __m512i __Y = __Y;
  return __Y;
}

#define _mm512_undefined_si512 _mm512_undefined_epi32

I have replaced all occurrences of _mm512_undefined with _mm512_undefined_ps.

Now it builds with GCC on my laptop. Let's see how it goes on Travis CI.

martin-ueding commented 7 years ago

I just wanted to test whether the KNL version still compiles on KNL with the Intel Compiler on the system in Bologna. But their Intel license server is currently down. And today is a national holiday in Italy, so no support on their side.

If any of you could compile the travis-avx512 branch on some other KNL machine with Intel C++ and run all the tests, that would be great.

bjoo commented 7 years ago

Hi Martin, I misread this and thought you meant qphix (rather than qphix-codegen). I did compile qphix and ran the tests. t_twm_dslash thought it needed to have VECLEN=1 for some reason. t_twm_clover ran. Both t_clov_dslash and t_twm_clover ran until they hit SOA=16 (which the L=16 problem can't do).

I am about to head out to catch a flight. I may be out of the loop more or less apart from a light Slack presence until Friday.

Best, B


martin-ueding commented 7 years ago

I think that should be resolved now.