Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs
BSD 2-Clause "Simplified" License

AVX Support #27

Open Darwin2011 opened 7 years ago

Darwin2011 commented 7 years ago

Hi, @Maratyszcza

I want to add NNPACK support for the Intel SNB and IVB platforms. I hope to reuse your current PeachPy kernels and use macros in the kernels as follows:

    #if defined(AVX)
    // code for AVX
    #elif defined(AVX2)
    // code for AVX2
    #else
    // common code for scalar instructions...
    #endif

May I know whether PeachPy supports this kind of preprocessing? Thanks.

Best Regards

Maratyszcza commented 7 years ago

PeachPy does support this kind of feature, but it wouldn't help with the porting. The AVX2+FMA3 kernels in NNPACK assume that an FMA has the same cost as a floating-point multiplication or addition. As one example, they fuse multiplications by constants into the subsequent FFT butterflies, i.e.

    b *= c
    a, b = a + b, a - b

becomes

    a, b = fma(b, c, a), fnma(b, c, a)

Note that this transformation increases the number of FLOPs: if you "lower" each FMA back into a multiplication + addition (e.g. via feature checks), the resulting code does more work than the original butterfly and would be suboptimal. If you want to support NNPACK on AVX(1) processors, you'd better start with the PSIMD kernels and port them to PeachPy assembly for AVX1.
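To make the FLOP trade-off concrete, here is a minimal C sketch with AVX/FMA3 intrinsics (the helper names are hypothetical and this is only an illustration of the idea; NNPACK's real kernels are PeachPy assembly): the un-fused butterfly costs 3 operations (mul, add, sub), while the fused one costs 2 FMAs, i.e. 4 FLOPs, so lowering each FMA back to mul + add would give 4 instructions instead of the original 3.

    #include <immintrin.h>

    /* Butterfly with the constant multiplication kept separate:
     * 3 operations (mul, add, sub) -- the natural AVX1 formulation. */
    static inline void butterfly_mul(__m256 *a, __m256 *b, const __m256 c) {
        const __m256 bc    = _mm256_mul_ps(*b, c);   /* b * c        */
        const __m256 new_a = _mm256_add_ps(*a, bc);  /* a' = a + b*c */
        const __m256 new_b = _mm256_sub_ps(*a, bc);  /* b' = a - b*c */
        *a = new_a;
        *b = new_b;
    }

    /* Butterfly with the multiplication fused into the butterfly itself:
     * 2 instructions but 4 FLOPs, since b*c is effectively computed twice.
     * This wins only when an FMA costs the same as a plain mul or add. */
    static inline void butterfly_fma(__m256 *a, __m256 *b, const __m256 c) {
        const __m256 new_a = _mm256_fmadd_ps(*b, c, *a);  /* fma(b, c, a)  = a + b*c */
        const __m256 new_b = _mm256_fnmadd_ps(*b, c, *a); /* fnma(b, c, a) = a - b*c */
        *a = new_a;
        *b = new_b;
    }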

Maratyszcza commented 7 years ago

Also, to clarify: I currently have no plans for AVX support.

Darwin2011 commented 7 years ago

@Maratyszcza

Update two typos...

I can try to do this if you can give me some help. My thought is to port the PSIMD kernels to PeachPy for AVX, then use #ifdef-style conditionals in the PeachPy x86-64 kernels to select the instruction set at compile time.

One question: does your example about suboptimal code for b *= c; a, b = a + b, a - b refer to the PSIMD kernels or the PeachPy x86-64 kernels? I think it must come from PSIMD. If using PeachPy AVX assembly, there's no such issue.

I can write PeachPy assembly to support AVX, but I hope to use macros like #ifdef to merge the kernels for the different ISAs into one file and avoid redundant code.

I don't have much knowledge of your project yet. Please give me your comments and point out my mistakes.

Best Regards

kruus commented 7 years ago

Last summer I was porting some code to the NEC SX supercomputer and found that pretty decent performance was possible using almost all C code. I began with the PSIMD code, wrote an all-C version (changing the data layout, among other things), and got decently fast with icc (gcc did not do as well at SIMD-izing it). In the process (after much experimentation) I found that the only piece that really, really needed assembler on an Intel CPU was the transpose routine. This got me to a good initial implementation for a non-AVX (even non-Intel) chip.
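For reference, the 8x8 single-precision transpose has a standard AVX(1)-intrinsics formulation; the sketch below is the generic textbook version (name and in-register layout are hypothetical, not the actual NNPACK or SX code), and it needs only AVX1, no AVX2:

    #include <immintrin.h>

    /* Hypothetical helper: transpose an 8x8 tile of floats held in
     * 8 YMM registers, using the classic unpack/shuffle/permute ladder. */
    static inline void transpose8x8_ps(__m256 r[8]) {
        /* Stage 1: interleave adjacent rows at 32-bit granularity. */
        __m256 t0 = _mm256_unpacklo_ps(r[0], r[1]);
        __m256 t1 = _mm256_unpackhi_ps(r[0], r[1]);
        __m256 t2 = _mm256_unpacklo_ps(r[2], r[3]);
        __m256 t3 = _mm256_unpackhi_ps(r[2], r[3]);
        __m256 t4 = _mm256_unpacklo_ps(r[4], r[5]);
        __m256 t5 = _mm256_unpackhi_ps(r[4], r[5]);
        __m256 t6 = _mm256_unpacklo_ps(r[6], r[7]);
        __m256 t7 = _mm256_unpackhi_ps(r[6], r[7]);
        /* Stage 2: interleave at 64-bit granularity. */
        __m256 s0 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1, 0, 1, 0));
        __m256 s1 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3, 2, 3, 2));
        __m256 s2 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(1, 0, 1, 0));
        __m256 s3 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(3, 2, 3, 2));
        __m256 s4 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(1, 0, 1, 0));
        __m256 s5 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(3, 2, 3, 2));
        __m256 s6 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(1, 0, 1, 0));
        __m256 s7 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(3, 2, 3, 2));
        /* Stage 3: exchange 128-bit lanes between register pairs. */
        r[0] = _mm256_permute2f128_ps(s0, s4, 0x20);
        r[1] = _mm256_permute2f128_ps(s1, s5, 0x20);
        r[2] = _mm256_permute2f128_ps(s2, s6, 0x20);
        r[3] = _mm256_permute2f128_ps(s3, s7, 0x20);
        r[4] = _mm256_permute2f128_ps(s0, s4, 0x31);
        r[5] = _mm256_permute2f128_ps(s1, s5, 0x31);
        r[6] = _mm256_permute2f128_ps(s2, s6, 0x31);
        r[7] = _mm256_permute2f128_ps(s3, s7, 0x31);
    }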

I wrote a ton of C versions, ran them under icc, g++, and the SX compilers, and found which ones worked best on each system. Here's what I got for the kernel-y pieces with icc on my desktop:

Compiled with icc:

    ./winotime2 -a0v3 -i3r128c128 -k1l3m3n3 verbose=3 alg=0
    WinoData<6,3>(1[8x8],1[3x3], kd=1) 1×8x8 Krn: 1×3x3x1 Out: 1x6x6(pad 1) Parms: 27

    Routine                           | Time (µs, lo)
    ----------------------------------+------------------------
    nnp_iwt8x8_3x3_and_store__avx2    | 0.0399962  ±9.34035e-06
    nnp_kwt8x8_3x3_and_store__avx2    | 0.0310441  ±9.06492e-06
    nnp_owt8x8_3x3_with_bias__avx2    | 0.0380493  ±3.44596e-06
    and with the "cwt8x8" algorithm:  |
    nnp_iwt8x8_3x3_cwt_opt            | 0.0444428  ±8.52559e-06
    nnp_kwt8x8_3x3_cwt_opt            | 0.0383232  ±6.43686e-05
    nnp_owt8x8_3x3_with_bias_cwt_opt  | 0.0396578  ±4.19575e-06

Conclusion: with some delicate tinkering (I timed a couple of dozen "equivalent" C implementations and selected the fastest for each system/compiler) you can get timings not too far off from the NNPACK __avx2 routines.

Be warned: gcc did considerably worse than icc. I did look at the assembler output, but don't remember offhand where gcc was having issues.

For comparison, inference timings (again on a single tile) for the NNPACK driver, where alg is the nnp_convolution_algorithm enum:

    alg    | Time (µs, lo)
    -------+----------------------
    cwt8x8 | 3.22182  ±0.001359
    wt8x8  | 3.11652  ±0.00133278
    ft8x8  | 3.07279  ±0.00151075

Here cwt8x8 was my final almost-all-C-code version of wt8x8. The nice thing is that to port to some other instruction set, you might do well just porting the 8x8 transpose. The ugly thing is that the implementation is full of ifdefs for the SX vs. gcc vs. icc compilers :(

(I wonder whether SIMD-izing across tiles would make porting even easier? I need to read "Deep Tensor Convolution on Multicores" (Budden et al.) more carefully, I think. https://arxiv.org/abs/1611.06565)