VcDevel / Vc

SIMD Vector Classes for C++
BSD 3-Clause "New" or "Revised" License

Implement gathers with AVX2 intrinsics #32

Closed: mattkretz closed this issue 7 years ago

mattkretz commented 9 years ago

Reference: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gather&techs=MMX,SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,FMA

mattkretz commented 7 years ago

The place to implement it starts at https://github.com/VcDevel/Vc/blob/master/avx/vector.tcc#L390. Probably start with overloads, such as:

template <>
inline void AVX2::float_v::gatherImplementation(const float *mem, const IndexType &indexes)
{
  const auto iv = simd_cast<AVX2::int_v>(indexes);
  d.v() = _mm256_i32gather_ps (mem, iv.d.v(), sizeof(float));
}
kfjahnke commented 7 years ago

Okay, thanks, this gives me the idea. I'll see how far I get and if I get stuck I'll get in touch again.

kfjahnke commented 7 years ago

I forced my way in doing this, based on your initial hint:

    #ifdef Vc_IMPL_AVX2
    template <>
    template <>
    inline void AVX2::float_v::gatherImplementation
    (const float *mem, const AVX2::int_v &indexes)
    {
    //   don't need this, have int_v already:
    //   const auto iv = simd_cast<AVX2::int_v>(indexes);
    //   can't do this because indexes.d is protected:
    //   d.v() = _mm256_i32gather_ps (mem, indexes.d.v(), sizeof(float));
    //   suppose this does the trick:
      d.v() = _mm256_i32gather_ps (mem, indexes.data(), sizeof(float));
    }

    template <>
    template <>
    inline void AVX2::float_v::gatherImplementation
    (const float *mem, const SimdArray<int,AVX2::float_v::Size> &indexes)
    {
    // simd_cast doesn't work here:
    // const auto iv = simd_cast<AVX2::int_v>(indexes);
    // so instead doing this, but compiler asks for simd_cast instead
    // construction is dodgy, though on AVX2 the sizes do match
      AVX2::int_v iv ( indexes ) ;
    // would feed indexes.data() as second argument, but no joy:
    //    error: invalid use of incomplete type ‘const class Vc_1::SimdArray<int, 8ul,
    //    Vc_1::Vector<int, Vc_1::VectorAbi::Avx>, 8ul>’
    // meaning of which I don't get, so stick with iv.data()
      d.v() = _mm256_i32gather_ps (mem, iv.data(), sizeof(float));
    }

This code runs, but of course I'm not happy with the second overload using a SimdArray for the indexes, which is what I need for my code. With this code in place, I don't get any speedup for my use case, so the data are probably too widely scattered.

Please tell me if the first one is fine and propose something for the second one.

Kay

kfjahnke commented 7 years ago

While I didn't get a speedup for my prefiltering code, it looks like the evaluation code profits from using the intrinsic: I'm getting some 10-15% rendering time decrease here, which looks great.

Kay

mattkretz commented 7 years ago

Using .data() is the right solution, yes. For the SimdArray overload, you could try a generic one that forwards to int_v:

template <typename T>
template <>
inline void Vector<T, VectorAbi::Avx>::gatherImplementation(const T *mem,
    const SimdArray<int, AVX2::Vector<T>::Size> &indexes)
{
  gatherImplementation(mem, simd_cast<AVX2::int_v>(indexes));
}

That'll work correctly for T = {double, float, int, uint}. For short and ushort you'll have to overload again with:

template <>
template <>
inline void Vector<short, VectorAbi::Avx>::gatherImplementation(const short *mem,
    const SimdArray<int, AVX2::short_v::Size> &indexes)
{
  // don't know yet ;-)
}
kfjahnke commented 7 years ago

so I tried this in vector.tcc

template <>
template <typename T>
inline void Vector<T, VectorAbi::Avx>::gatherImplementation
  ( const T *mem,
    const SimdArray<int, AVX2::Vector<T>::Size> &indexes )
{
  gatherImplementation(mem, simd_cast<AVX2::int_v>(indexes));
}

together with this prototype in gatherinterface.h:

template <typename MT>
inline void gatherImplementation
  ( const MT *mem,
    const SimdArray<int, AVX2::Vector<MT>::Size> &indexes ) ;

but it looks like there's something wrong with the prototype, because I get:

    Vc/avx/vector.tcc:443:13: error: prototype for ‘void Vc_1::Vector<T, Vc_1::VectorAbi::Avx>::gatherImplementation(const T*, const Vc_1::SimdArray<int, Vc_1::Vector<T, Vc_1::VectorAbi::Avx>::Size>&)’ does not match any in class ‘Vc_1::Vector<T, Vc_1::VectorAbi::Avx>’
    ...
    gatherinterface.h:66:17: error: candidates are: template<typename T> template<typename MT> void Vc_1::Vector<T, Vc_1::VectorAbi::Avx>::gatherImplementation(const MT*, const Vc_1::SimdArray<int, Vc_1::Vector<MT, Vc_1::VectorAbi::Avx>::Size>&)
         inline void gatherImplementation
                     ^~~~~~~~

mattkretz commented 7 years ago

took me a while, but I think the issue is that T is used for the class template specialization, which is not the way it was declared. So try this:

    template <typename T>
    inline void Vector<T, VectorAbi::Avx>::gatherImplementation
      ( const T *mem,
        const SimdArray<int, AVX2::Vector<T>::Size> &indexes )
    {
      gatherImplementation(mem, simd_cast<AVX2::int_v>(indexes));
    }

together with this prototype in gatherinterface.h:

    inline void gatherImplementation
      ( const T *mem,
        const SimdArray<int, AVX2::Vector<T>::Size> &indexes ) ;
kfjahnke commented 7 years ago

On 19.05.2017 at 00:56, Matthias Kretz wrote:

took me a while, but I think the issue is that T is used for the class template specialization, which is not the way it was declared. So try this: [the code and prototype quoted in full above]
No, doesn't work:

... error: invalid use of incomplete type ‘struct Vc_1::VectorTraits<float, Vc_1::VectorAbi::Avx> ...

It sure goes on: make 2>&1 | wc gives 1771 6708 135585, i.e. 1771 lines of error output.

I feel we're getting nowhere and wasting time. Maybe you could try out what you suggest at your end? If I have a working template, I can take it from there, but if I run into errors straight away I am reduced to guesswork, and the code is very complex. I don't really want to understand all of it; you're in a much better position to make sense of the errors the compiler produces.

Kay

mattkretz commented 7 years ago

you're probably right. Can you please commit and push what you have, then I can take it from there.

kfjahnke commented 7 years ago

On 19.05.2017 at 10:40, Matthias Kretz wrote:

you're probably right. Can you please commit and push what you have, then I can take it from there.

I don't think I can push anything to the Vc repo, please correct me if I'm wrong.

Anyway, all I had so far is this prototype in gatherinterface.h:

 template <typename MT>
 inline void gatherImplementation
   ( const MT *mem,
     const SimdArray<int, AVX2::Vector<MT>::Size> &indexes ) ;

and this code in vector.tcc:

    #ifdef Vc_IMPL_AVX2
    template <>
    template <>
    inline void AVX2::float_v::gatherImplementation
    (const float *mem, const AVX2::int_v &indexes)
    {
    //   don't need this, have int_v already:
    //   const auto iv = simd_cast<AVX2::int_v>(indexes);
    //   can't do this because indexes.d is protected:
    //   d.v() = _mm256_i32gather_ps (mem, indexes.d.v(), sizeof(float));
    //   suppose this does the trick:
      d.v() = _mm256_i32gather_ps (mem, indexes.data(), sizeof(float));
    }

    template <>
    template <>
    inline void AVX2::float_v::gatherImplementation
    (const float *mem, const SimdArray<int,AVX2::float_v::Size> &indexes)
    {
    // simd_cast doesn't work here:
    // const auto iv = simd_cast<AVX2::int_v>(indexes);
    // so instead doing this, but compiler asks for simd_cast instead
    // construction is dodgy, though on AVX2 the sizes do match
      AVX2::int_v iv ( indexes ) ;
    // would feed indexes.data() as second argument, but no joy:
    //    error: invalid use of incomplete type ‘const class Vc_1::SimdArray<int, 8ul,
    //    Vc_1::Vector<int, Vc_1::VectorAbi::Avx>, 8ul>’
    // meaning of which I don't get, so stick with iv.data()
      d.v() = _mm256_i32gather_ps (mem, iv.data(), sizeof(float));
    }

mattkretz commented 7 years ago

Sorry for the long delay. Couldn't work for a week and then ... anyway. I'll take a stab at it. And I'll just give you commit rights, so if you want to push a feature branch, feel free.

mattkretz commented 7 years ago

can you please test my branch and see that it works for you? It should use gather intrinsics now for all supported entry types and even for converting loads on gathers. I have not done any benchmarks. It'd be great if you could share your results.

mattkretz commented 7 years ago

BTW, the whole implementation was so tricky because of the forwarding references used in the existing gatherImplementation function signature. Forwarding references are "greedy", so it's hard to get overloading right. That's what my first commit changes. The second one adds all the gather overloads for AVX2.

kfjahnke commented 7 years ago

Hi Matthias! I can't test anything just now because I'm traveling and offline most of the time, but once I get back in about two weeks' time I'll check it out straight away! Thank you for fixing it! Kay

kfjahnke commented 7 years ago

Hi Matthias!

Sorry for taking so long to reply - now I'm back on my system and online for some time again.

On 02.06.2017 at 17:32, Matthias Kretz wrote:

can you please test my branch and see that it works for you? It should use gather intrinsics now for all supported entry types and even for converting loads on gathers.

For a start, I have compiled my panorama viewer with your new code. This software uses lots of gather instructions. Compiling, installing and linking with the code from your AVX2 gather intrinsics branch showed no problems.

I have not done any benchmarks. It'd be great if you could share your results.

I used the pano viewer with a mode for benchmarking which only renders the frames but does not display them. There is a significant speed improvement. If I render 1000 frames from a spherical source image using bilinear interpolation, I get a speedup of ca. 10%, which is quite a lot since there is much more going on in the processing pipeline which has nothing to do with gathering. I doubt that more than this single figure will be of use to anyone (since interpreting the data would require looking at the code I'm using) - if you have any specific benchmarking in mind you'd like to see let me know, maybe I can help.

So from these preliminary tests I'd say that the implementation of gathering via intrinsics on AVX2 is definitely worthwhile, and your code does the job.

On 02.06.2017 at 17:33, Matthias Kretz wrote:

BTW, the whole implementation was so tricky because of the forwarding references used in the existing gatherImplementation function signature. Forwarding references are "greedy", so it's hard to get overloading right. That's what my first commit changes. The second one adds all the gather overloads for AVX2.

I did wonder what your rationale was to use rvalue references for the indexes in the first place. I agree that your current choice of const references is a good idea.

Kay

mattkretz commented 7 years ago

Nice, thanks for the tests. It's interesting to see those numbers. Take a look at section 5.4.1 in http://code.compeng.uni-frankfurt.de/attachments/13/Diplomarbeit.pdf. My (informed) guesswork in 2009 predicts such a speedup for a CERN track reconstruction code. The problem with gathers is that they can easily stall the CPU without allowing "useful work" to happen in parallel. In principle you see Amdahl's Law at work, i.e. speeding up the gathers improves the parallel-to-non-parallel ratio.

If you're interested, you might want to experiment with a code that works with two (or more) independent dependency chains, that are interleaved in such a way that while one "chain" gathers data, the other "chain" is doing arithmetics.

In any case, looks like the branch is ready for merge to master. I'll look into another 1.x release then.

I did wonder what your rationale was to use rvalue references for the indexes in the first place.

Note that those are not really rvalue refs; they can be, but they can also be lvalue references. T&& and auto&& are so-called forwarding references, since T/auto will be deduced as either an lvalue or rvalue reference type. And since forwarding references can bind to any value category, overloads must match exactly (i.e. const, volatile, and rvalue/lvalue ref), otherwise the forwarding reference overload wins (which is why they are called greedy). I used them here because I wanted a transparent interface that forwards the value category. This can make a difference for user-defined types with "strange" conversion semantics in combination with move-only types. But... who wants to use such craziness with a gather... :-)

kfjahnke commented 7 years ago

On 16.06.2017 at 14:48, Matthias Kretz wrote:

Nice, thanks for the tests. It's interesting to see those numbers. Take a look at section 5.4.1 in http://code.compeng.uni-frankfurt.de/attachments/13/Diplomarbeit.pdf. My (informed) guesswork in 2009 predicts such a speedup for a CERN track reconstruction code. The problem with gathers is that they can easily stall the CPU without allowing "useful work" to happen in parallel. In principle you see Amdahl's Law at work, i.e. speeding up the gathers improves the parallel-to-non-parallel ratio.

Your use case, the track reconstruction, isn't even so far from my problem (though much more complex). When evaluating a B-spline many times to interpolate from an image or volume etc., you have a lot of DDA, but the locations from which the coefficients are gathered often coincide or aren't too far apart, so chances are that the gather doesn't span too many cache lines.

If you're interested, you might want to experiment with a code that works with two (or more) independent dependency chains, that are interleaved in such a way that while one "chain" gathers data, the other "chain" is doing arithmetics.

I would be interested in anything which can make my code faster! But I can't quite figure out how to apply your advice to my problem. Evaluating a B-spline takes two steps: the first step is obtaining the 'weights' (by means of a matrix multiplication), and the second step is to apply the weights to a small subset of the coefficients in the vicinity of the interpolation location. This weighted summation results in a single interpolated value, and every value needs different weights and, possibly, a different coefficient window. The size of the window of coefficients increases with the spline's degree. So the operation uses many input values to produce one output value; buffering the input would be counterproductive. The whole reduction is performed with recursive code and keeps only the smallest state possible, but it is fully horizontally vectorized and sucks in weights and coefficients with load and gather operations, hence the speedup. But the gather is into nD data and therefore always reads from several cache lines. The gathering and the arithmetic are very closely interwoven and I see no way to disentangle the two.

If you like, have a glance at the implementation of one of my vectorized evaluation routines at

https://bitbucket.org/kfj/vspline/src/58db62df62ea017391f4f4344b4ffc6e78b7ee60/eval.h?at=master&fileviewer=file-view-default#eval.h-1176

In any case, looks like the branch is ready for merge to master. I'll look into another 1.x release then.

Well done. Big thumbs up. I've spent countless hours making my frame rendering times a few hundred microseconds shorter, and you've managed to shave off a full millisecond! Good thing I was mystified by the lack of speedup using the gathers - I wouldn't have thought they were plain not implemented!

I did wonder what your rationale was to use rvalue references for
the indexes in the first place.

Note that those are not really rvalue refs, they can be, but they can also be lvalue references. T&& and auto&& are so-called forwarding references, since T/auto will be deduced as either an lvalue or rvalue reference type. And since forwarding references can bind to any value category, overloads must match exactly (i.e. const, volatile, and rvalue/lvalue ref) otherwise the forwarding reference overload wins (which is why they are called greedy).

You are, of course, right. I'm slow to grasp these new concepts, having learnt C++ a long time ago, but now I've read up on the subject and I am coming closer to understanding it.

With regards Kay