Closed by mattkretz 7 years ago.
The place to implement it starts at https://github.com/VcDevel/Vc/blob/master/avx/vector.tcc#L390. Probably start with overloads, such as:
template <>
inline void AVX2::float_v::gatherImplementation(const float *mem, const IndexType &indexes)
{
    const auto iv = simd_cast<AVX2::int_v>(indexes);
    d.v() = _mm256_i32gather_ps(mem, iv.d.v(), sizeof(float));
}
Okay, thanks, this gives me the idea. I'll see how far I get and if I get stuck I'll get in touch again.
I forced my way through, based on your initial hint:
#ifdef Vc_IMPL_AVX2
template <>
template <>
inline void AVX2::float_v::gatherImplementation(const float *mem,
                                                const AVX2::int_v &indexes)
{
    // don't need this, have int_v already:
    // const auto iv = simd_cast<AVX2::int_v>(indexes);
    // can't do this because indexes.d is protected:
    // d.v() = _mm256_i32gather_ps(mem, indexes.d.v(), sizeof(float));
    // suppose this does the trick:
    d.v() = _mm256_i32gather_ps(mem, indexes.data(), sizeof(float));
}
template <>
template <>
inline void AVX2::float_v::gatherImplementation(const float *mem,
                                                const SimdArray<int, AVX2::float_v::Size> &indexes)
{
    // simd_cast doesn't work here:
    // const auto iv = simd_cast<AVX2::int_v>(indexes);
    // so instead doing this, but the compiler asks for simd_cast instead;
    // construction is dodgy, though on AVX2 the sizes do match
    AVX2::int_v iv(indexes);
    // would feed indexes.data() as the second argument, but no joy:
    // error: invalid use of incomplete type ‘const class Vc_1::SimdArray<int, 8ul,
    //        Vc_1::Vector<int, Vc_1::VectorAbi::Avx>, 8ul>’
    // meaning of which I don't get, so stick with iv.data()
    d.v() = _mm256_i32gather_ps(mem, iv.data(), sizeof(float));
}
This code runs, but of course I'm not happy with the second overload taking a SimdArray for the indexes, which is what I need for my code. With this code in place I don't get any speedup for my use case, so the data are probably too widely scattered.
Please tell me if the first overload is fine and propose something for the second one.
Kay
While I didn't get a speedup for my prefiltering code, it looks like the evaluation code profits from using the intrinsic: I'm getting some 10-15% rendering-time decrease here, which looks great.
Kay
Using .data() is the right solution, yes. For the SimdArray overload, you could try a generic one that forwards to int_v:
template <typename T>
template <>
inline void Vector<T, VectorAbi::Avx>::gatherImplementation(const T *mem,
    const SimdArray<int, AVX2::Vector<T>::Size> &indexes)
{
    gatherImplementation(mem, simd_cast<AVX2::int_v>(indexes));
}
That'll work correctly for T = {double, float, int, uint}. For short and ushort you'll have to overload again with:
template <>
template <>
inline void Vector<short, VectorAbi::Avx>::gatherImplementation(const short *mem,
    const SimdArray<int, AVX2::short_v::Size> &indexes)
{
    // don't know yet ;-)
}
So I tried this in vector.tcc:
template <>
template <typename T>
inline void Vector<T, VectorAbi::Avx>::gatherImplementation(const T *mem,
    const SimdArray<int, AVX2::Vector<T>::Size> &indexes)
{
    gatherImplementation(mem, simd_cast<AVX2::int_v>(indexes));
}
together with this prototype in gatherinterface.h:
template <typename MT>
inline void gatherImplementation(const MT *mem,
    const SimdArray<int, AVX2::Vector<MT>::Size> &indexes);
but it looks like there's something wrong with the prototype, because I get:
...
Vc/avx/vector.tcc:443:13: error: prototype for
‘void Vc_1::Vector<T, Vc_1::VectorAbi::Avx>::gatherImplementation
(const T, const Vc_1::SimdArray<int, Vc_1::Vector<T, Vc_1::VectorAbi::Avx>::Size>&)’
does not match any in class ‘Vc_1::Vector<T, Vc_1::VectorAbi::Avx>’
...
gatherinterface.h:66:17: error: candidates are:
template~~~~~~~
Took me a while, but I think the issue is that T is used for the class template specialization, which is not the way it was declared. So try this:
template <typename T>
inline void Vector<T, VectorAbi::Avx>::gatherImplementation(const T *mem,
    const SimdArray<int, AVX2::Vector<T>::Size> &indexes)
{
    gatherImplementation(mem, simd_cast<AVX2::int_v>(indexes));
}
together with this prototype in gatherinterface.h:
inline void gatherImplementation(const T *mem,
    const SimdArray<int, AVX2::Vector<T>::Size> &indexes);
On 19.05.2017 at 00:56, Matthias Kretz wrote:
Took me a while, but I think the issue is that T is used for the class template specialization, which is not the way it was declared. So try this:
template <typename T>
inline void Vector<T, VectorAbi::Avx>::gatherImplementation(const T *mem,
    const SimdArray<int, AVX2::Vector<T>::Size> &indexes)
{
    gatherImplementation(mem, simd_cast<AVX2::int_v>(indexes));
}
together with this prototype in gatherinterface.h:
inline void gatherImplementation(const T *mem,
    const SimdArray<int, AVX2::Vector<T>::Size> &indexes);
No, it doesn't work:
... error: invalid use of incomplete type ‘struct Vc_1::VectorTraits<float, Vc_1::VectorAbi::Avx>’ ...
and it sure goes on: make 2>&1 | wc gives 1771 6708 135585.
I feel we're getting nowhere and wasting time. Maybe you could try out what you suggest on your end? If I have a working template, I can take it from there, but if I run into errors straight away I am reduced to guesswork, and the code is very complex. I don't really want to have to understand all of it; you're in a much better position to make sense of the errors the compiler produces.
Kay
You're probably right. Can you please commit and push what you have? Then I can take it from there.
On 19.05.2017 at 10:40, Matthias Kretz wrote:
you're probably right. Can you please commit and push what you have, then I can take it from there.
I don't think I can push anything to the Vc repo; please correct me if I'm wrong.
Anyway, all I had so far is this prototype in gatherinterface.h:
template <typename MT>
inline void gatherImplementation(const MT *mem,
    const SimdArray<int, AVX2::Vector<MT>::Size> &indexes);
and this code in vector.tcc:
template <>
template <>
inline void AVX2::float_v::gatherImplementation(const float *mem,
                                                const AVX2::int_v &indexes)
{
    // don't need this, have int_v already:
    // const auto iv = simd_cast<AVX2::int_v>(indexes);
    d.v() = _mm256_i32gather_ps(mem, indexes.data(), sizeof(float));
}
template <>
template <>
inline void AVX2::float_v::gatherImplementation(const float *mem,
                                                const SimdArray<int, AVX2::float_v::Size> &indexes)
{
    // simd_cast doesn't work here:
    // const auto iv = simd_cast<AVX2::int_v>(indexes);
    AVX2::int_v iv(indexes);
    d.v() = _mm256_i32gather_ps(mem, iv.data(), sizeof(float));
}
Sorry for the long delay. Couldn't work for a week and then ... anyway. I'll take a stab at it. And I'll just give you commit rights, so if you want to push a feature branch, feel free.
Can you please test my branch and see that it works for you? It should use gather intrinsics now for all supported entry types, and even for converting loads on gathers. I have not done any benchmarks. It'd be great if you could share your results.
BTW, the whole implementation was so tricky because of the forwarding references used in the existing gatherImplementation function signature. Forwarding references are "greedy", so it's hard to get overloading right. That's what my first commit changes. The second one adds all the gather overloads for AVX2.
Hi Matthias! I can't test anything just now because I'm traveling and offline most of the time, but once I get back in about two weeks' time I'll check it out straight away! Thank you for fixing it! Kay
Hi Matthias!
Sorry for taking so long to reply - now I'm back on my system and online for some time again.
On 02.06.2017 at 17:32, Matthias Kretz wrote:
can you please test my branch and see that it works for you? It should use gather intrinsics now for all supported entry types and even for converting loads on gathers.
For a start, I have compiled my panorama viewer with your new code. This software uses lots of gather instructions. Compiling, installing and linking with the code from your AVX2 gather intrinsics branch showed no problems.
I have not done any benchmarks. It'd be great if you could share your results.
I used the pano viewer with a mode for benchmarking which only renders the frames but does not display them. There is a significant speed improvement. If I render 1000 frames from a spherical source image using bilinear interpolation, I get a speedup of ca. 10%, which is quite a lot since there is much more going on in the processing pipeline which has nothing to do with gathering. I doubt that more than this single figure will be of use to anyone (since interpreting the data would require looking at the code I'm using) - if you have any specific benchmarking in mind you'd like to see let me know, maybe I can help.
So from these preliminary tests I'd say that the implementation of gathering via intrinsics on AVX2 is definitely worthwhile, and your code does the job.
On 02.06.2017 at 17:33, Matthias Kretz wrote:
BTW, the whole implementation was so tricky because of the forwarding references used in the existing gatherImplementation function signature. Forwarding references are "greedy", so it's hard to get overloading right. That's what my first commit changes. The second one adds all the gather overloads for AVX2.
I did wonder what your rationale was to use rvalue references for the indexes in the first place. I agree that your current choice of const references is a good idea.
Kay
Nice, thanks for the tests. It's interesting to see those numbers. Take a look at section 5.4.1 in http://code.compeng.uni-frankfurt.de/attachments/13/Diplomarbeit.pdf. My (informed) guesswork in 2009 predicts such a speedup for a CERN track reconstruction code. The problem with gathers is that they can easily stall the CPU without allowing "useful work" to happen in parallel. In principle you see Amdahl's Law at work, i.e. speeding up the gathers improves the parallel-to-non-parallel ratio.
If you're interested, you might want to experiment with a code that works with two (or more) independent dependency chains, that are interleaved in such a way that while one "chain" gathers data, the other "chain" is doing arithmetics.
In any case, looks like the branch is ready for merge to master. I'll look into another 1.x release then.
I did wonder what your rationale was to use rvalue references for the indexes in the first place.
Note that those are not really rvalue refs; they can be, but they can also be lvalue references. T&& and auto&& are so-called forwarding references, since T/auto will be deduced as either an lvalue or rvalue reference type. And since forwarding references can bind to any value category, overloads must match exactly (i.e. const, volatile, and rvalue/lvalue ref), otherwise the forwarding reference overload wins (which is why they are called greedy).
I used them here because I wanted a transparent interface that forwards the value category. This can make a difference for user-defined types with "strange" conversion semantics in combination with move-only types. But... who wants to use such craziness with a gather... :-)
On 16.06.2017 at 14:48, Matthias Kretz wrote:
Nice, thanks for the tests. It's interesting to see those numbers. Take a look at section 5.4.1 in http://code.compeng.uni-frankfurt.de/attachments/13/Diplomarbeit.pdf. My (informed) guesswork in 2009 predicts such a speedup for a CERN track reconstruction code. The problem with gathers is that they can easily stall the CPU without allowing "useful work" to happen in parallel. In principle you see Amdahl's Law at work, i.e. speeding up the gathers improves the parallel-to-non-parallel ratio.
Your use case, the track reconstruction, isn't even so far from my problem (though it is much more complex). When evaluating a B-spline many times to interpolate from an image or volume etc., you have a lot of DDA, but the locations from which the coefficients are gathered often coincide or aren't too far apart, so chances are that the gather doesn't span too many cache lines.
If you're interested, you might want to experiment with a code that works with two (or more) independent dependency chains, that are interleaved in such a way that while one "chain" gathers data, the other "chain" is doing arithmetics.
I would be interested in anything that can make my code faster! But I can't quite figure out how to apply your advice to my problem. Evaluating a B-spline takes two steps: the first step is obtaining the 'weights' (by means of a matrix multiplication), and the second step is to apply the weights to a small subset of the coefficients in the vicinity of the interpolation location. This weighted summation results in a single interpolated value, and every value needs different weights and, possibly, a different coefficient window. The size of the window of coefficients increases with the spline's degree.
So the operation uses many input values to produce one output value; buffering the input would be counterproductive. The whole reduction is performed with recursive code and keeps only the smallest state possible, but it is fully horizontally vectorized and sucks in weights and coefficients with load and gather operations, hence the speedup. But the gather is into nD data and therefore it always reads from several cache lines. The gathering and the arithmetic are very closely interwoven and I see no way to disentangle the two. If you like, have a glance at the implementation of one of my vectorized evaluation routines at
In any case, looks like the branch is ready for merge to master. I'll look into another 1.x release then.
Well done. Big thumbs up. I've spent countless hours making my frame rendering times a few hundred microseconds shorter, and you've managed to shave off a full millisecond! Good thing I was mystified by the lack of speedup using the gathers - I wouldn't have thought they were plain not implemented!
I did wonder what your rationale was to use rvalue references for the indexes in the first place.
Note that those are not really rvalue refs; they can be, but they can also be lvalue references. T&& and auto&& are so-called forwarding references, since T/auto will be deduced as either an lvalue or rvalue reference type. And since forwarding references can bind to any value category, overloads must match exactly (i.e. const, volatile, and rvalue/lvalue ref), otherwise the forwarding reference overload wins (which is why they are called greedy).
You are, of course, right. I'm slow to grasp these new concepts, having learnt C++ a long time ago, but now I've read up on the subject and I am coming closer to understanding it.
With regards, Kay
Reference: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gather&techs=MMX,SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,FMA