Provide an utility function to optimally iterate with the appropriate pack size

gnzlbg commented 7 years ago

Having to partition the raw loops manually all the time seems repetitive an error prone.

I would like an boost::simd::iterate(size_t{from}, size_t{to}, binary_fun) function, that instantiates binary_fun for the appropriate pack sizes. First I show how the nbody example can be rewritten to become faster and correct (currently it only works for particle size multiple of the pack size), then I show how iterate can be implemented

The current n-body SIMD example looks like this:

void nbody_step(particles& ps)
{
  pack_t grav_con{6.67408e-11};
  pack_t epsilon{0.00125f};
  for (std::size_t i = 0; i < ps.size(); i += pack_t::static_size) {
    pack_t ax{0};
    pack_t ay{0};
    pack_t az{0};
    pack_t pix{&ps.x[i]};
    pack_t piy{&ps.y[i]};
    pack_t piz{&ps.z[i]};
    pack_t pim{&ps.m[i]};
    for (std::size_t j = i + 1; j < ps.size(); ++j) {
      auto pjx = bs::load<pack_t>(&ps.x[j]);
      auto pjy = bs::load<pack_t>(&ps.y[j]);
      auto pjz = bs::load<pack_t>(&ps.z[j]);
      auto dx = pjx - pix;
      auto dy = pjy - piy;
      auto dz = pjz - piz;
      auto inorm  = grav_con / bs::sqrt(dx * dx + dy * dy + dz * dz + epsilon);
      auto inorm3 = inorm * inorm * inorm;
      auto fi = bs::load<pack_t>(&ps.m[j]) * inorm3;
      auto fj = pim * inorm3;
      ax += dx * fi;
      ay += dy * fi;
      az += dz * fi;
      auto pjvx = bs::load<pack_t>(&ps.vx[j]);
      pjvx -= dx * fj;
      bs::store(pjvx, &ps.vx[j]);
      auto pjvy = bs::load<pack_t>(&ps.vy[j]);
      pjvy -= dy * fj;
      bs::store(pjvy, &ps.vy[j]);
      auto pjvz = bs::load<pack_t>(&ps.vz[j]);
      pjvz -= dz * fj;
      bs::store(pjvz, &ps.vz[j]);
    }
    pack_t pivx{&ps.vx[i]};
    pack_t pivy{&ps.vy[i]};
    pack_t pivz{&ps.vz[i]};
    pivx += ax;
    pivy += ay;
    pivz += az;
    pix += pivx;
    piy += pivy;
    piz += pivz;
    bs::aligned_store(pivx, &ps.vx[i]);
    bs::aligned_store(pivy, &ps.vy[i]);
    bs::aligned_store(pivz, &ps.vz[i]);
    bs::aligned_store(pix, &ps.x[i]);
    bs::aligned_store(piy, &ps.y[i]);
    bs::aligned_store(piz, &ps.z[i]);
  }
}

This only works if the number of particles is a multiple of the pack size. Rewriting it to use the max pack-size until the remainder, and then handle the remainder by using pack-size / 2 first, and then handling the remainder of that by then using pack-size /4.... until pack-size == 1, seems very repetitive and error prone.

I would like this example to be rewritten to something like this:

void nbody_step(particles& ps)
{
  pack_t grav_con{6.67408e-11};
  pack_t epsilon{0.00125f};
  simd_iterate(0, ps.size(), [](auto i, auto pack_type) {
    using pack_t = delctype(pack_type);
    pack_t grav_con{6.67408e-11};
    pack_t epsilon{0.00125f};
    pack_t ax{0};
    pack_t ay{0};
    pack_t az{0};
    pack_t pix{&ps.x[i]};
    pack_t piy{&ps.y[i]};
    pack_t piz{&ps.z[i]};
    pack_t pim{&ps.m[i]};
    simd_iterate(i + 1, ps.size(), [](auto j, auto pack_type2) {
      using pack2_t = delctype(pack_type2);
      auto pjx = bs::load<pack2_t>(&ps.x[j]);
      auto pjy = bs::load<pack2_t>(&ps.y[j]);
      auto pjz = bs::load<pack2_t>(&ps.z[j]);
      auto dx = pjx - pix;  // note: this requires pack2_t - pack_t to work!
      auto dy = pjy - piy;
      auto dz = pjz - piz;
      auto inorm  = grav_con / bs::sqrt(dx * dx + dy * dy + dz * dz + epsilon);
      auto inorm3 = inorm * inorm * inorm;
      auto fi = bs::load<pack2_t>(&ps.m[j]) * inorm3;
      auto fj = pim * inorm3;
      ax += dx * fi;
      ay += dy * fi;
      az += dz * fi;
      auto pjvx = bs::load<pack2_t>(&ps.vx[j]);
      pjvx -= dx * fj;
      bs::store(pjvx, &ps.vx[j]);
      auto pjvy = bs::load<pack2_t>(&ps.vy[j]);
      pjvy -= dy * fj;
      bs::store(pjvy, &ps.vy[j]);
      auto pjvz = bs::load<pack2_t>(&ps.vz[j]);
      pjvz -= dz * fj;
      bs::store(pjvz, &ps.vz[j]);
    });
    pack_t pivx{&ps.vx[i]};
    pack_t pivy{&ps.vy[i]};
    pack_t pivz{&ps.vz[i]};
    pivx += ax;
    pivy += ay;
    pivz += az;
    pix += pivx;
    piy += pivy;
    piz += pivz;
    bs::aligned_store(pivx, &ps.vx[i]);
    bs::aligned_store(pivy, &ps.vy[i]);
    bs::aligned_store(pivz, &ps.vz[i]);
    bs::aligned_store(pix, &ps.x[i]);
    bs::aligned_store(piy, &ps.y[i]);
    bs::aligned_store(piz, &ps.z[i]);
  });
}

which requires being able to mix different pack-sizes in the operations. The simd_iterate function, would, for a max pack size of 8, instantiate the closures 4 times, for sizes of 8, 4, 2, 1, and handle the remainder of the loop for size 8, by recursively calling itself with a starting simd pack of 4:

template<typename F, std::size_t SIMD_PACK_SIZE = pack::static_size>
void simd_iterate(size_t from, size_t to, F&& f, std::integral_constant<size_t, SIMD_PACKSIZE> = {}) {
  size_t end = // compute the end of the loop for SIMD_PACK_SIZe
  for (size_t i = from, e = end; i < e; ++i) f(i, pack_t<SIMD_PACK_SIZE>) 
  if (end == to) { return; }
  return simd_iterate(end, to, std::forward<F>(f), std::integral_constant<size_t, SIMD_PACKSIZE/2>{});
}

jfalcou commented 7 years ago

Multiple point :

operations between pack of different size (except for some special functions) don't follow the SIMD paradigm and can't be supported
Some architectures have penalties when switching pack size OR don't support more than one size efficiently
the nbody example is simple and don't deals with non-exact size for the sake of simplicity.

You want to play around with segmented_output_range and/or segmented_input_range that slice an arbitrary contiguous block of memory into a scalar prologue/epilogue and a SIMD-izable block. Have a look at transform to get a gist on how ot use it.

Now, this all points to the fact we need to promote those rnage adaptors first and foremost. Also, we have a WIP to add more algorithm liek for_each and find_if which also may helps.

gnzlbg commented 7 years ago

Thanks for the suggestions, I'll get started with those adaptors right away!

agenium-scale / boost.simd

Provide an utility function to optimally iterate with the appropriate pack size #450