Open makortel opened 5 years ago
In https://github.com/cms-patatrack/cmssw/pull/100#issuecomment-462651523 @VinInn commented
about SOA "abstractions": this is the best I managed to get so far: https://github.com/VinInn/ctest/blob/master/eigen/soaGeom.cpp
Previous incarnations (not very satisfactory to my taste):
https://github.com/VinInn/ctest/blob/master/eigen/soaGeomV2.cpp
https://github.com/VinInn/ctest/blob/master/eigen/soaGeomV1.cpp
they are all CUDA compliant
Some downsides regarding general use
In production there is no need for runtime sizing: if one size does not fit all, we just compile different sizes and template everything. (For HLT I would first like to see a REAL USE CASE for runtime sizing.) Dynamic allocation (i.e. resizing) of a SoA is simply not an option.
I was thinking to enforce alignment at cache-line level (if not at page level). These are not typically large blocks of memory... At the moment one can use "alignas", but in the end one will have to use either the C++17 "new" with alignment or aligned_alloc (AND YES, better the user knows that). On the CUDA side my understanding is that everything is taken care of.
I would point out that "dynamic allocation" and "resizing" are different: we can certainly have data structures (including SoA's) with a size determined at construction time, that do not support resizing.
I agree that, in general, we do not want to resize structures once the event data has been filled.
But I would say that we want to be able to handle collections whose size we do not know a priori.
I would point out that "dynamic allocation" and "resizing" are different: we can certainly have data structures (including SoA's) with a size determined at construction time, that do not support resizing.
But I would say that we want to be able to handle collections whose size we do not know a priori.
This was exactly the case I meant. I rephrased my earlier comment to avoid further confusion.
On the CUDA side my understanding is that everything is taken care of.
For the beginning of the memory block yes, but with
struct Foo {
  float pt[100];
  float eta[100];
};
the element Foo::eta would not be 128-byte aligned.
so we need to static assert that N is a multiple of 128 (which is sane anyway)
so we need to static assert that N is a multiple of 128 (which is sane anyway)
Yup. But the alignment is still a user's responsibility.
What about std::aligned_storage and/or alignas(...)?
What about std::aligned_storage and/or alignas(...)? Sure:
template <typename M, int S>
struct alignas(128) ScalarSOA {
  using Scalar = M;
  constexpr Scalar & operator()(uint32_t i) { return data[i]; }
  constexpr const Scalar operator()(uint32_t i) const { return data[i]; }
  Scalar data[S];
  static_assert(sizeof(data) % 128 == 0);
};
see
https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAKxAEZSAbAQwDtRkBSAJgCFufSAZ1QBXYskwgA5NwDMeFsgYisAag6yAwsgC2TAgg3YOABgCCchUpWZ1W5IIL4WBI6Ytd5i5Wo2bseMCYLAD0miSYbuaW3jZ2/oHBIQDimKg6mATEAJ5RFtGeVj62fgBumMhExHkx1r5aGTokubLGBV51JVpMDMAkhAg6NeZoLI6YAB4ADsSqIgoEslwA%2BgSqehMAchAAlOoA7DyqxJliLKoArABUtCZcACwafPsAIu4iggrAqgBqsnYvVQBIIsEAgH4VKqyABmT3enzYv1oFwBQMSoJAAFl9MQ8BM/NCGKh9KRkaS3LI%2BNFqfkzI5iCJKqoACoAZQA8qyDlSzKpfv8pqhPgQ8KgWHDzHy/ut0sECCIhpT3FLkao0KUmLjWBIJbzVITiWtkAhNUFdRxXubzARMDopswbX4CNkpsEmBlVJjSKoFqpWXl6Yy1j1AiwmIIILQuAAOPbYrJ4jm8i08vkfL5%2B5A9TWo52usMezFg1lZ5jVSnqSVzBHfbFTVHA4Jgut%2BL2qEzexsY1kJrCtsEAJVQAHdBGYCOE7XgGJhmXgMldWaR/a07MYlVX04jNHWG%2Bjm0wpn5Ro5Pd6O2iQcXe5EtEWQEPR%2BPJ1Np7P55hF8ujGvdSEQmmNaeoee5Xlih6tuenb7iAACSLAsJgxA9rifZaCu67rjy/6ARmO4gRogJdgeR72GKp5thexFwQhSEoXgaGaBhFI8u4ypquRNrTLMu6oK6xD6CQuwQPMLhLKsPp7NyxynMQ5x1hA6D6Ew/B4DscKWlWJ5cTMqj4fWfFIYJxDCaJiwrGsakcWMawpjJ8pyXpClKQQKm8GpGlvDSfIltmswuSpFw8KyVz3o%2BY4Tukr4znOC73uEDARS%2Bb6xZEFxvBueqOPoeDIMs4aCEhBAQJ8ABeaTQopynqVwFxRtGhGESY6mZRaGWsdatr2vot6aHmbqFt6vrMeYgZMiGwBhhG9V7L5ZZJty7Fbt8c05oRnpPJWtJ8tpky6atszcAAbKohkCVUpkLOJllSXZJwOecAUcEFeDPe1bXsbt3HWaeB2nfxxmXWJFmST9tmHPZZyqE9L1vZ5bFVn9MPBXDmV8tlIp5QVRUlXg5WoJVAU1XVMaNbILzNealqtTSNp2g6vXDQGWRBqoE1TZGMZ7GynLCEwi1VvGuITEmfh/N%2Bq6CsKorimjwEJiL7IWFo4sYTKGQuAqup8kLiZK2LZJq%2Bqmp4Nqt6ppmfmi1oBokkbJrEGarXUx1ZhMCIRCqAARvMDDoLsAt6ny91Q446BgnoADWmDLCILB4AAjiIvU86yfN%2BBs2wtZhuxU15tJjWsPCoBMgd8tRZjQtCCiYC%2BTCVDCqhZKwgjQiQipHOx0omgw0IAOoMQYVp6nT3WOlo/UFrYAAKsh5DtnF7bMZnXT6YwMZgECz2D6hcCdUx7NpZd8qo/7u57h3k6oEAHwAdJqAnZLs99exGOwALQ9/3g8IPfxCP7nOWfJz6oGOKiCAzcxht2IDoK4d8H5MCfjsF%2Bb9P49G/ugAwf8AEtQtsHWS5wIAmUploSmvAiGRhapoMhPAKFcCoSQvBp8AIh0clASBrd25wOQQgpBKCA5%2BD5F/AemDf68N2DwhgDBAEWw%2BtEF2CNaTuC%2BrpFeIMWDMgEsgSOghUT3BMAATiOsPUoqAGLQ1QGYTkJUnBgnKJUEgfhU4/mOs3KgghvTF1LkfY63tvRhzsZCRxWgzLRlWC4veMlBB7BTIo9GtiQAcOgToCBAl3G3y9pgYAChdikDcYIW%2BwR/Y7FICcApmTsksFyQjAxpgDHqCCsdN6EAQFg18QQdSENWHnC9rfBQnwsAQNvlLQgMtcFtXGTSORSjzCmPMegVASYICp3Tg0JgWxAHYB
3u0tJrJiSeJLtsyJXtvRqIklcKJMTDhxN3rIGYTBgB6FUMkTQmgfSlCwEeKs0Dr6tNeuTEwABVTa/zNAaK0Toza/BVJXKYWU56PBXrpVRL0/pG9UlMCoHspgwyhSjLFBANSky9TTL5GxTSZgpAlIYNIC4UhSCgikCYelqBpCaGhbwVQwgxASFubQelBBpDMp2CUyOIALhcFvlwfYRijr7AeBcfY9x9jRkYNIe49LGXMtIKyqQ9LBAgA7IKplVLSBwFgEgNAU4ZxkAoBAK10UkIgGANGLgpBq4MBtMQA1EAvZCvpV7BQmpsjSH5aQK1GsCDshYAwENJrSBYD0GwGc/qE14BOJUPA5QDXxsmBUD2kgpBhoWJgGlRb6UMDwF7ASORwhYFDQK3EOh/XUtYOwDlAhK2v3gBAKeIBMDolIOUYgiSGQsEjiKnVUwRTkWkO/dk0JwxrHfnoUoEwSbQlUCu6Eeh9WiHEJIWg1LaWatTbqiY0YjrvyOvcVQwBkDIFUNGKV19cCEBIHy70yUbV8r2Oy3g/ABUttFXQSV9wuAGKVfsC4JgjoxnuLIWQR01VSA1Qys90h9WGryS2s1MBEAgFEAQKYHtyCUAdW%2BEd9BMD4CqHQUgw4BJTAbceqQdL0Pxt1XIWgqhhwDFUBeq9N670PqfVKoDJrJ1ir3rffY%2Bw4OyFoHJ6DFwb0IZQ2hrVLLMNCGw8a4VeGe2WqipRsj9qTM2pQMwNgqwx2R3ddOL1Pq/XxsDWGHIDbw3pEjdG2NqbE1tpTfG/AGaRTZtTXm5ABbPMlrLdqrtNbsh1sLWGhMzby2trYCgDtjAq0GtgH2gdIIh1OpZuOydfEZ1jDnQupde6eWHtY%2BxrTOrpCCevbepQbbr5lYna%2B2jH7uNfos0hW59C9Idok8KkDR1b4mHuEY/RDwTBLAMRB2QGnT2cZ0wao1uHzUEaIyRggZmKOWbdTR99VGGNMZYyh5rGGZCeC4Lx/j7XhNdcRKkuOE6ptUpA2Bm9%2BxaBLFkCYEwrrkTRnuJtjj2rdVYb25J1jXAtvw50/p/7JXvUyxAPcIAA%3D%3D
a possible solution to keep the SOA fixed size and allow dynamic size setting is to bucketize (ASOA) see https://godbolt.org/z/-_aqEP or https://github.com/VinInn/ctest/blob/master/eigen/trivialASOA.cpp
provided the stride is a power of 2 it is quite trivial (and fast) to iterate and do random access....
and here the cuda version https://godbolt.org/z/VjCSRw or https://github.com/VinInn/ctest/blob/master/eigen/trivialASOA.cu
See also #344 .
Here is my attempt to write a SoA that:
Currently it does not support some features, but I plan to add them later:
The fields are something like
struct {
  double x, y, z;
  uint16_t colour;
  int32_t value;
};
#include <cassert>
#include <cstddef>
#include <cstdint>

// compile-time sized SoA
template <size_t SIZE, size_t ALIGN=0>
struct SoA {
  // these could be moved to an external type trait to free up the symbol names
  static const size_t size = SIZE;
  static const size_t alignment = ALIGN;

  // AoS-like (row-wise) accessor to individual elements
  struct element {
    element(SoA & soa, size_t index) :
      x(soa.x[index]),
      y(soa.y[index]),
      z(soa.z[index]),
      colour(soa.colour[index]),
      value(soa.value[index])
    { }

    double & x;
    double & y;
    double & z;
    uint16_t & colour;
    int32_t & value;
  };

  // AoS-like (row-wise) access to individual elements
  element operator[](size_t index) { return element(*this, index); }

  // data members
  alignas(ALIGN) double x[SIZE];
  alignas(ALIGN) double y[SIZE];
  alignas(ALIGN) double z[SIZE];
  alignas(ALIGN) uint16_t colour[SIZE];
  alignas(ALIGN) int32_t value[SIZE];
};

int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;
  // SoA-like access
  soa.z[7] = 42;
  // AoS-like access
  soa[7].z = 42;
  // check that SoA-like and AoS-like access the same elements
  assert(& soa.z[7] == & (soa[7].z));
}
Note that we could also use soa.z(7) instead of soa.z()[7], if people think it would be cleaner.
#include <cassert>
#include <cstddef>
#include <cstdint>

// compile-time sized SoA
template <size_t SIZE, size_t ALIGN=0>
struct SoA {
  // these could be moved to an external type trait to free up the symbol names
  static const size_t size = SIZE;
  static const size_t alignment = ALIGN;

  // AoS-like (row-wise) accessor to individual elements
  struct element {
    element(SoA & soa, size_t index) :
      soa_(soa),
      index_(index)
    { }

    double & x() { return soa_.x_[index_]; }
    double & y() { return soa_.y_[index_]; }
    double & z() { return soa_.z_[index_]; }
    uint16_t & colour() { return soa_.colour_[index_]; }
    int32_t & value() { return soa_.value_[index_]; }

    SoA & soa_;
    const size_t index_;
  };

  // AoS-like (row-wise) access to individual elements
  element operator[](size_t index) { return element(*this, index); }

  // accessors
  double* x() { return x_; }
  double* y() { return y_; }
  double* z() { return z_; }
  uint16_t* colour() { return colour_; }
  int32_t* value() { return value_; }

private:
  // data members
  alignas(ALIGN) double x_[SIZE];
  alignas(ALIGN) double y_[SIZE];
  alignas(ALIGN) double z_[SIZE];
  alignas(ALIGN) uint16_t colour_[SIZE];
  alignas(ALIGN) int32_t value_[SIZE];
  alignas(ALIGN) const char* name_[SIZE];
};

int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;
  // SoA-like access
  soa.z()[7] = 42;
  // AoS-like access
  soa[7].z() = 42;
  // check that SoA-like and AoS-like access the same elements
  assert(& soa.z()[7] == & (soa[7].z()));
}
The syntax for the declaration has the same flexibility, if one wants to play a bit with it.
Note: scroll right for the \
continuations.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <boost/preprocessor/cat.hpp>
#include <boost/preprocessor/stringize.hpp>
#include <boost/preprocessor/seq/for_each.hpp>
#include <boost/preprocessor/variadic/to_seq.hpp>
/* declare SoA data members; these should expand to, for example
 *
 *   alignas(ALIGN) double x_[SIZE];
 *
 */
#define _DECLARE_SOA_DATA_MEMBER_IMPL(TYPE, NAME) \
alignas(ALIGN) TYPE BOOST_PP_CAT(NAME, _[SIZE]);
#define _DECLARE_SOA_DATA_MEMBER(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_DATA_MEMBER_IMPL TYPE_NAME)
#define _DECLARE_SOA_DATA_MEMBERS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DATA_MEMBER, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare SoA accessors; these should expand to, for example
*
* double* x() { return x_; }
*
*/
#define _DECLARE_SOA_ACCESSOR_IMPL(TYPE, NAME) \
TYPE* NAME() { return BOOST_PP_CAT(NAME, _); }
#define _DECLARE_SOA_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare AoS-like element accessors; these should expand to, for example
*
* double & x() { return * (soa_.x() + index_); }
*
*/
#define _DECLARE_SOA_ELEMENT_ACCESSOR_IMPL(TYPE, NAME) \
TYPE & NAME() { return * (soa_. NAME () + index_); }
#define _DECLARE_SOA_ELEMENT_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ELEMENT_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ELEMENT_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ELEMENT_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* compile-time sized SoA
*/
#define DECLARE_SOA_TEMPLATE(CLASS, ...) \
template <size_t SIZE, size_t ALIGN=0> \
struct CLASS { \
\
/* these could be moved to an external type trait to free up the symbol names */ \
using self_type = CLASS; \
static const size_t size = SIZE; \
static const size_t alignment = ALIGN; \
\
/* introspection */ \
static void dump() { \
std::cout << #CLASS "<" << SIZE << ", " << ALIGN << ">: " << std::endl; \
std::cout << " sizeof(...): " << sizeof(CLASS) << std::endl; \
std::cout << " alignof(...): " << alignof(CLASS) << std::endl; \
_DECLARE_SOA_DUMP_INFOS(__VA_ARGS__) \
std::cout << std::endl; \
} \
\
/* AoS-like accessor to individual elements */ \
struct element { \
element(CLASS & soa, size_t index) : \
soa_(soa), \
index_(index) \
{ } \
\
_DECLARE_SOA_ELEMENT_ACCESSORS(__VA_ARGS__) \
\
private: \
CLASS & soa_; \
const size_t index_; \
}; \
\
/* AoS-like accessor */ \
element operator[](size_t index) { return element(*this, index); } \
\
/* accessors */ \
_DECLARE_SOA_ACCESSORS(__VA_ARGS__) \
\
private: \
/* data members */ \
_DECLARE_SOA_DATA_MEMBERS(__VA_ARGS__) \
\
}
DECLARE_SOA_TEMPLATE(SoA,
  (double, x),
  (double, y),
  (double, z),
  (uint16_t, colour),
  (int32_t, value)
);
int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;
  // SoA-like access
  soa.z()[7] = 42;
  // AoS-like access
  soa[7].z() = 42;
  // check that SoA-like and AoS-like access the same elements
  assert(& soa.z()[7] == & (soa[7].z()));
}
I realised that we may want to have scalar values that are unique for the whole SoA, instead of per-element.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <boost/preprocessor.hpp>
/* declare "scalars" (one value shared across the whole SoA) and "columns" (one value per element) */
#define SoA_scalar(TYPE, NAME) (0, TYPE, NAME)
#define SoA_column(TYPE, NAME) (1, TYPE, NAME)
/* declare SoA data members; these should expand to, for columns:
*
* alignas(ALIGN) double x_[SIZE];
*
* and for scalars:
*
* double x_;
*
*/
#define _DECLARE_SOA_DATA_MEMBER_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
alignas(ALIGN) TYPE BOOST_PP_CAT(NAME, _[SIZE]); \
, \
TYPE BOOST_PP_CAT(NAME, _); \
)
#define _DECLARE_SOA_DATA_MEMBER(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_DATA_MEMBER_IMPL TYPE_NAME)
#define _DECLARE_SOA_DATA_MEMBERS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DATA_MEMBER, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare SoA accessors; these should expand to, for columns:
*
* double* x() { return x_; }
*
* and for scalars:
*
* double& x() { return x_; }
*
*/
#define _DECLARE_SOA_ACCESSOR_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
TYPE* NAME() { return BOOST_PP_CAT(NAME, _); } \
, \
TYPE& NAME() { return BOOST_PP_CAT(NAME, _); } \
)
#define _DECLARE_SOA_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare AoS-like element accessors; these should expand to, for columns:
*
* double & x() { return * (soa_.x() + index_); }
*
* and for scalars:
*
* double & x() { return soa_.x(); }
*
*/
#define _DECLARE_SOA_ELEMENT_ACCESSOR_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
TYPE & NAME() { return * (soa_. NAME () + index_); } \
, \
TYPE & NAME() { return soa_. NAME (); } \
)
#define _DECLARE_SOA_ELEMENT_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ELEMENT_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ELEMENT_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ELEMENT_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* compile-time sized SoA
*/
#define declare_SoA_template(CLASS, ...) \
template <size_t SIZE, size_t ALIGN=0> \
struct CLASS { \
\
/* these could be moved to an external type trait to free up the symbol names */ \
using self_type = CLASS; \
static const size_t size = SIZE; \
static const size_t alignment = ALIGN; \
\
/* AoS-like accessor to individual elements */ \
struct element { \
element(CLASS & soa, size_t index) : \
soa_(soa), \
index_(index) \
{ } \
\
_DECLARE_SOA_ELEMENT_ACCESSORS(__VA_ARGS__) \
\
private: \
CLASS & soa_; \
const size_t index_; \
}; \
\
/* AoS-like accessor */ \
element operator[](size_t index) { return element(*this, index); } \
\
/* accessors */ \
_DECLARE_SOA_ACCESSORS(__VA_ARGS__) \
\
private: \
/* data members */ \
_DECLARE_SOA_DATA_MEMBERS(__VA_ARGS__) \
\
}
// declare a statically-sized SoA, templated on the column size and (optional) alignment
declare_SoA_template(SoA,
// predefined static scalars
// size_t size;
// size_t alignment;
// columns: one value per element
SoA_column(double, x),
SoA_column(double, y),
SoA_column(double, z),
SoA_column(uint16_t, colour),
SoA_column(int32_t, value),
// scalars: one value for the whole structure
SoA_scalar(const char *, description)
);
int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;
  // SoA-like access
  soa.z()[7] = 42;
  // AoS-like access
  soa[7].z() = 42;
  // check that SoA-like and AoS-like access the same elements
  assert(& soa.z()[7] == & (soa[7].z()));
}
@vkhristenko if you are interested
I had a chance to discuss with Axel Huebl (a former alpaka developer) at the ATPESC training (he was there as a student as well). He pointed me to the LLAMA (Low Level Abstraction of Memory Access) library that the same team has been developing.
On a cursory look it seems interesting (might not solve everything but at least has interesting constructs)
Nested structs like
struct Pixel {
  struct {
    float r;
    float g;
    float b;
  } color;
  char alpha;
};
are expressed like
struct color {};
struct alpha {};
struct r {};
struct g {};
struct b {};
using Pixel = llama::DatumStruct<
  llama::DatumElement<color, llama::DatumStruct<
    llama::DatumElement<r, float>,
    llama::DatumElement<g, float>,
    llama::DatumElement<b, float>
  >>,
  llama::DatumElement<alpha, char>
>;
(there are shorthands llama::DS and llama::DE for DatumStruct and DatumElement, respectively)
Note how DatumStructs can be included within other DatumStructs (good for composability). A DatumElement is uniquely identified by its tag class, i.e. if different DatumStructs have DatumElements with the same tag class, LLAMA knows that these elements are of the same type (and allows some neat tricks with them).
Assuming a 3D array of Pixel objects, the element access
array[1][2][3].color.g = 1.0;
is expressed in LLAMA as
view( 1, 2, 3 )( color(), g() ) = 1.0;
// or
view( 1, 2, 3 ).access<color, g>() = 1.0;
This interface is always the same regardless of the underlying memory structure (AoS, SoA, something more complicated).
For more information I suggest taking a look at the documentation. It is quite short and has nice examples of what the interface looks like.
For more information I suggest taking a look at the documentation. It is quite short and has nice examples of what the interface looks like.
@makortel thanks for the pointers. I've had a look and so far I am not overly fond of the syntax and complexity...
A SoA prototype with variable sized arrays is available here: https://github.com/ericcano/soa/blob/master/soa_v6.h. It introduces an AoS-like syntax, soa[i].x. This should help porting AoS code to SoA. We do not have array iterator support yet, so this is not a drop-in replacement for all use cases, but iterating on the index works.
A test port to HGCAL is available in this branch: https://github.com/ericcano/cmssw/tree/hgcal-clue (forked from: https://github.com/b-fontana/cmssw/tree/clue_porting_to_cmssw).
On 29 Jun, 2021, at 4:40 PM, Eric Cano @.***> wrote:
A SoA with variable sized arrays prototype is available here: https://github.com/ericcano/soa/blob/master/soa_v6.h. It introduces an AoS-like syntax soa[i].x.
A test port to HGCAL is available in this branch: https://github.com/ericcano/cmssw/tree/hgcal-clue (forked from: https://github.com/b-fontana/cmssw/tree/clue_porting_to_cmssw). Could you please share a more direct pointer to the code where this is implemented?
v.
The main macro is used to define a SoA class for HGCal. The AoS-like syntax is achieved with operator[] returning a struct of SoAValues, which behave like scalars through cast operators.
do you have ptx (even for some simple examples)?
I finished a test (which generates a SoA GPU side and checks the contents CPU side). The generated PTX is in this gist.
The assignment takes two steps to compute the address before accessing global memory:
///data/user/cano/HGCAL/CMSSW_12_0_X_2021-06-23-2300/src/CUDADataFormats/Common/interface/SoAmacros.h:28 SOA_HOST_DEVICE T& operator= (const T2& v) { col_[idx_] = v; return col_[idx_]; }
.loc 2 28 80, function_name Linfo_string0, inlined_at 1 79 5
shl.b64 %rd17, %rd1, 3;
add.s64 %rd18, %rd16, %rd17;
st.global.f64 [%rd18], %fd2;
would be interesting to compare w/r/t a naive implementation of a cross product:
v3[i].x = v1[i].y * v2[i].z - v1[i].z * v2[i].y;
v3[i].y = v1[i].z * v2[i].x - v1[i].x * v2[i].z;
v3[i].z = v1[i].x * v2[i].y - v1[i].y * v2[i].x;
I made various flavors of the cross product in this repository. The PTX is not very appealing as we get tons of integer computations for few floating point. I suspect and hope this is better optimized during the conversion from PTX to assembly (did not check yet).
On the C++ syntax front, we can note that the same templated function could be used with a mixture of a SoA and AoS.
Finally, the C++ language does not allow creating non-const references to temporaries (and all compilers, gcc and nvcc alike, abide by this rule). Workarounds like the following are probably unavoidable when calling functions that take references (const references are not affected):
auto ri = r[i];
crossProduct(ri, a[i], b[i]);
This is what Eigen is able to do: https://godbolt.org/z/jrGbcq5jE
And here is a little example that compiles and runs: https://github.com/VinInn/ctest/blob/master/eigen/dynSoa.cpp
It also demonstrates that there is no problem in assigning to a temporary, at least with Eigen. And NO, I do not think that we shall rewrite Eigen.
The issue was simply a compilation mistake: using the -G option cancels all optimizations. With this fixed, we get very similar PTX for the Eigen-based cross product and for the hand-made one. At run time, on a T4 (iterating on the same SoA of 10k elements, using doubles), we get 6.9 us for the hand-made version and 6.3 us for the Eigen-based one.
The main difference between the two seems to be the ratio of integer computations vs pointer conversions (cvta.to.global.u64). The hand-made version passes one pointer per vector coordinate, while the Eigen one has one pointer per vector only, at the cost of more integer operations: 14 integer operations + 3 pointer conversions for the Eigen version, 10 integer operations + 9 pointer conversions for the hand-made version.
The hand-made version also loads more parameters.
On the floating point side, we have 6 multiplications and 3 subtractions in both cases, as one would expect.
That would hint that a SoA system using more pointer arithmetic and less pointer storage could gain a bit of performance. I will try another flavor, where we store offsets instead of pointers.
Another direction of investigation is to integrate support for Eigen structures as columns: the current example uses the fact that we declared x, y and z next to each other to make a vector out of the 3. This should be automated to avoid code repetition and caveats (matrices would be impractical to declare in the current fashion).
Finally, this example uses a separate data block, and descriptor class pointing to it. We could merge the two together, further limiting the passing of big parameters to the kernel (from a structure to a single pointer). (I will leave this aside for the moment).
let me note that in my example https://godbolt.org/z/K61q8zMW5 CUDA loads the input through non-coherent memory (faster and separate from the standard cache)
you may also note that in your PTX there are 12 loads while in Eigen just 6 (as there should be)
another unrelated point we may wish to report to NVidia is that clang for instance generates mul neg fma instead of mul mul sub https://godbolt.org/z/xjbMsoed7 but is not able to load from non-coherent memory...
Thanks, @VinInn , the double loading from global memory (with a cache hit on the second, but still...) was a significant difference. Adding the restrict attributes enabled both the use of the non-coherent cache and the removal of the duplicate load. This leads to a marginal gain in both cases (~50ns), but the difference remains (now 6.90us vs 6.25us).
another unrelated point we may wish to report to NVidia is that clang for instance generates mul neg fma instead of mul mul sub https://godbolt.org/z/xjbMsoed7 but is not able to load from non-coherent memory...
Indeed. There are also integer multiplications by constant powers of 2 instead of shifts in the clang-compiled code.
Also, nvcc uses mad on integers (but not fma). It could be a deliberate choice too, though, avoiding the extra neg for performance at the detriment of precision.
The offset based version of the SoA (as opposed to pointer based) did allow the reduction of address conversions, but at the detriment of many things (more parameters, loss of non-coherent cache use, re-introduction of multiple fetches of the same variables from global memory). It is version 8 in the same repository. The new PTX for the indirectCrossProduct kernel is here.
Here are 2 measurements of the kernel timings (event based for a loop of 20 kernels) and via nvprof for both versions:
$ nvprof ./test_v7_cuda
Running ..==784533== NVPROF is profiling process 784533, command: ./test_v7_cuda
...indirectCrossProductSoA time=160.352 us.
.eigenCrossProductSoA time=156.032 us.
Type Time(%) Time Calls Avg Min Max Name
21.92% 139.14us 21 6.6250us 6.3360us 7.3920us indirectCrossProduct
20.22% 128.35us 20 6.4170us 6.1120us 6.8160us eigenCrossProduct
$ nvprof ./test_v8_cuda
Running ..==784574== NVPROF is profiling process 784574, command: ./test_v8_cuda
...indirectCrossProductSoA time=164.832 us.
.eigenCrossProductSoA time=155.84 us.
Type Time(%) Time Calls Avg Min Max Name
22.46% 143.42us 21 6.8290us 6.6240us 7.2960us indirectCrossProduct
19.89% 127.01us 20 6.3500us 6.1440us 6.6240us eigenCrossProduct
These numbers are slightly different from the previous ones, which I got via an nvprof dump viewed in nvvp (but this new way is faster).
With more statistics, version 7 is actually quite close to the Eigen one. Since the Eigen version is slightly advantaged (we do some pointer computations host side), this is not that far off.
One positive outcome of this exercise is that the SoA layout could be changed without modifying the client code.
The next and more promising step is to add support for Eigen fixed size types to the pointer based version (v7).
I do not understand why more parameters: if you have any "struct", say double x,y,z; int n,m; bool k;, the offsets are easy to compute at compile time (a double is 2 ints or 8 bools); no difference w/r/t say bool[3*8+2*4+1]; so it is just
auto y(int i) const { return *((double const*)(storage) + (1*1)*stride + i); }
auto n(int i) const { return *((int const*)(storage) + (3*2)*stride + i); }
auto k(int i) const { return *((bool const*)(storage) + (3*8+2*4)*stride + i); }
if I got the parentheses right
This is all with runtime defined size. v7 keeps one pointer per column, v8 has one global base pointer + one offset per column. So in v8, we get one extra parameter per SoA to load (the base pointer) on top of the offsets for any member used. In v7, we directly load the pointers.
The SoA structure is fully defined by a pointer, the number of elements (and an optional byte alignment parameter), so we could also pass just the first 2 and re-compute the SoA kernel side too.
Here is the generated v7 constructor, where the members x_, y_, etc. are pointers:
SoA(std::byte *mem, size_t nElements, size_t byteAlignment = defaultAlignment) :
    mem_(mem), nElements_(nElements), byteAlignment_(byteAlignment) {
  auto curMem = mem_;
  x_ = reinterpret_cast<double *>(curMem);
  curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
  y_ = reinterpret_cast<double *>(curMem);
  curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
  z_ = reinterpret_cast<double *>(curMem);
  curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
  colour_ = reinterpret_cast<uint16_t *>(curMem);
  curMem += (((nElements_ * sizeof(uint16_t) - 1) / byteAlignment_) + 1) * byteAlignment_;
  value_ = reinterpret_cast<int32_t *>(curMem);
  curMem += (((nElements_ * sizeof(int32_t) - 1) / byteAlignment_) + 1) * byteAlignment_;
  py_ = reinterpret_cast<double **>(curMem);
  curMem += (((nElements_ * sizeof(double *) - 1) / byteAlignment_) + 1) * byteAlignment_;
  description_ = reinterpret_cast<const char **>(curMem);
  curMem += (((sizeof(const char *) - 1) / byteAlignment_) + 1) * byteAlignment_;
  if (mem_ + computeDataSize(nElements_, byteAlignment_) != curMem)
    throw std::out_of_range("In SoA::SoA: unexpected end pointer.");
}
In all examples, this computation is done host side, but we could also push it device side and re-construct the SoA descriptor there. The mem buffer is pre-allocated by the user, with a size automatically computed by the static member function computeDataSize().
@VinInn , yes. I used byte alignment, while you have index alignment (which ensures byte alignment and simplicity, at the cost of a bit more padding).
I will try that too, it's indeed going to solve the parameter problem.
This will require integrating the size of column members (yColumnPosition = xColumnPosition + sizeof(yElement)), which might be tricky in the current macro-based generation (to be investigated). The macros are written once and for all and work fine, but they are not easy to debug.
Which brings me to this question: what about using an external code generator (like a Python script, or some pre-existing tool)? This would not change the principle of what we have currently: a description of the structure turned into code by an engine.
Right now the description is:
declare_SoA_template(SoA,
// predefined static scalars
// size_t size;
// size_t alignment;
// columns: one value per element
SoA_column(double, x),
SoA_column(double, y),
SoA_column(double, z),
SoA_column(uint16_t, colour),
SoA_column(int32_t, value),
SoA_column(double *, py),
// scalars: one value for the whole structure
SoA_scalar(const char *, description)
);
The engine is soa_vX.h, and generated code is in the preprocessed output.
Does it sound like a good or a bad idea to use an external tool? That would require some support from the build system, obviously.
I personally dislike both macros and code generators (not that complex template manipulation is better, still). The questions about code generation are (the usual): 1) scram integration 2) git integration 3) lxr/dxr/doxygen integration 4) debugging 5) platform dependency 6) ...
Some random comments, as I'm not fully following the discussion.
I would prefer to avoid an external code generator if C++ is enough. About macros vs templates I have mixed feelings: templates are more integrated in the language and (more) type safe than macros, but not flexible enough for a clean declaration syntax. Macros turned out to be a bit lacking as well, but less so.
About individual pointers vs. base pointer + deltas: I assumed individual pointers would be better based on some comments Vincenzo made some time ago - but I don't have any evidence either way. An advantage would be that they would allow slicing a SoA (like the EDM approach), a disadvantage is the amount of parameters being passed and/or an extra indirection.
For deltas I would use byte addressing, because I expect SoAs to have mixed types.
My summary: under the assumptions that 1) optimizing the memory size beyond what the allocation for the AoS would do is not the goal (and I remind that we allocate memory by powers of 2, so in practice it will be very rare to allocate twice as much memory as needed), 2) PR review will catch badly padded "structs" (only a memory optimization, no issue with delta computation), 3) anybody able to code 2K lines of CUDA should also be able to manage a pointer and some offsets, 4) the byte alignment can be decided at compile time (and most probably once for all),
something like this may work and seem to generate optimal access code (comparable to Eigen)
a) declare the struct as for AoS: say struct S { double x, y, z; int n, m; bool k; };
b) allocate storage with say auto storage = malloc(sizeof(S)*stride); (AoS and SoA occupy the same memory)
b2) where stride is nElements rounded up for byte alignment
c) declare the SoA accessors as
auto y(int i) const { return *((double const*)(storage + offsetof(S,y)*stride) + i); }
auto n(int i) const { return *((int const*)(storage + offsetof(S,n)*stride) + i); }
auto k(int i) const { return *((bool const*)(storage + offsetof(S,k)*stride) + i); }
d) pass storage and stride to the kernel (and nElements — even if stride could be computed from nElements on the fly, one may reuse the same storage for different nElements, or wish to access only a limited range).
How this is generated is up to you.
The version with embedded Eigen structures now works and passes the tests. I made a few variants, including trying to push the strict minimum (data pointer + size) and re-constructing the SoA descriptor device side (the "local object" variant). The resulting PTX is here, while the test timings are:
Type Time(%) Time Calls Avg Min Max Name
8.92% 139.23us 21 6.6300us 6.4000us 6.8480us indirectCrossProductSoA
8.45% 131.78us 20 6.5880us 6.4320us 7.1040us embeddedCrossProductLocalObjectSoA
8.14% 127.07us 20 6.3530us 6.1120us 6.5920us eigenCrossProductSoA
7.89% 123.14us 20 6.1560us 6.0480us 6.3360us embeddedCrossProductSoA
Regarding the PTX, it is pretty well understood, except for the use of the non-coherent cache, which I did not manage to get for the embedded (Eigen vector) versions. There are no duplicate loads, and the parameter loads are as one would expect, matching the data used in the kernel. The extreme case of the locally constructed object is not the fastest, in spite of loading only 2 parameters; the extra integer computations probably explain it. The embedded cross product is the fastest, in spite of not using the NC cache. Saving the construction of the Eigen vectors might explain this (I did not count the integer operations precisely).
The embedding code is generic, so we could throw any fixed sized Eigen element into the SoA. The macro based declaration is like this:
declare_SoA_template(SoA,
// columns: one value per element
SoA_column(double, x),
SoA_column(double, y),
SoA_column(double, z),
SoA_eigenColumn(Eigen::Vector3d, a),
SoA_eigenColumn(Eigen::Vector3d, b),
SoA_eigenColumn(Eigen::Vector3d, r),
SoA_column(uint16_t, colour),
SoA_column(int32_t, value),
SoA_column(double *, py),
// scalars: one value for the whole structure
SoA_scalar(const char *, description)
);
and the SoA is constructed and used like this (example from the locally constructed descriptor in the kernel, data block is already populated):
bool deviceConstructor = true;
testSoA::SoA soa(deviceConstructor, data, nElements);
soa[i].r() = soa[i].a().cross(soa[i].b());
I did not manage to get the cast trick to work here, so we might end up with a slightly heavier syntax, with parentheses to access members (for uniformity). The boolean was necessary to get a device-only constructor; it is not used.
@VinInn , I'm finally getting back to the SoA business.
Since I'm still catching up, could you explain your example ?
and here is a little example that compiles and runs: https://github.com/VinInn/ctest/blob/master/eigen/dynSoa.cpp
which also demonstrates that there is no problem in assigning to a temporary at least with Eigen.
In particular, where is the assignment to a temporary ?
After sitting with Eric and looking at his approach, I now understand that the problem is not in "assigning to a temporary", but in "passing a non-const reference to a temporary".
In fact, we don't need to do that: the element we have in our SoA approach already behaves like a reference, so we can just pass it around by value.
We clearly need a good SoA abstraction to work with.
Some feature considerations for explicit memory
cudaMemcpyAsync()
With unified memory the situation would of course be simpler with