Open makortel opened 5 years ago
In https://github.com/cms-patatrack/cmssw/pull/100#issuecomment-462651523 @VinInn commented
about SOA "abstractions": this is the best I managed to get so far: https://github.com/VinInn/ctest/blob/master/eigen/soaGeom.cpp
Previous incarnations (not very satisfactory to my taste):
https://github.com/VinInn/ctest/blob/master/eigen/soaGeomV2.cpp
https://github.com/VinInn/ctest/blob/master/eigen/soaGeomV1.cpp
they are all CUDA compliant
Some downsides regarding general use
In production there is no need for runtime sizing: if one size does not fit all, we just compile different sizes and template everything. (For HLT I would first like to see a REAL USE CASE for runtime sizing.) Dynamic allocation (i.e. resizing) of a SoA is simply not an option.
I was thinking to enforce alignment at cache-line level (if not at page level). These are not typically large blocks of memory... At the moment one can use "alignas", but in the end one will have to use either the C++17 "new" with alignment or aligned_alloc (AND YES, better the user knows that). On the CUDA side my understanding is that everything is taken care of.
I would point out that "dynamic allocation" and "resizing" are different: we can certainly have data structures (including SoA's) with a size determined at construction time, that do not support resizing.
I agree that, in general, we do not want to resize structures once the event data has been filled.
But I would say that we want to be able to handle collections whose size we do not know a priori.
I would point out that "dynamic allocation" and "resizing" are different: we can certainly have data structures (including SoA's) with a size determined at construction time, that do not support resizing.
But I would say that we want to be able to handle collections whose size we do not know a priori.
This was exactly the case I meant. I rephrased my earlier comment to avoid further confusion.
On the CUDA side my understanding is that everything is taken care of.
For the beginning of the memory block yes, but with
struct Foo {
  float pt[100];
  float eta[100];
};
the element Foo::eta would not be 128-byte aligned.
so we need to static assert that N is a multiple of 128 (which is sane anyway)
so we need to static assert that N is a multiple of 128 (which is sane anyway)
Yup. But the alignment is still a user's responsibility.
What about std::aligned_storage and/or alignas(...)?
What about std::aligned_storage and/or alignas(...)? Sure:
template <typename M, int S>
struct alignas(128) ScalarSOA {
  using Scalar = M;
  constexpr Scalar & operator()(uint32_t i) { return data[i]; }
  constexpr const Scalar operator()(uint32_t i) const { return data[i]; }
  Scalar data[S];
  static_assert(sizeof(data) % 128 == 0);
};
see
https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAKxAEZSAbAQwDtRkBSAJgCFufSAZ1QBXYskwgA5NwDMeFsgYisAag6yAwsgC2TAgg3YOABgCCchUpWZ1W5IIL4WBI6Ytd5i5Wo2bseMCYLAD0miSYbuaW3jZ2/oHBIQDimKg6mATEAJ5RFtGeVj62fgBumMhExHkx1r5aGTokubLGBV51JVpMDMAkhAg6NeZoLI6YAB4ADsSqIgoEslwA%2BgSqehMAchAAlOoA7DyqxJliLKoArABUtCZcACwafPsAIu4iggrAqgBqsnYvVQBIIsEAgH4VKqyABmT3enzYv1oFwBQMSoJAAFl9MQ8BM/NCGKh9KRkaS3LI%2BNFqfkzI5iCJKqoACoAZQA8qyDlSzKpfv8pqhPgQ8KgWHDzHy/ut0sECCIhpT3FLkao0KUmLjWBIJbzVITiWtkAhNUFdRxXubzARMDopswbX4CNkpsEmBlVJjSKoFqpWXl6Yy1j1AiwmIIILQuAAOPbYrJ4jm8i08vkfL5%2B5A9TWo52usMezFg1lZ5jVSnqSVzBHfbFTVHA4Jgut%2BL2qEzexsY1kJrCtsEAJVQAHdBGYCOE7XgGJhmXgMldWaR/a07MYlVX04jNHWG%2Bjm0wpn5Ro5Pd6O2iQcXe5EtEWQEPR%2BPJ1Np7P55hF8ujGvdSEQmmNaeoee5Xlih6tuenb7iAACSLAsJgxA9rifZaCu67rjy/6ARmO4gRogJdgeR72GKp5thexFwQhSEoXgaGaBhFI8u4ypquRNrTLMu6oK6xD6CQuwQPMLhLKsPp7NyxynMQ5x1hA6D6Ew/B4DscKWlWJ5cTMqj4fWfFIYJxDCaJiwrGsakcWMawpjJ8pyXpClKQQKm8GpGlvDSfIltmswuSpFw8KyVz3o%2BY4Tukr4znOC73uEDARS%2Bb6xZEFxvBueqOPoeDIMs4aCEhBAQJ8ABeaTQopynqVwFxRtGhGESY6mZRaGWsdatr2vot6aHmbqFt6vrMeYgZMiGwBhhG9V7L5ZZJty7Fbt8c05oRnpPJWtJ8tpky6atszcAAbKohkCVUpkLOJllSXZJwOecAUcEFeDPe1bXsbt3HWaeB2nfxxmXWJFmST9tmHPZZyqE9L1vZ5bFVn9MPBXDmV8tlIp5QVRUlXg5WoJVAU1XVMaNbILzNealqtTSNp2g6vXDQGWRBqoE1TZGMZ7GynLCEwi1VvGuITEmfh/N%2Bq6CsKorimjwEJiL7IWFo4sYTKGQuAqup8kLiZK2LZJq%2Bqmp4Nqt6ppmfmi1oBokkbJrEGarXUx1ZhMCIRCqAARvMDDoLsAt6ny91Q446BgnoADWmDLCILB4AAjiIvU86yfN%2BBs2wtZhuxU15tJjWsPCoBMgd8tRZjQtCCiYC%2BTCVDCqhZKwgjQiQipHOx0omgw0IAOoMQYVp6nT3WOlo/UFrYAAKsh5DtnF7bMZnXT6YwMZgECz2D6hcCdUx7NpZd8qo/7u57h3k6oEAHwAdJqAnZLs99exGOwALQ9/3g8IPfxCP7nOWfJz6oGOKiCAzcxht2IDoK4d8H5MCfjsF%2Bb9P49G/ugAwf8AEtQtsHWS5wIAmUploSmvAiGRhapoMhPAKFcCoSQvBp8AIh0clASBrd25wOQQgpBKCA5%2BD5F/AemDf68N2DwhgDBAEWw%2BtEF2CNaTuC%2BrpFeIMWDMgEsgSOghUT3BMAATiOsPUoqAGLQ1QGYTkJUnBgnKJUEgfhU4/mOs3KgghvTF1LkfY63tvRhzsZCRxWgzLRlWC4veMlBB7BTIo9GtiQAcOgToCBAl3G3y9pgYAChdikDcYIW%2BwR/Y7FICcApmTsksFyQjAxpgDHqCCsdN6EAQFg18QQdSENWHnC9rfBQnwsAQNvlLQgMtcFtXGTSORSjzCmPMegVASYICp3Tg0JgWxAHYB
3u0tJrJiSeJLtsyJXtvRqIklcKJMTDhxN3rIGYTBgB6FUMkTQmgfSlCwEeKs0Dr6tNeuTEwABVTa/zNAaK0Toza/BVJXKYWU56PBXrpVRL0/pG9UlMCoHspgwyhSjLFBANSky9TTL5GxTSZgpAlIYNIC4UhSCgikCYelqBpCaGhbwVQwgxASFubQelBBpDMp2CUyOIALhcFvlwfYRijr7AeBcfY9x9jRkYNIe49LGXMtIKyqQ9LBAgA7IKplVLSBwFgEgNAU4ZxkAoBAK10UkIgGANGLgpBq4MBtMQA1EAvZCvpV7BQmpsjSH5aQK1GsCDshYAwENJrSBYD0GwGc/qE14BOJUPA5QDXxsmBUD2kgpBhoWJgGlRb6UMDwF7ASORwhYFDQK3EOh/XUtYOwDlAhK2v3gBAKeIBMDolIOUYgiSGQsEjiKnVUwRTkWkO/dk0JwxrHfnoUoEwSbQlUCu6Eeh9WiHEJIWg1LaWatTbqiY0YjrvyOvcVQwBkDIFUNGKV19cCEBIHy70yUbV8r2Oy3g/ABUttFXQSV9wuAGKVfsC4JgjoxnuLIWQR01VSA1Qys90h9WGryS2s1MBEAgFEAQKYHtyCUAdW%2BEd9BMD4CqHQUgw4BJTAbceqQdL0Pxt1XIWgqhhwDFUBeq9N670PqfVKoDJrJ1ir3rffY%2Bw4OyFoHJ6DFwb0IZQ2hrVLLMNCGw8a4VeGe2WqipRsj9qTM2pQMwNgqwx2R3ddOL1Pq/XxsDWGHIDbw3pEjdG2NqbE1tpTfG/AGaRTZtTXm5ABbPMlrLdqrtNbsh1sLWGhMzby2trYCgDtjAq0GtgH2gdIIh1OpZuOydfEZ1jDnQupde6eWHtY%2BxrTOrpCCevbepQbbr5lYna%2B2jH7uNfos0hW59C9Idok8KkDR1b4mHuEY/RDwTBLAMRB2QGnT2cZ0wao1uHzUEaIyRggZmKOWbdTR99VGGNMZYyh5rGGZCeC4Lx/j7XhNdcRKkuOE6ptUpA2Bm9%2BxaBLFkCYEwrrkTRnuJtjj2rdVYb25J1jXAtvw50/p/7JXvUyxAPcIAA%3D%3D
a possible solution to keep the SOA fixed size and allow dynamic size setting is to bucketize (ASOA) see https://godbolt.org/z/-_aqEP or https://github.com/VinInn/ctest/blob/master/eigen/trivialASOA.cpp
provided the stride is a power of 2 it is quite trivial (and fast) to iterate and do random access....
and here the cuda version https://godbolt.org/z/VjCSRw or https://github.com/VinInn/ctest/blob/master/eigen/trivialASOA.cu
See also #344 .
Here is my attempt to write a SoA that:
Currently it does not support some features, but I plan to add them later:
The fields are something like
struct {
  double x, y, z;
  uint16_t colour;
  int32_t value;
};
#include <cassert>
#include <cstddef>
#include <cstdint>

// compile-time sized SoA
template <size_t SIZE, size_t ALIGN=0>
struct SoA {
  // these could be moved to an external type trait to free up the symbol names
  static const size_t size = SIZE;
  static const size_t alignment = ALIGN;

  // AoS-like (row-wise) accessor to individual elements
  struct element {
    element(SoA & soa, size_t index) :
      x(soa.x[index]),
      y(soa.y[index]),
      z(soa.z[index]),
      colour(soa.colour[index]),
      value(soa.value[index])
    { }

    double & x;
    double & y;
    double & z;
    uint16_t & colour;
    int32_t & value;
  };

  // AoS-like (row-wise) access to individual elements
  element operator[](size_t index) { return element(*this, index); }

  // data members
  alignas(ALIGN) double x[SIZE];
  alignas(ALIGN) double y[SIZE];
  alignas(ALIGN) double z[SIZE];
  alignas(ALIGN) uint16_t colour[SIZE];
  alignas(ALIGN) int32_t value[SIZE];
};

int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;
  // SoA-like access
  soa.z[7] = 42;
  // AoS-like access
  soa[7].z = 42;
  // check that SoA-like and AoS-like access the same elements
  assert(& soa.z[7] == & (soa[7].z));
}
Note that we could also use soa.z(7) instead of soa.z()[7], if people think it would be cleaner.
#include <cassert>
#include <cstddef>
#include <cstdint>

// compile-time sized SoA
template <size_t SIZE, size_t ALIGN=0>
struct SoA {
  // these could be moved to an external type trait to free up the symbol names
  static const size_t size = SIZE;
  static const size_t alignment = ALIGN;

  // AoS-like (row-wise) accessor to individual elements
  struct element {
    element(SoA & soa, size_t index) :
      soa_(soa),
      index_(index)
    { }

    double & x() { return soa_.x_[index_]; }
    double & y() { return soa_.y_[index_]; }
    double & z() { return soa_.z_[index_]; }
    uint16_t & colour() { return soa_.colour_[index_]; }
    int32_t & value() { return soa_.value_[index_]; }

    SoA & soa_;
    const size_t index_;
  };

  // AoS-like (row-wise) access to individual elements
  element operator[](size_t index) { return element(*this, index); }

  // accessors
  double* x() { return x_; }
  double* y() { return y_; }
  double* z() { return z_; }
  uint16_t* colour() { return colour_; }
  int32_t* value() { return value_; }

private:
  // data members
  alignas(ALIGN) double x_[SIZE];
  alignas(ALIGN) double y_[SIZE];
  alignas(ALIGN) double z_[SIZE];
  alignas(ALIGN) uint16_t colour_[SIZE];
  alignas(ALIGN) int32_t value_[SIZE];
  alignas(ALIGN) const char* name_[SIZE];
};

int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;
  // SoA-like access
  soa.z()[7] = 42;
  // AoS-like access
  soa[7].z() = 42;
  // check that SoA-like and AoS-like access the same elements
  assert(& soa.z()[7] == & (soa[7].z()));
}
The syntax for the declaration has the same flexibility, if one wants to play a bit with it.
Note: scroll right for the \
continuations.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <boost/preprocessor/cat.hpp>
#include <boost/preprocessor/stringize.hpp>
#include <boost/preprocessor/seq/for_each.hpp>
#include <boost/preprocessor/variadic/to_seq.hpp>
/* declare SoA data members; these should expand to, for example
 *
 *   alignas(ALIGN) double x_[SIZE];
 *
 */
#define _DECLARE_SOA_DATA_MEMBER_IMPL(TYPE, NAME) \
alignas(ALIGN) TYPE BOOST_PP_CAT(NAME, _[SIZE]);
#define _DECLARE_SOA_DATA_MEMBER(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_DATA_MEMBER_IMPL TYPE_NAME)
#define _DECLARE_SOA_DATA_MEMBERS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DATA_MEMBER, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare SoA accessors; these should expand to, for example
*
* double* x() { return x_; }
*
*/
#define _DECLARE_SOA_ACCESSOR_IMPL(TYPE, NAME) \
TYPE* NAME() { return BOOST_PP_CAT(NAME, _); }
#define _DECLARE_SOA_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare AoS-like element accessors; these should expand to, for example
*
* double & x() { return * (soa_.x() + index_); }
*
*/
#define _DECLARE_SOA_ELEMENT_ACCESSOR_IMPL(TYPE, NAME) \
TYPE & NAME() { return * (soa_. NAME () + index_); }
#define _DECLARE_SOA_ELEMENT_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ELEMENT_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ELEMENT_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ELEMENT_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* compile-time sized SoA
*/
#define DECLARE_SOA_TEMPLATE(CLASS, ...) \
template <size_t SIZE, size_t ALIGN=0> \
struct CLASS { \
\
/* these could be moved to an external type trait to free up the symbol names */ \
using self_type = CLASS; \
static const size_t size = SIZE; \
static const size_t alignment = ALIGN; \
\
/* introspection */ \
static void dump() { \
std::cout << #CLASS "<" << SIZE << ", " << ALIGN << ">: " << std::endl; \
std::cout << " sizeof(...): " << sizeof(CLASS) << std::endl; \
std::cout << " alignof(...): " << alignof(CLASS) << std::endl; \
_DECLARE_SOA_DUMP_INFOS(__VA_ARGS__) \
std::cout << std::endl; \
} \
\
/* AoS-like accessor to individual elements */ \
struct element { \
element(CLASS & soa, size_t index) : \
soa_(soa), \
index_(index) \
{ } \
\
_DECLARE_SOA_ELEMENT_ACCESSORS(__VA_ARGS__) \
\
private: \
CLASS & soa_; \
const size_t index_; \
}; \
\
/* AoS-like accessor */ \
element operator[](size_t index) { return element(*this, index); } \
\
/* accessors */ \
_DECLARE_SOA_ACCESSORS(__VA_ARGS__) \
\
private: \
/* data members */ \
_DECLARE_SOA_DATA_MEMBERS(__VA_ARGS__) \
\
}
DECLARE_SOA_TEMPLATE(SoA,
  (double, x),
  (double, y),
  (double, z),
  (uint16_t, colour),
  (int32_t, value)
);
int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;
  // SoA-like access
  soa.z()[7] = 42;
  // AoS-like access
  soa[7].z() = 42;
  // check that SoA-like and AoS-like access the same elements
  assert(& soa.z()[7] == & (soa[7].z()));
}
I realised that we may want to have scalar values that are unique for the whole SoA, instead of per-element.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <boost/preprocessor.hpp>
/* declare "scalars" (one value shared across the whole SoA) and "columns" (one value per element) */
#define SoA_scalar(TYPE, NAME) (0, TYPE, NAME)
#define SoA_column(TYPE, NAME) (1, TYPE, NAME)
/* declare SoA data members; these should expand to, for columns:
*
* alignas(ALIGN) double x_[SIZE];
*
* and for scalars:
*
* double x_;
*
*/
#define _DECLARE_SOA_DATA_MEMBER_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
alignas(ALIGN) TYPE BOOST_PP_CAT(NAME, _[SIZE]); \
, \
TYPE BOOST_PP_CAT(NAME, _); \
)
#define _DECLARE_SOA_DATA_MEMBER(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_DATA_MEMBER_IMPL TYPE_NAME)
#define _DECLARE_SOA_DATA_MEMBERS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_DATA_MEMBER, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare SoA accessors; these should expand to, for columns:
*
* double* x() { return x_; }
*
* and for scalars:
*
* double& x() { return x_; }
*
*/
#define _DECLARE_SOA_ACCESSOR_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
TYPE* NAME() { return BOOST_PP_CAT(NAME, _); } \
, \
TYPE& NAME() { return BOOST_PP_CAT(NAME, _); } \
)
#define _DECLARE_SOA_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* declare AoS-like element accessors; these should expand to, for columns:
*
* double & x() { return * (soa_.x() + index_); }
*
* and for scalars:
*
* double & x() { return soa_.x(); }
*
*/
#define _DECLARE_SOA_ELEMENT_ACCESSOR_IMPL(IS_COLUMN, TYPE, NAME) \
BOOST_PP_IIF(IS_COLUMN, \
TYPE & NAME() { return * (soa_. NAME () + index_); } \
, \
TYPE & NAME() { return soa_. NAME (); } \
)
#define _DECLARE_SOA_ELEMENT_ACCESSOR(R, DATA, TYPE_NAME) \
BOOST_PP_EXPAND(_DECLARE_SOA_ELEMENT_ACCESSOR_IMPL TYPE_NAME)
#define _DECLARE_SOA_ELEMENT_ACCESSORS(...) \
BOOST_PP_SEQ_FOR_EACH(_DECLARE_SOA_ELEMENT_ACCESSOR, ~, BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__))
/* compile-time sized SoA
*/
#define declare_SoA_template(CLASS, ...) \
template <size_t SIZE, size_t ALIGN=0> \
struct CLASS { \
\
/* these could be moved to an external type trait to free up the symbol names */ \
using self_type = CLASS; \
static const size_t size = SIZE; \
static const size_t alignment = ALIGN; \
\
/* AoS-like accessor to individual elements */ \
struct element { \
element(CLASS & soa, size_t index) : \
soa_(soa), \
index_(index) \
{ } \
\
_DECLARE_SOA_ELEMENT_ACCESSORS(__VA_ARGS__) \
\
private: \
CLASS & soa_; \
const size_t index_; \
}; \
\
/* AoS-like accessor */ \
element operator[](size_t index) { return element(*this, index); } \
\
/* accessors */ \
_DECLARE_SOA_ACCESSORS(__VA_ARGS__) \
\
private: \
/* data members */ \
_DECLARE_SOA_DATA_MEMBERS(__VA_ARGS__) \
\
}
// declare a statically-sized SoA, templated on the column size and (optional) alignment
declare_SoA_template(SoA,
// predefined static scalars
// size_t size;
// size_t alignment;
// columns: one value per element
SoA_column(double, x),
SoA_column(double, y),
SoA_column(double, z),
SoA_column(uint16_t, colour),
SoA_column(int32_t, value),
// scalars: one value for the whole structure
SoA_scalar(const char *, description)
);
int main(void) {
  // 10 elements, aligned on a 32-byte boundary
  SoA<10, 32> soa;
  // SoA-like access
  soa.z()[7] = 42;
  // AoS-like access
  soa[7].z() = 42;
  // check that SoA-like and AoS-like access the same elements
  assert(& soa.z()[7] == & (soa[7].z()));
}
@vkhristenko if you are interested
I had a chance to discuss with Axel Huebl (a former alpaka developer) at the ATPESC training (he was there as a student as well). He pointed me to the LLAMA (Low Level Abstraction of Memory Access) library that the same team has been developing.
On a cursory look it seems interesting (might not solve everything but at least has interesting constructs)
Nested structs like
struct Pixel {
  struct {
    float r;
    float g;
    float b;
  } color;
  char alpha;
};
are expressed like
struct color {};
struct alpha {};
struct r {};
struct g {};
struct b {};
using Pixel = llama::DatumStruct<
  llama::DatumElement<color, llama::DatumStruct<
    llama::DatumElement<r, float>,
    llama::DatumElement<g, float>,
    llama::DatumElement<b, float>
  >>,
  llama::DatumElement<alpha, char>
>;
(there are shorthands llama::DS and llama::DE for DatumStruct and DatumElement, respectively)
Note how DatumStructs can be included within other DatumStructs (good for composability). A DatumElement is uniquely identified by its tag class, i.e. if different DatumStructs have DatumElements with the same tag class, LLAMA knows that these elements are of the same type (and allows some neat tricks with them).
Assuming a 3D array of Pixel objects, the element access
array[1][2][3].color.g = 1.0;
is expressed in LLAMA as
view( 1, 2, 3 )( color(), g() ) = 1.0;
// or
view( 1, 2, 3 ).access<color, g>() = 1.0;
This interface is always the same regardless of the underlying memory structure (AoS, SoA, something more complicated).
For more information I suggest taking a look at the documentation. It is quite short and has nice examples of what the interface looks like.
For more information I suggest taking a look at the documentation. It is quite short and has nice examples of what the interface looks like.
@makortel thanks for the pointers. I've had a look and so far I am not overly fond of the syntax and complexity...
A SoA prototype with variable sized arrays is available here: https://github.com/ericcano/soa/blob/master/soa_v6.h. It introduces an AoS-like syntax, soa[i].x. This should help porting AoS code to SoA. We do not have array iterator support yet, so this is not a drop-in replacement for all use cases, but iterating on the index works.
A test port to HGCAL is available in this branch: https://github.com/ericcano/cmssw/tree/hgcal-clue (forked from: https://github.com/b-fontana/cmssw/tree/clue_porting_to_cmssw).
On 29 Jun, 2021, at 4:40 PM, Eric Cano @.***> wrote:
A SoA with variable sized arrays prototype is available here: https://github.com/ericcano/soa/blob/master/soa_v6.h. It introduces an AoS-like syntax soa[i].x.
A test port to HGCAL is available in this branch: https://github.com/ericcano/cmssw/tree/hgcal-clue (forked from: https://github.com/b-fontana/cmssw/tree/clue_porting_to_cmssw). Could you please share a more direct pointer to the code where this is implemented?
v.
The main macro is used to define a SoA class for HGCal. The AoS-like syntax is achieved with operator[] returning a struct of SoAValues, which behave like scalars through cast operators.
do you have ptx (even for some simple examples)?
I finished a test (which generates a SoA GPU side and checks the contents CPU side). The generated PTX is in this gist.
The assignment takes two steps to compute the address before accessing global memory:
///data/user/cano/HGCAL/CMSSW_12_0_X_2021-06-23-2300/src/CUDADataFormats/Common/interface/SoAmacros.h:28 SOA_HOST_DEVICE T& operator= (const T2& v) { col_[idx_] = v; return col_[idx_]; }
.loc 2 28 80, function_name Linfo_string0, inlined_at 1 79 5
shl.b64 %rd17, %rd1, 3;
add.s64 %rd18, %rd16, %rd17;
st.global.f64 [%rd18], %fd2;
would be interesting to compare w/r/t a naive implementation of a cross product:
v3[i].x = v1[i].y * v2[i].z - v1[i].z * v2[i].y;
v3[i].y = v1[i].z * v2[i].x - v1[i].x * v2[i].z;
v3[i].z = v1[i].x * v2[i].y - v1[i].y * v2[i].x;
I made various flavors of the cross product in this repository. The PTX is not very appealing as we get tons of integer computations for few floating point. I suspect and hope this is better optimized during the conversion from PTX to assembly (did not check yet).
On the C++ syntax front, we can note that the same templated function could be used with a mixture of a SoA and AoS.
Finally, the C++ language does not allow creating non-const references to temporaries (and all compilers, gcc and nvcc alike, abide by this rule). Workarounds like the following are probably unavoidable when calling functions that take references (const references are not affected):
auto ri = r[i];
crossProduct(ri, a[i], b[i]);
This is what Eigen is able to do: https://godbolt.org/z/jrGbcq5jE
And here is a little example that compiles and runs: https://github.com/VinInn/ctest/blob/master/eigen/dynSoa.cpp
It also demonstrates that there is no problem in assigning to a temporary, at least with Eigen. And NO, I do not think that we shall rewrite Eigen.
The issue was simply a compilation mistake: using the -G option cancels all optimizations. With this fixed, we get very similar PTX for the Eigen-based cross product and for the hand-made one. At run time, on a T4 (iterating on the same SoA of 10k elements, using doubles), we get 6.9 us for the hand-made version and 6.3 us for the Eigen-based one.
The main difference between the two seems to be the ratio of integer computations vs pointer conversions (cvta.to.global.u64). The hand-made version passes one pointer per vector coordinate, while the Eigen one has one pointer per vector only, at the cost of more integer operations: 14 integer operations + 3 pointer conversions for the Eigen version, 10 integer operations + 9 pointer conversions for the hand-made version.
The hand-made version also loads more parameters.
On the floating point side, we have 6 multiplications and 3 subtractions in both cases, as one would expect.
That would hint that a SoA system using more pointer arithmetic and less pointer storage could gain a bit of performance. I will try another flavor, where we store offsets instead of pointers.
Another direction of investigation is to integrate support for Eigen structures as columns: the current example uses the fact that we declared x, y and z next to each other to make a vector out of the 3. This should be automated to avoid code repetition and caveats (matrices would be impractical to declare in the current fashion).
Finally, this example uses a separate data block, and descriptor class pointing to it. We could merge the two together, further limiting the passing of big parameters to the kernel (from a structure to a single pointer). (I will leave this aside for the moment).
let me note that in my example https://godbolt.org/z/K61q8zMW5 CUDA loads the input through non-coherent memory (faster and separate from the standard cache)
you may also note that in your PTX there are 12 loads while in Eigen just 6 (as there should be)
another unrelated point we may wish to report to NVidia is that clang for instance generates mul neg fma instead of mul mul sub https://godbolt.org/z/xjbMsoed7 but is not able to load from non-coherent memory...
Thanks, @VinInn , the double loading from global memory (with a cache hit on the second, but still...) was a significant difference. Adding the restrict attributes enabled both the use of the non-coherent cache and the removal of the duplicate load. This leads to a marginal gain in both cases (~50ns), but the difference remains (now 6.90us vs 6.25us).
another unrelated point we may wish to report to NVidia is that clang for instance generates mul neg fma instead of mul mul sub https://godbolt.org/z/xjbMsoed7 but is not able to load from non-coherent memory...
Indeed. There are also integer multiplications by constant powers of 2 instead of shifts in the clang-compiled code.
Also, nvcc uses mad on integers (but not fma). It could be a deliberate choice too, though, avoiding the extra neg for performance at the detriment of precision.
The offset based version of the SoA (as opposed to pointer based) did allow the reduction of address conversions, but at the detriment of many things (more parameters, loss of non-coherent cache use, re-introduction of multiple fetches of the same variables from global memory). It is version 8 in the same repository. The new PTX for the indirectCrossProduct kernel is here.
Here are 2 measurements of the kernel timings (event based for a loop of 20 kernels) and via nvprof for both versions:
$ nvprof ./test_v7_cuda
Running ..==784533== NVPROF is profiling process 784533, command: ./test_v7_cuda
...indirectCrossProductSoA time=160.352 us.
.eigenCrossProductSoA time=156.032 us.
Type Time(%) Time Calls Avg Min Max Name
21.92% 139.14us 21 6.6250us 6.3360us 7.3920us indirectCrossProduct
20.22% 128.35us 20 6.4170us 6.1120us 6.8160us eigenCrossProduct
$ nvprof ./test_v8_cuda
Running ..==784574== NVPROF is profiling process 784574, command: ./test_v8_cuda
...indirectCrossProductSoA time=164.832 us.
.eigenCrossProductSoA time=155.84 us.
Type Time(%) Time Calls Avg Min Max Name
22.46% 143.42us 21 6.8290us 6.6240us 7.2960us indirectCrossProduct
19.89% 127.01us 20 6.3500us 6.1440us 6.6240us eigenCrossProduct
These numbers are slightly different from the previous ones, which I got via an nvprof dump viewed in nvvp (but this new way is faster).
With more statistics, version 7 is actually quite close to the Eigen one. Since the Eigen version is slightly advantaged (we do some pointer computations host side), this is not that far off.
One positive outcome of this exercise is that the SoA layout could be changed without modifying the client code.
The next and more promising step is to add support for Eigen fixed size types to the pointer based version (v7).
I do not understand why more parameters: if you have any "struct", say double x,y,z; int n,m; bool k;, the offsets are easy to compute at compile time (a double is 2 ints or 8 bools); no difference w/r/t say bool[3*8+2*4+1]; so it is just
auto y(int i) const { return *((double const*)(storage) + (1*1)*stride + i); }
auto n(int i) const { return *((int const*)(storage) + (3*2)*stride + i); }
auto k(int i) const { return *((bool const*)(storage) + (3*8+2*4)*stride + i); }
if I got the parentheses right
This is all with runtime defined size. v7 keeps one pointer per column, v8 has one global base pointer + one offset per column. So in v8, we get one extra parameter per SoA to load (the base pointer) on top of the offsets for any member used. In v7, we directly load the pointers.
The SoA structure is fully defined by a pointer, the number of elements (and an optional byte alignment parameter), so we could also pass just the first 2 and re-compute the SoA kernel side too.
Here is the generated v7 constructor, where the members x_, y_, etc. are pointers:
SoA(std::byte *mem, size_t nElements, size_t byteAlignment = defaultAlignment) :
    mem_(mem), nElements_(nElements), byteAlignment_(byteAlignment) {
  auto curMem = mem_;
  x_ = reinterpret_cast<double *>(curMem);
  curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
  y_ = reinterpret_cast<double *>(curMem);
  curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
  z_ = reinterpret_cast<double *>(curMem);
  curMem += (((nElements_ * sizeof(double) - 1) / byteAlignment_) + 1) * byteAlignment_;
  colour_ = reinterpret_cast<uint16_t *>(curMem);
  curMem += (((nElements_ * sizeof(uint16_t) - 1) / byteAlignment_) + 1) * byteAlignment_;
  value_ = reinterpret_cast<int32_t *>(curMem);
  curMem += (((nElements_ * sizeof(int32_t) - 1) / byteAlignment_) + 1) * byteAlignment_;
  py_ = reinterpret_cast<double **>(curMem);
  curMem += (((nElements_ * sizeof(double *) - 1) / byteAlignment_) + 1) * byteAlignment_;
  description_ = reinterpret_cast<const char **>(curMem);
  curMem += (((sizeof(const char *) - 1) / byteAlignment_) + 1) * byteAlignment_;
  if (mem_ + computeDataSize(nElements_, byteAlignment_) != curMem)
    throw std::out_of_range("In SoA::SoA: unexpected end pointer.");
}
In all examples, this computation is done host side, but we could also push it device side and re-construct the SoA descriptor there. The mem buffer is pre-allocated by the user, with a size automatically computed by the static member function computeDataSize().
@VinInn , yes. I used byte alignment, while you have index alignment (which ensures byte alignment and simplicity, at the cost of a bit more padding).
I will try that too, it's indeed going to solve the parameter problem.
This will require integrating the size of column members (yColumnPosition = xColumnPosition + sizeof(yElement)), which might be tricky in the current macro-based generation (to be investigated). The macros are written once and for all and work fine, but they are not easy to debug.
Which brings me to this question: what about using an external code generator (like a Python script, or some pre-existing tool)? This would not change the principle of what we have currently: a description of the structure turned into code by an engine.
Right now the description is:
declare_SoA_template(SoA,
// predefined static scalars
// size_t size;
// size_t alignment;
// columns: one value per element
SoA_column(double, x),
SoA_column(double, y),
SoA_column(double, z),
SoA_column(uint16_t, colour),
SoA_column(int32_t, value),
SoA_column(double *, py),
// scalars: one value for the whole structure
SoA_scalar(const char *, description)
);
The engine is soa_vX.h, and generated code is in the preprocessed output.
Does it sound like a good or a bad idea to use an external tool? That would require some support from the build system, obviously.
I personally dislike both macros and code generators (not that complex template manipulation is better, still). The questions about code generation are (the usual): 1) scram integration 2) git integration 3) lxr/dxr/doxygen integration 4) debugging 5) platform dependency 6) ...
Some random comments, as I'm not fully following the discussion.
I would prefer to avoid an external code generator if C++ is enough. About macros vs templates I have mixed feelings: templates are more integrated in the language and (more) type safe than macros, but not flexible enough for a clean declaration syntax. Macros turned out to be a bit lacking as well, but less so.
About individual pointers vs. base pointer + deltas: I assumed individual pointers would be better based on some comments Vincenzo made some time ago - but I don't have any evidence either way. An advantage would be that they would allow slicing a SoA (like the EDM approach), a disadvantage is the amount of parameters being passed and/or an extra indirection.
For deltas I would use byte addressing, because I expect SoAs to have mixed types.
My summary: under the assumptions that 1) optimizing the memory size beyond what the allocation for the AoS would do is not the goal (and I remind that we allocate memory by powers of 2, so in practice it will be very rare to allocate twice as much memory as needed), 2) PR review will catch badly padded "structs" (only a memory optimization, no issue with delta computation), 3) anybody able to code 2K lines of CUDA should also be able to manage a pointer and some offsets, 4) the byte alignment can be decided at compile time (and most probably once for all),
something like this may work and seem to generate optimal access code (comparable to Eigen)
a) declare the struct as for AoS: say struct S { double x, y, z; int n, m; bool k; };
b) allocate storage with say auto storage = malloc(sizeof(S)*stride); (AoS and SoA occupy the same memory)
b2) where stride is nElements rounded up for byte alignment
c) declare the SoA accessors as
auto y(int i) const { return *((double const*)(storage + offsetof(S,y)*stride) + i); }
auto n(int i) const { return *((int const*)(storage + offsetof(S,n)*stride) + i); }
auto k(int i) const { return *((bool const*)(storage + offsetof(S,k)*stride) + i); }
d) pass storage and stride to the kernel (and nElements — even if stride could be computed from nElements on the fly, one may reuse the same storage for different nElements, or wish to access only a limited range).
How this is generated is up to you.
The version with embedded Eigen structures now works and passes the tests. I made a few variants, including trying to push the strict minimum (data pointer + size) and re-constructing the SoA descriptor device side (the "local object" variant). The resulting PTX is here, while the test timings are:
Type Time(%) Time Calls Avg Min Max Name
8.92% 139.23us 21 6.6300us 6.4000us 6.8480us indirectCrossProductSoA
8.45% 131.78us 20 6.5880us 6.4320us 7.1040us embeddedCrossProductLocalObjectSoA
8.14% 127.07us 20 6.3530us 6.1120us 6.5920us eigenCrossProductSoA
7.89% 123.14us 20 6.1560us 6.0480us 6.3360us embeddedCrossProductSoA
Regarding the PTX, it is pretty well understood, except for the use of the non-coherent cache, which I did not manage to get for the embedded (Eigen vector) versions. There are no duplicate loads, and the parameter loads are as one would expect, matching the data used in the kernel. The extreme case of the locally constructed object is not the fastest, in spite of loading only 2 parameters; the extra integer computations probably explain it. The embedded cross product is the fastest, in spite of not using the NC cache. Saving the construction of the Eigen vectors might explain this (I did not count the integer operations precisely).
The embedding code is generic, so we could throw any fixed sized Eigen element into the SoA. The macro based declaration is like this:
declare_SoA_template(SoA,
// columns: one value per element
SoA_column(double, x),
SoA_column(double, y),
SoA_column(double, z),
SoA_eigenColumn(Eigen::Vector3d, a),
SoA_eigenColumn(Eigen::Vector3d, b),
SoA_eigenColumn(Eigen::Vector3d, r),
SoA_column(uint16_t, colour),
SoA_column(int32_t, value),
SoA_column(double *, py),
// scalars: one value for the whole structure
SoA_scalar(const char *, description)
);
and the SoA is constructed and used like this (example from the locally constructed descriptor in the kernel, data block is already populated):
bool deviceConstructor = true;
testSoA::SoA soa(deviceConstructor, data, nElements);
soa[i].r() = soa[i].a().cross(soa[i].b());
I did not manage to get the cast trick to work here, so we might end up with a slightly heavier syntax, with parentheses to access members (for uniformity). The boolean was necessary to get a device-only constructor; it is not used.
@VinInn , I'm finally getting back to the SoA business.
Since I'm still catching up, could you explain your example ?
and here is a little example that compiles and runs: https://github.com/VinInn/ctest/blob/master/eigen/dynSoa.cpp
which also demonstrates that there is no problem in assigning to a temporary at least with Eigen.
In particular, where is the assignment to a temporary ?
After sitting with Eric and looking at his approach, I now understand that the problem is not in "assigning to a temporary", but in "passing a non-const reference to a temporary".
In fact, we don't need to do that: the element we have in our SoA approach already behaves like a reference, so we can just pass it around by value.
We clearly need a good SoA abstraction to work with.
Some feature considerations for explicit memory
cudaMemcpyAsync()
With unified memory the situation would of course be simpler with