Closed: IvanaGyro closed this 1 week ago
Is the metafunction for a specialization just to implement `is_complex_v<T>`? It is a fairly complicated way to determine whether a type is complex or not, and difficult to customize. E.g., if you had a custom complex type that you wanted to use with the library, it won't work well with this definition. And how about `is_complex_v<const std::complex<double>>`? It will be false in that case.

I can't figure out a way to remove `const` and `volatile` without adding at least one more layer of class. I agree `struct is_specialization` is not necessary for implementing `is_complex<T>`. I just wanted to put that general helper class there because I felt it was cool.
```cpp
template <typename T>
struct is_complex<const T> : is_complex<T> {};

template <typename T>
struct is_complex<volatile T> : is_complex<T> {};

template <typename T>
struct is_complex<const volatile T> : is_complex<T> {};
```
Generally you should remove cv-qualifiers, references, etc. when calling the metafunction, rather than adding complexity to the metafunction to remove them. So if `T` might be `const`, or might be a reference type, then use `is_complex<std::decay_t<T>>`.

Or add another helper:

```cpp
template <typename T>
struct is_complex_qualified : is_complex<std::decay_t<T>> {};
```
I have a solution that looks quite nice:

```cpp
#include <complex>
#include <type_traits>

// Primary template, defaults to false
template <typename T, typename Enable = void>
struct is_complex : std::false_type {};

// If T is cv-qualified or a reference, forward to std::decay_t<T>
template <typename T>
struct is_complex<T, std::enable_if_t<!std::is_same_v<std::decay_t<T>, T>>>
  : is_complex<std::decay_t<T>> {};

// Specialization for std::complex<T>
template <typename T>
struct is_complex<std::complex<T>> : std::true_type {};

// Helper variable template
template <typename T>
inline constexpr bool is_complex_v = is_complex<T>::value;

// Tests with std::complex
static_assert(is_complex_v<std::complex<int>>, "std::complex<int> should be complex");
static_assert(is_complex_v<const std::complex<float>>, "const std::complex<float> should be complex");
static_assert(is_complex_v<volatile std::complex<double>>, "volatile std::complex<double> should be complex");
static_assert(is_complex_v<const volatile std::complex<long double>&>, "cv-ref std::complex<long double> should be complex");
static_assert(is_complex_v<std::complex<int>&&>, "rvalue-ref std::complex<int> should be complex");
static_assert(!is_complex_v<int>, "int should not be complex");
static_assert(!is_complex_v<float>, "float should not be complex");

int main() {}
```
godbolt link: https://godbolt.org/z/1xr5eEW1b
Customizing it for some new complex-like type is just a matter of adding a new specialization, `is_complex<my_complex<T>> : std::true_type`.
I will change #504 and this PR.
Attention: Patch coverage is 61.26374% with 141 lines in your changes missing coverage. Please review.

Project coverage is 16.74%. Comparing base (eed01c2) to head (e537d60). Report is 14 commits behind head on dev-master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/backend/StorageImplementation.cpp | 61.43% | 94 Missing and 46 partials :warning: |
| include/backend/Storage.hpp | 0.00% | 0 Missing and 1 partial :warning: |
Actually, yeah, why didn't we use memset instead? Because it's an unsafe operation for complex?
On Tue, Nov 19, 2024, 06:52 Ian McCulloch wrote:

Commented on this pull request, in src/backend/utils_internal_cpu/Fill_cpu.hpp (https://github.com/Cytnx-dev/Cytnx/pull/508#discussion_r1848143640):
```diff
   * enabled.
   *
   * @tparam DType the data type of the elements in the range
   *
   * @param first the beginning of the range
   * @param value the value to be assigned
   * @param count the number of elements to modify
   */
  template <typename DType>
  void FillCpu(void *first, const DType &value, cytnx_uint64 count) {
    DType *typed_first = reinterpret_cast<DType *>(first);
+#ifdef UNI_OMP
+  #pragma omp parallel for schedule(dynamic)
+#endif
    for (cytnx_uint64 i = 0; i < count; i++) {
      *typed_first++ = value;
```
Yes, dynamic scheduling is only useful when the execution time of each iteration in the loop will vary, and you want to avoid the situation where some threads have finished their work, but there is one thread still running (and therefore blocking everything else). But what kills the performance of the dynamic scheduling benchmark is that the default chunk size (the number of elements of the array processed at a time) is 1, for dynamic scheduling. So each thread is only processing one element at a time. If you specify a sensible chunk size, say 1000000, then dynamic scheduling will be basically indistinguishable from static scheduling here. But of course there is no reason to use dynamic scheduling anyway.
I am slightly surprised that you see a benefit from OpenMP here at all - setting an array of memory to some value is going to be purely memory bandwidth limited. A single thread will easily saturate the memory bandwidth on a modern CPU. I modified the benchmark to add a non-OpenMP version - I also needed the DoNotOptimize() function because at -O3 the compiler was completely optimizing away the loop! This shows that if the array size is small enough to fit inside the L3 cache (about 16 MB typically), then OpenMP tops out at a bit under a factor 2 faster, so probably useful. But that is not the common case - it depends on having the array already in the cache. For array sizes larger than the L3 cache, there is about 30% improvement with 4 threads - using more gets slower again, presumably the threads are fighting each other for memory bandwidth. I think an optimized single-thread copy, that uses SIMD instructions, would be best here. Surprisingly, I wasn't able to get gcc to make a SIMD-vectorized loop, not sure what I was doing wrong.
In summary, static OpenMP with a small number of threads is slightly useful, but I suspect a SIMD-optimized version would be better still. I can't imagine any situation where the benefit from using OpenMP here is noticeable in realistic code. Initializing memory to 0 is a special case that is worth treating separately, and memset will likely be faster than any OpenMP solution.
The benchmarks:

| Array size | Threads | CPU (ms) | omp-static (ms) | omp-dynamic (ms) |
|---|---|---|---|---|
| 10'000'000 (< L3 cache) | 1 | 2.40328 | 2.41689 | 2.38274 |
| 10'000'000 | 2 | 2.53097 | 2.06814 | 2.05014 |
| 10'000'000 | 4 | 2.40613 | 1.67919 | 1.59296 |
| 100'000'000 (> L3 cache) | 1 | 32.6479 | 31.8224 | 33.9212 |
| 100'000'000 | 2 | 33.2148 | 29.2628 | 28.9037 |
| 100'000'000 | 4 | 33.7245 | 21.7398 | 22.4597 |
The code:
```cpp
// compile command: g++ -std=c++17 -fopenmp -O3 -o fill fill.cpp && ./fill 1000000 4
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <omp.h>

using namespace std;

// from https://github.com/google/benchmark/blob/62a321d6dc377e0ba9c712b6a8d64360616de564/include/benchmark/benchmark.h#L525
template <class Tp>
inline void DoNotOptimize(Tp const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

template <typename DType>
void FillCpu(void* first, DType value, size_t count) {
  DType* typed_first = reinterpret_cast<DType*>(first);
  for (int i = 0; i < count; ++i) {
    typed_first[i] = value;
  }
}

template <typename DType>
void FillCpuStatic(void* first, const DType& value, size_t count) {
  DType* typed_first = reinterpret_cast<DType*>(first);
#pragma omp parallel for schedule(static)
  for (int i = 0; i < count; ++i) {
    typed_first[i] = value;
  }
}

template <typename DType>
void FillCpuDynamic(void* first, const DType& value, size_t count) {
  DType* typed_first = reinterpret_cast<DType*>(first);
#pragma omp parallel for schedule(dynamic, 1000000)
  for (int i = 0; i < count; ++i) {
    typed_first[i] = value;
  }
}

int main(int argc, char* argv[]) {
  if (argc < 3) {
    cout << "expected: fill <count> <nthreads>\n";
    return 1;
  }
  int count = atoi(argv[1]);
  int nthreads = atoi(argv[2]);
  omp_set_num_threads(nthreads);
  int num_iterations = 10'000'000'000ll / count;
  cout << "Using " << num_iterations << " iterations." << endl;
  int* ptr = reinterpret_cast<int*>(malloc(sizeof(int) * count));
  int value = 10;
  {
    auto start = chrono::high_resolution_clock::now();
    for (int iter = 0; iter < num_iterations; ++iter) {
      FillCpu(reinterpret_cast<void*>(ptr), value, count);
      DoNotOptimize(ptr[0]);
    }
    auto end = chrono::high_resolution_clock::now();
    const std::chrono::duration<double> total_time = end - start;
    cout << "Total time for FillCpu: " << (total_time.count() * 1000 / num_iterations)
         << " milliseconds" << endl;
  }
  {
    auto start = chrono::high_resolution_clock::now();
    for (int iter = 0; iter < num_iterations; ++iter) {
      FillCpuStatic(reinterpret_cast<void*>(ptr), value, count);
      DoNotOptimize(ptr[0]);
    }
    auto end = chrono::high_resolution_clock::now();
    const std::chrono::duration<double> total_time = end - start;
    cout << "Total time for FillCpuStatic: " << (total_time.count() * 1000 / num_iterations)
         << " milliseconds" << endl;
  }
  {
    auto start = chrono::high_resolution_clock::now();
    for (int iter = 0; iter < num_iterations; ++iter) {
      FillCpuDynamic(reinterpret_cast<void*>(ptr), value, count);
      DoNotOptimize(ptr[0]);
    }
    auto end = chrono::high_resolution_clock::now();
    const std::chrono::duration<double> total_time = end - start;
    cout << "Total time for FillCpuDynamic: " << (total_time.count() * 1000 / num_iterations)
         << " milliseconds" << endl;
  }
}
```
Or just use `std::fill()`. On gcc versions up to 12, it replaces `std::fill` with memset anyway:
```cpp
void Zero(std::complex<double>* arr, std::size_t size) {
  std::fill(arr, arr + size, std::complex<double>{});
}
```
It doesn't on gcc 13+, which is interesting; presumably the code it generates is even more efficient than memset? (Or perhaps it is an inline version of what memset is doing anyway?) https://godbolt.org/z/eece1banY (change the compiler to gcc 12.4 to see the difference!)
`std::complex` satisfies TriviallyCopyable (guaranteed since C++23, but in practice it was always the case). So these games with memset etc. are OK. But they are also not really needed; the compiler should be able to generate good code.
Consolidate all storage classes for each data type into a single template class.
This PR addresses the task outlined in #500 and builds upon #504 and #507.