Cytnx-dev / Cytnx

Project Cytnx, A Cross-section of Python & C++, Tensor network library
Apache License 2.0

Refactor `Storage` class #508

Closed: IvanaGyro closed this 1 week ago

IvanaGyro commented 2 weeks ago

Consolidate all storage classes for each data type into a single template class.

This PR addresses the task outlined in #500 and builds upon #504 and #507.

IvanaGyro commented 2 weeks ago

> Is the metafunction for a specialization just to implement is_complex_v<T>?

> It is a fairly complicated way to determine whether a type is complex or not, and it is difficult to customize. E.g. if you had a custom complex type that you wanted to use with the library, it won't work well with this definition. How about …

is_complex_v<const std::complex<double>> will be false in that case. I can't figure out a way to remove const and volatile without adding at least one more layer of class. I agree that struct is_specialization is not necessary for implementing is_complex<T>; I just wanted to put that general helper class there and felt it was cool.
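For illustration, a minimal sketch of a specialization-only trait (an assumed stand-in for the definition under discussion) showing why the cv-qualified query fails:

#include <complex>
#include <type_traits>

// primary template: false by default
template <typename T>
struct is_complex : std::false_type {};

// specialization for std::complex<T>
template <typename T>
struct is_complex<std::complex<T>> : std::true_type {};

template <typename T>
inline constexpr bool is_complex_v = is_complex<T>::value;

static_assert(is_complex_v<std::complex<double>>);
static_assert(!is_complex_v<const std::complex<double>>);  // const-qualified type matches only the primary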

ianmccul commented 2 weeks ago
template <typename T>
struct is_complex<const T> : is_complex<T> {};

template <typename T>
struct is_complex<volatile T> : is_complex<T> {};

template <typename T>
struct is_complex<const volatile T> : is_complex<T> {};

Generally you probably should remove cv-qualifiers, references etc when calling the metafunction, rather than adding complexity to the metafunction to remove them. So if T might be const, or might be a reference type, then use is_complex<std::decay_t<T>>.
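For example, with just the minimal trait sketched above (without the cv specializations), decaying at the call site is enough:

using T = const std::complex<double>&;
static_assert(!is_complex<T>::value);               // the cv-ref type matches only the primary template
static_assert(is_complex<std::decay_t<T>>::value);  // decay first, then query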

ianmccul commented 2 weeks ago

Or add another helper,

template <typename T>
struct is_complex_qualified : is_complex<std::decay_t<T>> {};
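For example, with this helper (assuming the minimal trait above):

static_assert(is_complex_qualified<const std::complex<double>&>::value);
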
ianmccul commented 2 weeks ago

I have a solution that looks quite nice:

#include <type_traits>
#include <complex>

// Primary template, default to false
template <typename T, typename Enable = void>
struct is_complex : std::false_type {};

// if T is cv-qualified then forward to std::decay_t<T>
template <typename T>
struct is_complex<T, std::enable_if_t<!std::is_same_v<std::decay_t<T>, T>>> : is_complex<std::decay_t<T>> {};

// Specialization for std::complex<T>
template <typename T>
struct is_complex<std::complex<T>> : std::true_type {};

// Helper variable template
template <typename T>
inline constexpr bool is_complex_v = is_complex<T>::value;

// Tests with std::complex
static_assert(is_complex_v<std::complex<int>>, "std::complex<int> should be complex");                         // true
static_assert(is_complex_v<const std::complex<float>>, "const std::complex<float> should be complex");         // true
static_assert(is_complex_v<volatile std::complex<double>>, "volatile std::complex<double> should be complex"); // true
static_assert(is_complex_v<const volatile std::complex<long double>&>, "cv-ref std::complex<long double>");    // true
static_assert(is_complex_v<std::complex<int>&&>, "rvalue ref std::complex<int> should be complex");            // true
static_assert(!is_complex_v<int>, "int should not be complex");                                                // false
static_assert(!is_complex_v<float>, "float should not be complex");                                            // false

int main() {
}

godbolt link: https://godbolt.org/z/1xr5eEW1b

Customizing it with some new complex-like type is just a matter of adding a new specialization: is_complex<my_complex<T>> : std::true_type.
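For instance, a sketch with a hypothetical user-defined my_complex type, on top of the trait above:

// hypothetical custom complex type, for illustration only
template <typename T>
struct my_complex {
  T re, im;
};

// opting in is a single extra specialization
template <typename T>
struct is_complex<my_complex<T>> : std::true_type {};

static_assert(is_complex_v<my_complex<float>>);
static_assert(is_complex_v<const my_complex<float>&>);  // cv/ref forwarding from the primary still applies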

IvanaGyro commented 2 weeks ago

I will change #504 and this PR.

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 61.26374% with 141 lines in your changes missing coverage. Please review.

Project coverage is 16.74%. Comparing base (eed01c2) to head (e537d60). Report is 14 commits behind head on dev-master.

Files with missing lines               Patch %   Lines
src/backend/StorageImplementation.cpp  61.43%    94 Missing and 46 partials :warning:
include/backend/Storage.hpp            0.00%     0 Missing and 1 partial :warning:
Additional details and impacted files

```diff
@@             Coverage Diff              @@
##           dev-master     #508      +/-   ##
==============================================
- Coverage       18.00%   16.74%    -1.27%
==============================================
  Files             221      211       -10
  Lines           52808    49014     -3794
  Branches        19698    18712      -986
==============================================
- Hits             9508     8205     -1303
+ Misses          38778    36660     -2118
+ Partials         4522     4149      -373
```

:umbrella: View full report in Codecov by Sentry.



kaihsin commented 6 days ago

Actually yeah, why didn't we use memset instead? Because of unsafe ops for complex?

On Tue, Nov 19, 2024, 06:52 Ian McCulloch wrote:

ianmccul commented on this pull request.

In src/backend/utils_internal_cpu/Fill_cpu.hpp (https://github.com/Cytnx-dev/Cytnx/pull/508#discussion_r1848143640):

   * enabled.
   *
   * @tparam DType the data type of the elements in the range
   *
   * @param first the beginning of the range
   * @param value the value to be assigned
   * @param count the number of elements to modify
   */
  template <typename DType>
  void FillCpu(void *first, const DType &value, cytnx_uint64 count) {
    DType *typed_first = reinterpret_cast<DType *>(first);
#ifdef UNI_OMP
  #pragma omp parallel for schedule(dynamic)
#endif
    for (cytnx_uint64 i = 0; i < count; i++) {
      *typed_first++ = value;
Yes, dynamic scheduling is only useful when the execution time of each iteration of the loop varies, and you want to avoid the situation where some threads have finished their work but one thread is still running (and therefore blocking everything else). But what kills the performance of the dynamic-scheduling benchmark is that, for dynamic scheduling, the default chunk size (the number of elements of the array processed at a time) is 1. So each thread is only processing one element at a time. If you specify a sensible chunk size, say 1000000, then dynamic scheduling will be basically indistinguishable from static scheduling here. But of course there is no reason to use dynamic scheduling anyway.
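Concretely (a minimal sketch based on the loop quoted above), the chunk size is the optional second argument to the schedule clause:

#pragma omp parallel for schedule(dynamic, 1000000)
for (cytnx_uint64 i = 0; i < count; i++) {
  typed_first[i] = value;  // each thread now claims 1000000 elements at a time, not 1
}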

I am slightly surprised that you see a benefit from OpenMP here at all - setting an array of memory to some value is going to be purely memory-bandwidth limited. A single thread will easily saturate the memory bandwidth on a modern CPU. I modified the benchmark to add a non-OpenMP version - I also needed the DoNotOptimize() function, because at -O3 the compiler was completely optimizing away the loop! This shows that if the array size is small enough to fit inside the L3 cache (about 16 MB typically), then OpenMP tops out at a bit under a factor of 2 faster, so probably useful. But that is not the common case - it depends on the array already being in the cache. For array sizes larger than the L3 cache, there is about a 30% improvement with 4 threads - using more gets slower again, presumably because the threads are fighting each other for memory bandwidth. I think an optimized single-thread copy that uses SIMD instructions would be best here. Surprisingly, I wasn't able to get gcc to make a SIMD-vectorized loop; not sure what I was doing wrong.

In summary, static omp with a small number of threads is slightly useful, but I suspect a SIMD-optimized version would be better still. I can't imagine any situation where the benefit from using OpenMP here is noticeable in a realistic code. Initializing memory to 0 is a special case that is worth treating separately, and memset will likely be faster than any OpenMP solution.
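For reference, a minimal sketch of that special case (FillZeroCpu is a hypothetical helper, not part of this PR), assuming an all-zero byte pattern represents zero for the element type:

#include <cstring>
#include <type_traits>

template <typename DType>
void FillZeroCpu(void *first, std::size_t count) {
  // memset is only well-defined here for trivially copyable element types
  static_assert(std::is_trivially_copyable_v<DType>);
  std::memset(first, 0, count * sizeof(DType));
}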

The benchmarks:

Array size                Threads   CPU (ms)   omp-static (ms)   omp-dynamic (ms)
10'000'000 (< L3 cache)   1         2.40328    2.41689           2.38274
10'000'000                2         2.53097    2.06814           2.05014
10'000'000                4         2.40613    1.67919           1.59296
100'000'000 (> L3 cache)  1         32.6479    31.8224           33.9212
100'000'000               2         33.2148    29.2628           28.9037
100'000'000               4         33.7245    21.7398           22.4597

The code:

// compile command: g++ -std=c++17 -fopenmp -O3 -o fill fill.cpp && ./fill 1000000 4

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <omp.h>

using namespace std;

// from https://github.com/google/benchmark/blob/62a321d6dc377e0ba9c712b6a8d64360616de564/include/benchmark/benchmark.h#L525
template <class Tp>
inline void DoNotOptimize(Tp const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

template <typename DType>
void FillCpu(void *first, const DType &value, size_t count) {
  DType *typed_first = reinterpret_cast<DType *>(first);
  for (int i = 0; i < count; ++i) {
    typed_first[i] = value;
  }
}

template <typename DType>
void FillCpuStatic(void *first, const DType &value, size_t count) {
  DType *typed_first = reinterpret_cast<DType *>(first);
#pragma omp parallel for schedule(static)
  for (int i = 0; i < count; ++i) {
    typed_first[i] = value;
  }
}

template <typename DType>
void FillCpuDynamic(void *first, const DType &value, size_t count) {
  DType *typed_first = reinterpret_cast<DType *>(first);
#pragma omp parallel for schedule(dynamic, 1000000)
  for (int i = 0; i < count; ++i) {
    typed_first[i] = value;
  }
}

int main(int argc, char* argv[]) {
  if (argc < 3) {
    cout << "expected: fill <count> <nthreads>\n";
    return 1;
  }
  int count = atoi(argv[1]);
  int nthreads = atoi(argv[2]);
  omp_set_num_threads(nthreads);
  int num_iterations = 10'000'000'000ll / count;
  cout << "Using " << num_iterations << " iterations." << endl;
  int *ptr = reinterpret_cast<int *>(malloc(sizeof(int) * count));
  int value = 10;

  {
    auto start = chrono::high_resolution_clock::now();
    for (int iter = 0; iter < num_iterations; ++iter) {
      FillCpu(reinterpret_cast<void *>(ptr), value, count);
      DoNotOptimize(ptr[0]);
    }
    auto end = chrono::high_resolution_clock::now();
    const std::chrono::duration<double> total_time = end - start;
    cout << "Total time for FillCpu: " << (total_time.count() * 1000 / num_iterations) << " milliseconds" << endl;
  }

  {
    auto start = chrono::high_resolution_clock::now();
    for (int iter = 0; iter < num_iterations; ++iter) {
      FillCpuStatic(reinterpret_cast<void *>(ptr), value, count);
      DoNotOptimize(ptr[0]);
    }
    auto end = chrono::high_resolution_clock::now();
    const std::chrono::duration<double> total_time = end - start;
    cout << "Total time for FillCpuStatic: " << (total_time.count() * 1000 / num_iterations) << " milliseconds" << endl;
  }

  {
    auto start = chrono::high_resolution_clock::now();
    for (int iter = 0; iter < num_iterations; ++iter) {
      FillCpuDynamic(reinterpret_cast<void *>(ptr), value, count);
      DoNotOptimize(ptr[0]);
    }
    auto end = chrono::high_resolution_clock::now();
    const std::chrono::duration<double> total_time = end - start;
    cout << "Total time for FillCpuDynamic: " << (total_time.count() * 1000 / num_iterations) << " milliseconds" << endl;
  }
}


ianmccul commented 6 days ago

Or just use std::fill(). On gcc versions up to 12, it replaces std::fill with memset anyway:

void Zero(std::complex<double>* arr, std::size_t size) {
    std::fill(arr, arr + size, std::complex<double>{});
}

It doesn't on gcc 13+, which is interesting; presumably the code it generates is even more efficient than memset (or perhaps it is an inline version of what memset is doing anyway?). https://godbolt.org/z/eece1banY (change the compiler to gcc 12.4 to see the difference!)

std::complex satisfies TriviallyCopyable (guaranteed since C++23, but in practice it was always the case), so these games with memset etc. are OK. But they are also not really needed; the compiler should be able to generate good code.
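For instance, a sketch of the memset variant (ZeroMemset is a hypothetical name, not from the thread):

#include <complex>
#include <cstring>
#include <type_traits>

static_assert(std::is_trivially_copyable_v<std::complex<double>>);

// byte-wise zeroing; same observable result as the std::fill version above,
// since the all-zero bit pattern represents 0.0 + 0.0i
void ZeroMemset(std::complex<double>* arr, std::size_t size) {
  std::memset(arr, 0, size * sizeof(std::complex<double>));
}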