ericniebler / range-v3

Range library for C++14/17/20, basis for C++20's std::ranges
Other
4.08k stars 440 forks source link

ranges::to is significantly slower than direct construction #1337

Open cjdb opened 4 years ago

cjdb commented 4 years ago

I expected these two implementations to have equivalent or relatively comparable performance, but Google Benchmark tells a different story when run over 476k words. Is this a perf bug or have I done something horribly wrong in making these two algorithms equivalent?

auto palindrome_range_common(std::vector<std::string> const& words)
{
    auto is_palindrome = [](auto const& word) {
        return not ranges::empty(word)
           and ranges::equal(word, word | views::reverse);
    };

    auto palindrome_excalim = [&is_palindrome](auto const& word) {
        return is_palindrome(word) ? word + '!' : word;
    };

    auto result = words
                | views::transform(palindrome_excalim)
                | views::common;

    return std::vector<std::string>{
        ranges::begin(result),
        ranges::end(result)
    };  
}

auto palindrome_range_to(std::vector<std::string> const& words)
{
    auto is_palindrome = [](auto const& word) {
        return not ranges::empty(word)
           and ranges::equal(word, word | views::reverse);
    };

    auto palindrome_excalim = [&is_palindrome](auto const& word) {
        return is_palindrome(word) ? word + '!' : word;
    };

    return words
         | views::transform(palindrome_excalim)
         | ranges::to<std::vector>; // range-v3 extension   
}
Running ./benchmark-palindromes-o2
Run on (2 X 2684.42 MHz CPU s)
CPU Caches:
  L1 Data 32K (x1)
  L1 Instruction 32K (x1)
  L2 Unified 256K (x1)
  L3 Unified 4096K (x1)
Load Average: 0.69, 0.63, 0.42
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
benchmark_solutions/handrolled     12443168 ns     12439882 ns         7113    # base
benchmark_solutions/algorithm      25840577 ns     25831195 ns         3255
benchmark_solutions/range_common   10816578 ns     10813486 ns         7474
benchmark_solutions/range_to       23025614 ns     23020478 ns         3303
Running ./benchmark-palindromes-o3
Run on (2 X 2684.42 MHz CPU s)
CPU Caches:
  L1 Data 32K (x1)
  L1 Instruction 32K (x1)
  L2 Unified 256K (x1)
  L3 Unified 4096K (x1)
Load Average: 0.05, 0.36, 0.47
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
benchmark_solutions/handrolled      9594576 ns      9591506 ns         8880    # base
benchmark_solutions/algorithm      21961494 ns     21955033 ns         3845
benchmark_solutions/range_common    9305111 ns      9303263 ns         9115
benchmark_solutions/range_to       21293927 ns     21286420 ns         3954
ericniebler commented 4 years ago

This is what I see on my machine:

2019-12-05 11:49:35
Running perf/range_conversion
Run on (8 X 2900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 1.78, 1.95, 2.22
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
Words/RangeCommon  201502562 ns    200626714 ns            7
Words/RangeTo      241609633 ns    240499000 ns            8

This is at -O3. The effect is not as pronounced for me, but it is most definitely real. Interesting.