only evaluate values of non-type template arguments once

Quuxplusone commented 12 years ago


Bugzilla Link	PR12553
Status	NEW
Importance	P enhancement
Reported by	AlisdairM (public@alisdairm.net)
Reported on	2012-04-13 20:26:11 -0700
Last modified on	2020-05-25 06:46:31 -0700
Version	trunk
Hardware	Macintosh MacOS X
CC	alisdairm@me.com, anton@korobeynikov.info, dgregor@apple.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk
Fixed by commit(s)
Attachments	`evolved.cpp` (8049 bytes, application/octet-stream)
Blocks
Blocked by
See also

Created attachment 8387
metaprogramming torture test to compute prime numbers

The attached program is one of my recreational C++11 programs, it tries to
compute a large array of prime numbers entirely at compile time, and has been
optimized to produce the largest number possible in the shortest compile-time,
minimizing compiler resources.

As currently configured, it compiles on gcc 4.7 in approximately one second,
but with my current version of Clang in just over 4 minutes.  Configuration is
essentially one constant towards the end of the file that determines the
maximum range of numbers to search.  The limiting factor with GCC is to avoid
swap-file thrashing, which allows for growing by a factor of another 6 powers
of 2.

The Clang version is clearly CPU bound, using only around 50Mb RAM in the
attached configuration and not coming close to swapping to page file.

Clang configuration:
   clang version 3.1 (trunk 153754)
   Target: x86_64-apple-darwin11.3.0
   Thread model: posix

Running on OSX Lion.

Quuxplusone commented 12 years ago

Attached evolved.cpp (8049 bytes, application/octet-stream): metaprogramming torture test to compute prime numbers

Quuxplusone commented 12 years ago

The problem is that count_primes(16384) is being evaluated a large number of
times. We are apparently not storing the evaluated value for a non-type
template argument, and are instead re-evaluating it each time we need it.

(240x slower doesn't sound so bad when we're doing 16384x as much work...)

With this patch (and the include of <iostream> and main removed) clang -fsyntax-
only takes 47% less time to handle this code than g++ -fsyntax-only on my
machine:

@@ -227,11 +227,12 @@

 template<unsigned N>
 struct PrimeArrayReversed {
-   PrimeArrayReverseImpl< typename GeneratePrimePack<count_primes(N)>::type>
const VALUE;
+   static constexpr unsigned Primes = count_primes(N);
+   PrimeArrayReverseImpl< typename GeneratePrimePack<Primes>::type> const
VALUE;

    constexpr PrimeArrayReversed() : VALUE{} {}

-   constexpr unsigned operator[](unsigned index) { return
VALUE.d_data[count_primes(N)-index]; }
+   constexpr unsigned operator[](unsigned index) { return VALUE.d_data[Primes-
index]; }
 };

 #define UNREVERSE_ARRAY
@@ -248,14 +249,14 @@

 template<unsigned N>
 struct PrimeArray {
-   PrimeArrayImpl< N, typename GenerateLogPack<count_primes(N)>::type > const
VALUE;
+   PrimeArrayImpl< N, typename
GenerateLogPack<PrimeArrayReversed<N>::Primes>::type > const VALUE;

    constexpr PrimeArray() : VALUE(PrimeArrayReversed<N>{}) {}
 };
 #else
 template<unsigned N>
 struct PrimeArray {
-   PrimeArrayReverseImpl<typename GeneratePrimePack<count_primes(N)>::type>
const VALUE;
+   PrimeArrayReverseImpl<typename
GeneratePrimePack<PrimeArrayReversed<N>::Primes>::type> const VALUE;

    constexpr PrimeArray() : VALUE{} {}
 };

Quuxplusone commented 4 years ago

Tried this again today, and the gcc performance advantage is still present, but much lower - although I had to reduce the test bound from 2 << 14 to 2 << 10 to sit within the default recursion limits in the current compiler. I think this ticket can be closed as resolved, as we are no longer looking at multiple orders of magnitude difference.

Quuxplusone / LLVMBugzillaTest

only evaluate values of non-type template arguments once #12664