Quuxplusone / LLVMBugzillaTest

[feature-request] make vector intrinsics constexpr #30419

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR31446
Status NEW
Importance P enhancement
Reported by Gonzalo BG (gonzalo.gadeschi@gmail.com)
Reported on 2016-12-21 06:48:00 -0800
Last modified on 2020-08-21 03:35:00 -0700
Version trunk
Hardware All All
CC andrew.v.tischenko@gmail.com, bigcheesegs@gmail.com, craig.topper@gmail.com, DrTroll@gmx.de, erich.keane@intel.com, filcab@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mkuper@google.com, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by PR20157, PR42461
See also PR30624, PR21713, PR27177, PR31917, PR20157, PR47249, PR47267

Linear algebra libraries like Eigen3 explicitly vectorize their code. However, because vector intrinsics (SSE, AVX, ...) are not constexpr, it is impossible for them to provide an interface that can easily be used both at compile time and at run time.

Duplicating all their code for running at compile-time is not an option.

A first step towards allowing these libraries to be usable within constant expressions would be to make the vector intrinsics constexpr, allowing their evaluation at compile time.
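
A minimal sketch of the problem (the add4 wrapper is hypothetical): because _mm_add_ps is not usable in a constant expression, any wrapper built on it cannot be constexpr either.

#include <xmmintrin.h>

// Rejected today: the intrinsic call is not a constant expression, so this
// function can never be evaluated at compile time.
constexpr __m128 add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);  // error: constexpr function never produces
                              // a constant expression
}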

Quuxplusone commented 7 years ago

FYI: X86 has been making progress with converting intrinsics (or certain use cases of them) to generic IR: within the headers, via CGBuiltin.cpp, or during InstCombine. We also have constant folding for some target opcodes during DAGCombine. Basic arithmetic, shifts, conversions, and shuffles in particular have been converted.

Quuxplusone commented 7 years ago

Do you have examples of intrinsics that are causing performance issues because they don't constant fold?

Quuxplusone commented 7 years ago

> Do you have examples of intrinsics that are causing performance issues because they don't constant fold?

No, but this issue is not about that.

If they are not marked constexpr, they cannot be used in constexpr functions, so a library that uses them cannot provide a constexpr API.

That is, because Eigen3 uses intrinsics internally for its run-time paths, it cannot provide a constexpr API that also works at compile time.

Quuxplusone commented 7 years ago
Related: bug 21713, which requested that these two be equivalent:

__m128i
emul84(__m128i a, int64_t b, const int ndx) {
    if (ndx)
        a = _mm_insert_epi32(a, b, 1);
    else
        a = _mm_insert_epi32(a, b, 0);
    return a;
}

__m128i
emul85(__m128i a, int64_t b, const int ndx) {
    a = _mm_insert_epi32(a, b, !!ndx);
    return a;
}

See also bug 27177 (is this a duplicate of that bug?).
Quuxplusone commented 7 years ago

Not really; this bug is only about the intrinsic builtin functions not being marked with the C++ constexpr keyword.

Quuxplusone commented 7 years ago

It's not as simple as just marking them constexpr. A constant evaluator for each intrinsic would need to be implemented in clang, in lib/AST/ExprConstant.cpp. The backend support for the intrinsics cannot be used for this.

This would be a huge amount of effort to support all of the SSE/AVX intrinsics. So if we were to do this we would need guidance on which ones are most important in order to prioritize.
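
For context, many of the simpler intrinsics are already defined in clang's headers as plain generic vector operations rather than opaque builtins; roughly, paraphrased from xmmintrin.h:

static __inline__ __m128 __DEFAULT_FN_ATTRS
_mm_add_ps(__m128 __a, __m128 __b)
{
  // No target builtin involved: generic vector addition, which the constant
  // evaluator could fold once it understands vector types.
  return (__m128)((__v4sf)__a + (__v4sf)__b);
}

Intrinsics of this shape would come almost for free once the evaluator handles generic vector operations; the heavy per-intrinsic work in ExprConstant.cpp is for those that lower to target-specific __builtin_ia32_* builtins.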

Quuxplusone commented 7 years ago

That does sound like a titanic task indeed. I've posted this on Eigen's bugzilla, asking for help with prioritizing intrinsics.

I am however worried that constant evaluation support for the intrinsics is the wrong approach for tackling this.

I would be grateful if some of you could ping Richard Smith and Faisal Vali on this issue.

Would it be possible to expose a way for a constexpr function to detect whether it is currently being evaluated by the constant expression evaluator, and to branch on that?

This would allow clang to provide constexpr overloads of the intrinsics in plain C++ in the intrinsics headers, by providing implementations that are specific to the constant expression evaluator.

While not everyone can hack on the compiler, this would allow everybody to work around limitations like this.
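
A sketch of the idea, using a hypothetical is_constant_evaluated() predicate (no such facility existed in clang at the time; see the D prior art below):

// Sketch only: "is_constant_evaluated" stands in for whatever detection
// mechanism the compiler would expose.
constexpr __m128 add4(__m128 a, __m128 b) {
    if (is_constant_evaluated()) {
        // Plain C++ fallback that the constant expression evaluator can run.
        __m128 r = { a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] };
        return r;
    }
    return _mm_add_ps(a, b);  // intrinsic fast path at run time
}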

Quuxplusone commented 7 years ago

There is prior art for this in the D language, which exposes the __ctfe intrinsic. Quoting their docs:

The __ctfe boolean pseudo-variable, which evaluates to true at compile time, but false at run time, can be used to provide an alternative execution path to avoid operations which are forbidden at compile time. Every usage of __ctfe is evaluated before code generation and therefore has no run-time cost, even if no optimizer is used.

http://dlang.org/spec/function.html
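
(For the record: C++20 later standardized exactly this facility as std::is_constant_evaluated(), which clang implements via __builtin_is_constant_evaluated(). A minimal example:)

#include <type_traits>

constexpr int path() {
    if (std::is_constant_evaluated())
        return 1;  // constant-evaluation path
    return 2;      // run-time path
}

constexpr int a = path();              // a == 1, evaluated at compile time
int runtime_value() { return path(); } // returns 2 at run time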

Quuxplusone commented 7 years ago
FWIW here is the list of intrinsics used by Eigen3-trunk in case you want to
use them to prioritize.

I think, however, that a better solution would be to be able to detect when a
constexpr function is being evaluated at compile-time, and be able to branch on
that (e.g. Eigen3 already has these works around available for targets that do
not expose the intrinsics), hence I've filled a new bug for this:

https://llvm.org/bugs/show_bug.cgi?id=31917

The list of intrinsics used by Eigen:

_mm256_add_pd
_mm256_add_ps
_mm256_addsub_pd
_mm256_addsub_ps
_mm256_and_pd
_mm256_and_ps
_mm256_andnot_pd
_mm256_andnot_ps
_mm256_blend_pd
_mm256_blend_ps
_mm256_blendv_pd
_mm256_blendv_ps
_mm256_broadcast_pd
_mm256_broadcast_ps
_mm256_broadcast_sd
_mm256_broadcast_ss
_mm256_broadcastsd_pd
_mm256_castpd256_pd128
_mm256_castpd_ps
_mm256_castps128_ps256
_mm256_castps256_ps128
_mm256_castps_pd
_mm256_castps_si256
_mm256_castsi128_si256
_mm256_castsi256_pd
_mm256_castsi256_ps
_mm256_castsi256_si128
_mm256_ceil_pd
_mm256_ceil_ps
_mm256_cmp_pd
_mm256_cmp_ps
_mm256_cvtepi32_ps
_mm256_cvtpd_epi32
_mm256_cvtph_ps
_mm256_cvtps_epi32
_mm256_cvtps_ph
_mm256_cvttps_epi32
_mm256_div_pd
_mm256_div_ps
_mm256_extract_epi16
_mm256_extractf128_pd
_mm256_extractf128_ps
_mm256_extractf128_si256
_mm256_floor_pd
_mm256_floor_ps
_mm256_fmadd_pd
_mm256_fmadd_ps
_mm256_hadd_pd
_mm256_hadd_ps
_mm256_i32gather_pd
_mm256_i32gather_ps
_mm256_insertf128_ps
_mm256_insertf128_si256
_mm256_load_pd
_mm256_load_ps
_mm256_load_si256
_mm256_loadu2_m128d
_mm256_loadu_pd
_mm256_loadu_ps
_mm256_loadu_si256
_mm256_max_pd
_mm256_max_ps
_mm256_min_pd
_mm256_min_ps
_mm256_movedup_pd
_mm256_movehdup_ps
_mm256_moveldup_ps
_mm256_mul_pd
_mm256_mul_ps
_mm256_mullo_epi32
_mm256_or_pd
_mm256_or_ps
_mm256_permute2f128_pd
_mm256_permute2f128_ps
_mm256_permute2x128_si256
_mm256_permute_pd
_mm256_permute_ps
_mm256_round_pd
_mm256_round_ps
_mm256_rsqrt_ps
_mm256_set1_epi16
_mm256_set1_epi32
_mm256_set1_pd
_mm256_set1_ps
_mm256_set_epi16
_mm256_set_epi32
_mm256_set_pd
_mm256_set_ps
_mm256_setr_epi32
_mm256_setzero_pd
_mm256_setzero_ps
_mm256_setzero_si256
_mm256_shuffle_pd
_mm256_shuffle_ps
_mm256_slli_epi32
_mm256_sqrt_pd
_mm256_sqrt_ps
_mm256_srli_epi32
_mm256_store_pd
_mm256_store_ps
_mm256_store_si256
_mm256_storeu_pd
_mm256_storeu_ps
_mm256_storeu_si256
_mm256_sub_pd
_mm256_sub_ps
_mm256_unpackhi_epi16
_mm256_unpackhi_epi32
_mm256_unpackhi_epi64
_mm256_unpackhi_ps
_mm256_unpacklo_epi16
_mm256_unpacklo_epi32
_mm256_unpacklo_epi64
_mm256_unpacklo_ps
_mm256_xor_pd
_mm256_xor_ps
_mm512_abs_ps
_mm512_add_epi64
_mm512_add_pd
_mm512_add_ps
_mm512_and_pd
_mm512_and_ps
_mm512_and_si512
_mm512_andnot_pd
_mm512_andnot_ps
_mm512_broadcastsd_pd
_mm512_broadcastss_ps
_mm512_castsi512_pd
_mm512_castsi512_ps
_mm512_cmp_pd_mask
_mm512_cmp_ps_mask
_mm512_cvtepi32_ps
_mm512_cvtepu32_epi64
_mm512_cvtpd_epi64
_mm512_cvtph_ps
_mm512_cvtps_ph
_mm512_cvttps_epi32
_mm512_div_pd
_mm512_div_ps
_mm512_extractf32x4_ps
_mm512_extractf32x8_ps
_mm512_extractf64x4_pd
_mm512_extracti32x4_epi32
_mm512_floor_ps
_mm512_fmadd_pd
_mm512_fmadd_ps
_mm512_i32gather_pd
_mm512_i32gather_ps
_mm512_i32scatter_pd
_mm512_i32scatter_ps
_mm512_insertf32x4
_mm512_insertf32x8
_mm512_insertf64x2
_mm512_insertf64x4
_mm512_load_pd
_mm512_load_ps
_mm512_load_si512
_mm512_loadu_pd
_mm512_loadu_ps
_mm512_loadu_si512
_mm512_mask_blend_pd
_mm512_mask_blend_ps
_mm512_max_pd
_mm512_max_ps
_mm512_min_pd
_mm512_min_ps
_mm512_mul_pd
_mm512_mul_ps
_mm512_mul_round_pd
_mm512_mullo_epi32
_mm512_or_pd
_mm512_or_ps
_mm512_permute_ps
_mm512_permutexvar_pd
_mm512_permutexvar_ps
_mm512_rsqrt14_pd
_mm512_rsqrt14_ps
_mm512_rsqrt28_ps
_mm512_set1_epi32
_mm512_set1_epi64
_mm512_set1_pd
_mm512_set1_ps
_mm512_set_epi32
_mm512_set_pd
_mm512_set_ps
_mm512_setzero_pd
_mm512_setzero_ps
_mm512_shuffle_pd
_mm512_shuffle_ps
_mm512_slli_epi32
_mm512_slli_epi64
_mm512_sqrt_pd
_mm512_sqrt_ps
_mm512_srli_epi32
_mm512_store_pd
_mm512_store_ps
_mm512_storeu_pd
_mm512_storeu_ps
_mm512_storeu_si512
_mm512_sub_pd
_mm512_sub_ps
_mm512_undefined_pd
_mm512_undefined_ps
_mm512_unpackhi_pd
_mm512_unpackhi_ps
_mm512_unpacklo_pd
_mm512_unpacklo_ps
_mm512_xor_pd
_mm512_xor_ps
_mm_abs_epi32
_mm_add_epi32
_mm_add_pd
_mm_add_ps
_mm_add_sd
_mm_add_ss
_mm_addsub_pd
_mm_addsub_ps
_mm_alignr_epi8
_mm_and_pd
_mm_and_ps
_mm_and_si128
_mm_andnot_pd
_mm_andnot_ps
_mm_andnot_si128
_mm_blend_pd
_mm_blend_ps
_mm_blendv_epi8
_mm_blendv_pd
_mm_blendv_ps
_mm_broadcast_ss
_mm_castpd_ps
_mm_castpd_si128
_mm_castps_pd
_mm_castps_si128
_mm_castsi128_pd
_mm_castsi128_ps
_mm_ceil_pd
_mm_ceil_ps
_mm_cmpeq_epi32
_mm_cmpeq_pd
_mm_cmpeq_ps
_mm_cmpge_ps
_mm_cmpgt_epi32
_mm_cmpgt_pd
_mm_cmpgt_ps
_mm_cmple_ps
_mm_cmplt_epi32
_mm_cmplt_ps
_mm_cmpnge_ps
_mm_cvtepi32_pd
_mm_cvtepi32_ps
_mm_cvtm64_si64
_mm_cvtpd_ps
_mm_cvtps_pd
_mm_cvtsd_f64
_mm_cvtsi128_si32
_mm_cvtsi64_m64
_mm_cvtsi64_si32
_mm_cvtss_f32
_mm_cvttpd_epi32
_mm_cvttps_epi32
_mm_div_pd
_mm_div_ps
_mm_extract_epi16
_mm_extract_epi32
_mm_floor_pd
_mm_floor_ps
_mm_fmadd_pd
_mm_fmadd_ps
_mm_hadd_epi32
_mm_hadd_pd
_mm_hadd_ps
_mm_load1_ps
_mm_load_pd
_mm_load_pd1
_mm_load_ps
_mm_load_ps1
_mm_load_sd
_mm_load_si128
_mm_load_ss
_mm_loaddup_pd
_mm_loadh_pi
_mm_loadl_epi64
_mm_loadl_pi
_mm_loadu_pd
_mm_loadu_ps
_mm_loadu_si128
_mm_max_epi32
_mm_max_pd
_mm_max_ps
_mm_max_sd
_mm_max_ss
_mm_min_epi32
_mm_min_pd
_mm_min_ps
_mm_min_sd
_mm_min_ss
_mm_move_sd
_mm_move_ss
_mm_movedup_pd
_mm_movehdup_ps
_mm_movehl_ps
_mm_moveldup_ps
_mm_movelh_ps
_mm_mul_epu32
_mm_mul_pd
_mm_mul_ps
_mm_mul_sd
_mm_mul_ss
_mm_mullo_epi32
_mm_or_pd
_mm_or_ps
_mm_or_si128
_mm_permute_ps
_mm_prefetch
_mm_round_pd
_mm_round_ps
_mm_rsqrt_ps
_mm_set1_epi16
_mm_set1_epi32
_mm_set1_pd
_mm_set1_pi16
_mm_set1_ps
_mm_set_epi16
_mm_set_epi32
_mm_set_pd
_mm_set_pi16
_mm_set_ps
_mm_set_ps1
_mm_set_sd
_mm_set_ss
_mm_setr_epi32
_mm_setzero_pd
_mm_setzero_ps
_mm_setzero_si128
_mm_shuffle_epi32
_mm_shuffle_pd
_mm_shuffle_ps
_mm_slli_epi32
_mm_slli_epi64
_mm_sqrt_pd
_mm_sqrt_ps
_mm_sqrt_ss
_mm_srai_epi32
_mm_srli_epi32
_mm_srli_epi64
_mm_store_pd
_mm_store_ps
_mm_store_si128
_mm_storel_pi
_mm_storeu_pd
_mm_storeu_ps
_mm_storeu_si128
_mm_sub_epi32
_mm_sub_pd
_mm_sub_ps
_mm_unpackhi_epi16
_mm_unpackhi_epi32
_mm_unpackhi_epi64
_mm_unpackhi_pd
_mm_unpackhi_pi16
_mm_unpackhi_pi32
_mm_unpackhi_ps
_mm_unpacklo_epi16
_mm_unpacklo_epi32
_mm_unpacklo_epi64
_mm_unpacklo_pd
_mm_unpacklo_pi16
_mm_unpacklo_pi32
_mm_unpacklo_ps
_mm_xor_pd
_mm_xor_ps
_mm_xor_si128
Quuxplusone commented 7 years ago
Before even looking at any intrinsics, we'd need to get the vector types
working with constexpr:

typedef int __v4si __attribute__((__vector_size__(16)));

constexpr int sum_i32(__v4si x) {
  return x[0] + x[1] + x[2] + x[3];
}

constexpr __v4si add_i32(__v4si x, __v4si y) {
  return x + y;
}

error: constexpr function never produces a constant expression [-Winvalid-constexpr]

subexpression not valid in a constant expression
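
A contemporary workaround is to mirror the vector in an aggregate that the evaluator already understands; a minimal sketch (the v4si_const name is hypothetical):

#include <array>

// Sketch: a scalar stand-in for __v4si that works in constant expressions.
struct v4si_const { std::array<int, 4> v; };

constexpr int sum_i32(v4si_const x) {
    return x.v[0] + x.v[1] + x.v[2] + x.v[3];
}

static_assert(sum_i32({{1, 2, 3, 4}}) == 10, "");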
Quuxplusone commented 4 years ago

Some initial constexpr support for vector types landed in D79755/30c16670e42, so we can begin investigating adding the constexpr tag to some basic sse/avx intrinsics that are implemented as generic operations (add/sub/mul/div/and/or/xor).

The next step would be to add constexpr support for vector initialization and the convertvector/shufflevector builtins.
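
With that in place, plain generic vector arithmetic is expected to fold at compile time; a minimal sketch, assuming a clang build containing D79755:

typedef int __v4si __attribute__((__vector_size__(16)));

constexpr __v4si add_i32(__v4si x, __v4si y) {
    return x + y;  // generic vector addition, now constant-foldable
}

constexpr __v4si x = {1, 2, 3, 4};
constexpr __v4si y = {5, 6, 7, 8};
constexpr __v4si r = add_i32(x, y);  // evaluated by the constant evaluator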

Quuxplusone commented 4 years ago

To add constexpr tags to the intrinsics headers we need to ensure that C builds can successfully ignore them. Would it make sense to add support for a constexpr-style tag to clang? I don't know if we have an equivalent to this already.

An alternative would be to add a #define wrapped inside a __cplusplus / __has_feature(cxx_constexpr) check. We already have the __DEFAULT_FN_ATTRS local defines, which we could hijack, adding constexpr variants for C++ builds.
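
A minimal sketch of that alternative (the *_CONSTEXPR macro name is illustrative, not an existing define):

/* Expands to `constexpr` only in C++ builds that support it, so the same
   headers still compile as plain C. */
#if defined(__cplusplus) && __has_feature(cxx_constexpr)
#define __DEFAULT_FN_ATTRS_CONSTEXPR __DEFAULT_FN_ATTRS constexpr
#else
#define __DEFAULT_FN_ATTRS_CONSTEXPR __DEFAULT_FN_ATTRS
#endif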

Quuxplusone commented 4 years ago

A small initial demo patch: https://reviews.llvm.org/D86229