
add _allow_cpu_features intrinsic (like the intel compiler) #29609

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR30634
Status NEW
Importance P enhancement
Reported by Alkis Evlogimenos (alkis@evlogimenos.com)
Reported on 2016-10-07 08:44:34 -0700
Last modified on 2017-09-20 05:33:54 -0700
Version trunk
Hardware PC All
CC amjad.aboud@intel.com, boaz.ouriel@intel.com, david.l.kreitzer@intel.com, echristo@gmail.com, hfinkel@anl.gov, jan.wassenberg@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mkuper@google.com, rnk@google.com, spatel+llvm@rotateright.com, zvirack@gmail.com
See also
https://software.intel.com/en-us/articles/new-intrinsic-allow-cpu-features-support

This intrinsic, among other things, makes it possible to build composable
components that are optimal for all microarchitectures within a single binary
(something like function multi-versioning, but better).

Bonus points if all instructions dominated by a call to the intrinsic are also
scheduled as per the best matching CPU:

if (skylake) {
  _allow_cpu_features(_FEATURE_BMI2);
  // do stuff
  // use bmi2 intrinsics here even if -march=X doesn't support them
}

Code in "// do stuff", etc. should be tuned as if the CPU were Haswell, even
if the rest of the file is tuned for Westmere. The rationale is that Haswell
is the earliest processor supporting FEATURE_BMI2, and it was not introduced
earlier than the per-file target processor.
Quuxplusone commented 8 years ago

Not exactly the same thing, but you might find this useful: http://clang.llvm.org/docs/AttributeReference.html#target-gnu-target
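
For illustration, here is a minimal sketch of how the target attribute combines with __builtin_cpu_supports for runtime dispatch today (the feature and function names below are just examples, not anything from this report):

#include <cstdint>
#include <immintrin.h>

__attribute__((target("bmi2")))
uint64_t ExtractBitsBMI2(uint64_t v, uint64_t mask) {
  return _pext_u64(v, mask);                  // BMI2 PEXT
}

uint64_t ExtractBitsGeneric(uint64_t v, uint64_t mask) {
  uint64_t out = 0, bit = 1;                  // portable bit-extract fallback
  for (; mask; mask >>= 1, v >>= 1)
    if (mask & 1) { if (v & 1) out |= bit; bit <<= 1; }
  return out;
}

uint64_t ExtractBits(uint64_t v, uint64_t mask) {
  if (__builtin_cpu_supports("bmi2"))         // runtime dispatch
    return ExtractBitsBMI2(v, mask);
  return ExtractBitsGeneric(v, mask);
}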

Quuxplusone commented 8 years ago

This looks useful indeed. Is the compiler smart enough to inline functions with the same set of features? More generally, can it inline functions for implied features? Specifically, if f calls g, f has feature bmi2, and g has feature bmi, then in practice g can always be inlined, because there is no arch that supports bmi2 but not bmi.
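
For concreteness, the scenario I mean looks like this (names are illustrative):

#include <cstdint>

__attribute__((target("bmi")))
uint64_t g(uint64_t x) { return x & (x - 1); }  // imagine BMI intrinsics here

__attribute__((target("bmi2")))
uint64_t f(uint64_t x) {
  // Every CPU with BMI2 also has BMI, so inlining g into f would be safe in
  // practice; the question is whether the compiler's compatibility check
  // knows that.
  return g(x);
}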

Quuxplusone commented 8 years ago
(In reply to comment #2)
> This looks useful indeed. Is the compiler smart enough to inline functions
> with the same set of features? More generally, can it inline functions for
> implied features? Specifically, if f calls g, f has feature bmi2, and g has
> feature bmi, then in practice g can always be inlined, because there is no
> arch that supports bmi2 but not bmi.

I don't know. Eric, do you know?
Quuxplusone commented 8 years ago

The compiler should be able to (depending on port) inline functions with compatible features. There's a hook in the inliner to check this sort of thing.

Quuxplusone commented 8 years ago

I think there is very little interest in tracking CPU feature availability in the control flow graph. We have function-level CPU feature tracking through attribute((target)), and I hope this is enough for most users.

Quuxplusone commented 8 years ago
Let me expand on what I am doing and why it is important.

I have a library approach to function multiversioning, which gives the
developer the ability to build binaries that run on a collection of microarchs
at close to optimal speed, while giving her full control over code size and
the amortization of dispatch cost, and at the same time allowing maximum code
sharing.

Please bear with the code; I tried to keep it to a minimum. I have put comments
where each piece of the puzzle is implemented and how it all fits together:

https://gist.github.com/alkis/7fd9678e64ae885fd9a4135ee7411360

To see the benefits of this approach, assume that our higher level algorithm
(or kernel) is BroadWordSearch. We want to find the position of the n'th 1 in a
binary stream:

template <class Ops>
uint64_t BroadWordSearch(Ops, uint64_t* words, uint64_t rank) {
  uint64_t index = 0;
  while (true) {
    auto ones = Ops::popcnt64(*words);
    if (rank < ones) break;
    rank -= ones;
    index += 64;
    ++words;
  }
  // Position of the rank'th set bit within the current word, plus the bits
  // already skipped in earlier words.
  return index + Ops::select64(*words, rank);
}

SelectCPU(BroadWordSearch, &words, 123);

So now we have a single version of the algorithm, BroadWordSearch, while the
microarch-specific bits are extracted out into a library. More importantly, we
can choose to put the microarch dispatch at BroadWordSearch, or at any
higher-level function we want (to amortize the cost of dispatch). We can even
put the dispatch inside a constructor and instantiate N different subclasses
of a base class, effectively turning the table-based dispatch into
indirect-function-call dispatch at will. We can also choose how many copies of
the code we end up with by changing the number of CPUs handled in SelectCPU().
Unfortunately, the compiler neither inlines our ops nor compiles our kernel
assuming it will really run on the target we specify. With an
_allow_cpu_features intrinsic inside each case of the dispatch code, this
becomes possible.
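
For reference, a rough sketch of the shape of the Ops types and the SelectCPU dispatcher (illustrative only; the linked gist has the real code):

#include <cstdint>
#include <immintrin.h>

struct GenericOps {
  static uint64_t popcnt64(uint64_t v) { return __builtin_popcountll(v); }
  static uint64_t select64(uint64_t v, uint64_t rank) {
    for (uint64_t i = 0; ; ++i)                 // position of the rank'th set
      if (((v >> i) & 1) && rank-- == 0)        // bit (assumes it exists)
        return i;
  }
};

struct BMI2Ops {
  __attribute__((target("popcnt")))
  static uint64_t popcnt64(uint64_t v) { return _mm_popcnt_u64(v); }
  __attribute__((target("bmi2")))
  static uint64_t select64(uint64_t v, uint64_t rank) {
    return __builtin_ctzll(_pdep_u64(uint64_t{1} << rank, v));  // PDEP trick
  }
};

template <class Fn, class... Args>
uint64_t SelectCPU(Fn fn, Args... args) {
  if (__builtin_cpu_supports("bmi2"))
    return fn(BMI2Ops{}, args...);
  return fn(GenericOps{}, args...);
}

// e.g. dispatching the kernel above through a generic lambda:
//   uint64_t pos = SelectCPU(
//       [](auto ops, uint64_t* w, uint64_t r) { return BroadWordSearch(ops, w, r); },
//       words, 123);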
Quuxplusone commented 8 years ago
Would it be enough for your purposes if we extended __attribute__((target)) to
allow stuff like this:

template <class Ops>
uint64_t __attribute__((target(Ops::cpu)))
BroadWordSearch(Ops, uint64_t* words, uint64_t rank) {
  ...
}

SelectCPU(BroadWordSearch, &words, 123);

If Clang were smart enough to take constexpr C strings as targets in addition
to string literals, would that be workable?
Quuxplusone commented 8 years ago

It would be enough, but the library would become less ergonomic.

Instead of adding only a template parameter, we would also have to add the attribute. If the dispatch point is high enough, this means annotating the transitive closure of all called functions.

I am guessing that a function-level switch of features is easier to implement than a basic-block-level switch. If that's the case, something like this would work:

__invoke_with_attribute((xxx), fn, args);

The semantics are:

calls fn(args...) as if fn had attribute((xxx))
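
To make the intended semantics concrete, here is roughly what the builtin would expand to, written out by hand (the builtin itself is hypothetical; Sum and Sum_avx2 are illustrative names):

#include <cstdint>

uint64_t Sum(const uint64_t* p, uint64_t n) {       // an ordinary function
  uint64_t s = 0;
  for (uint64_t i = 0; i < n; ++i) s += p[i];
  return s;
}

// The proposed call
//   __invoke_with_attribute((target("avx2")), Sum, p, n);
// would behave as if the compiler cloned Sum under the attribute and called
// the clone, i.e. roughly:
__attribute__((target("avx2")))
uint64_t Sum_avx2(const uint64_t* p, uint64_t n) {  // hand-written clone
  uint64_t s = 0;
  for (uint64_t i = 0; i < n; ++i) s += p[i];       // free to use AVX2 here
  return s;
}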

Quuxplusone commented 8 years ago

Function level attributes are amazingly easier to implement. The other isn't impossible, but the ramifications inside the compiler are unknown (and Intel's documentation really isn't clear here either).

Quuxplusone commented 8 years ago
(In reply to comment #9)
> Function level attributes are amazingly easier to implement. The other isn't
> impossible, but the ramifications inside the compiler are unknown (and
> Intel's documentation really isn't clear here either).

The function-level attributes are easier to implement, but we either need to:

 1. Forbid inlining when doing so would drop target capabilities
 2. Allow inlining, but drop target capabilities when doing so

or we end up back with this more-general problem. Unfortunately, properly
implementing this is going to end up touching a large fraction of the common
passes (specifically, lots of MI-level passes -- there are now registers and
instructions that can only be used in part of the CFG -- and IR users of TTI
will need to make their queries CFG-dependent). Not a small project.
Quuxplusone commented 8 years ago

We already do the #1.

Quuxplusone commented 8 years ago
(In reply to comment #11)
> We already do the #1.

For reference, that TTI hook is:
  bool areInlineCompatible(const Function *Caller,
                           const Function *Callee) const {
    return (Caller->getFnAttribute("target-cpu") ==
            Callee->getFnAttribute("target-cpu")) &&
           (Caller->getFnAttribute("target-features") ==
            Callee->getFnAttribute("target-features"));
  }

which the inliner gets to from:
functionsHaveCompatibleAttributes()
Quuxplusone commented 8 years ago
(In reply to comment #12)
> (In reply to comment #11)
> > We already do the #1.
>
> For reference, that TTI hook is:
>   bool areInlineCompatible(const Function *Caller,
>                            const Function *Callee) const {
>     return (Caller->getFnAttribute("target-cpu") ==
>             Callee->getFnAttribute("target-cpu")) &&
>            (Caller->getFnAttribute("target-features") ==
>             Callee->getFnAttribute("target-features"));
>   }
>
> which the inliner gets to from:
> functionsHaveCompatibleAttributes()

Indeed. This could be better (we could allow inlining when the caller has a
superset of the capabilities of the callee).
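
For example, a superset check could look roughly like this (an illustrative sketch, not in-tree code; it assumes the usual "+feature,-feature" encoding of the "target-features" attribute):

#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSet.h"
#include "llvm/IR/Function.h"
using namespace llvm;

static bool callerFeaturesCoverCallee(const Function *Caller,
                                      const Function *Callee) {
  StringSet<> CallerFeats;
  SmallVector<StringRef, 32> Feats;
  Caller->getFnAttribute("target-features").getValueAsString()
      .split(Feats, ',', /*MaxSplit=*/-1, /*KeepEmpty=*/false);
  for (StringRef F : Feats)
    CallerFeats.insert(F);

  Feats.clear();
  Callee->getFnAttribute("target-features").getValueAsString()
      .split(Feats, ',', /*MaxSplit=*/-1, /*KeepEmpty=*/false);
  for (StringRef F : Feats)
    if (F.startswith("+") && !CallerFeats.count(F))
      return false;            // callee requires something the caller lacks
  return true;
}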
Quuxplusone commented 8 years ago
lib/Target/X86/X86TargetTransformInfo.cpp:bool X86TTIImpl::areInlineCompatible(const Function *Caller,

:)
Quuxplusone commented 8 years ago

Apologies if my compiler-fu is rusty. Is it possible to do the following:

Given a function f that is called under __invoke_with_attribute, clone it, allow all functions to be inlined into the clone, and if a function is chosen for inlining, clone it as well (in case its address is taken). Do this transitively.

Quuxplusone commented 7 years ago

Comment 7 sounds very useful for a current SIMD pain point: generating AVX2 and SSE4 implementations from the same source code. We could use a thin template wrapper over intrinsics for that, but each calling function and the template itself would require target attributes.

Having parameterized target attributes as suggested in #7 would help a lot. AFAIK the only other alternative is to implement the 'portable' SIMD algorithms as a macro expanded inside two noinline functions with the SSE4 and AVX2 target attributes.
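
For reference, a sketch of that macro-based workaround (all names here are illustrative):

#include <cstddef>

// The same 'portable' body, expanded inside two noinline functions; the
// target attribute controls how each copy is vectorized and encoded.
#define ADD_ALL_BODY(out, a, b, n)            \
  for (std::size_t i = 0; i < (n); ++i) {     \
    (out)[i] = (a)[i] + (b)[i];               \
  }

__attribute__((noinline, target("sse4.2")))
void AddAllSSE4(float* out, const float* a, const float* b, std::size_t n) {
  ADD_ALL_BODY(out, a, b, n)
}

__attribute__((noinline, target("avx2")))
void AddAllAVX2(float* out, const float* a, const float* b, std::size_t n) {
  ADD_ALL_BODY(out, a, b, n)
}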