Open Quuxplusone opened 8 years ago
Bugzilla Link | PR30634 |
Status | NEW |
Importance | P enhancement |
Reported by | Alkis Evlogimenos (alkis@evlogimenos.com) |
Reported on | 2016-10-07 08:44:34 -0700 |
Last modified on | 2017-09-20 05:33:54 -0700 |
Version | trunk |
Hardware | PC All |
CC | amjad.aboud@intel.com, boaz.ouriel@intel.com, david.l.kreitzer@intel.com, echristo@gmail.com, hfinkel@anl.gov, jan.wassenberg@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mkuper@google.com, rnk@google.com, spatel+llvm@rotateright.com, zvirack@gmail.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
Not exactly the same thing, but you might find this useful: http://clang.llvm.org/docs/AttributeReference.html#target-gnu-target
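For example (a hedged sketch; the function names are illustrative, not from the thread), the target attribute lets a single translation unit contain functions compiled for different feature sets:

```cpp
#include <cstdint>

// Compiled as if -mpopcnt were in effect, but only for this function;
// Clang/GCC can then lower the builtin to a single POPCNT instruction.
__attribute__((target("popcnt")))
uint64_t popcount_hw(uint64_t x) {
  return __builtin_popcountll(x);
}

// Baseline version using a portable bit-twiddling loop
// (clears the lowest set bit each iteration); safe on any CPU.
uint64_t popcount_sw(uint64_t x) {
  uint64_t c = 0;
  for (; x; x &= x - 1) ++c;
  return c;
}
```

A runtime dispatcher would check CPUID and call popcount_hw only when the feature is actually present.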
This looks useful indeed. Is the compiler smart enough to inline functions with the same set of features? More generally, can it inline functions for implied features? Specifically, if f calls g, f has feature bmi2, and g has feature bmi, then in practice g can always be inlined, because no architecture supports bmi2 but not bmi.
(In reply to comment #2)
> This looks useful indeed. Is the compiler smart enough to inline functions
> with the same set of features? More generically can it inline functions for
> implied features? Specifically, if f calls g and f has feature bmi2 and g
> feature bmi, in practice g can always be inlined because there is no arch
> that supports bmi2 but not bmi.
I don't know. Eric, do you know?
The compiler should be able to (depending on port) inline functions with compatible features. There's a hook in the inliner to check this sort of thing.
I think there is very little interest in tracking CPU feature availability in the control flow graph. We have function-level CPU feature tracking through __attribute__((target)), and I hope this is enough for most users.
Let me expand on what I am doing and why it is important.
I have a library approach to function multiversioning, which gives the
developer the ability to build binaries that run on a collection of microarchs
at close to optimal speed, with full control over code size and over how the
dispatch cost is amortized, while at the same time allowing maximum code
sharing.
Please bear with the code; I tried to keep it to a minimum. I have added comments
explaining where each piece of the puzzle is implemented and how it all fits together:
https://gist.github.com/alkis/7fd9678e64ae885fd9a4135ee7411360
To see the benefits of this approach, assume that our higher-level algorithm
(or kernel) is BroadWordSearch. We want to find the position of the n'th 1 in a
binary stream:
template <class Ops>
uint64_t BroadWordSearch(Ops, uint64_t* words, uint64_t rank) {
  uint64_t index = 0;
  while (true) {
    auto ones = Ops::popcnt64(*words);
    if (rank < ones) break;
    rank -= ones;
    index += 64;
    ++words;
  }
  return index + Ops::select64(*words, rank);
}
SelectCPU(BroadWordSearch, &words, 123);
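For concreteness, here is a minimal, hedged sketch of what an Ops bundle and the SelectCPU dispatcher might look like (the real implementations are in the linked gist; everything below is illustrative):

```cpp
#include <cstdint>

// Illustrative ops bundle; the gist would also provide Sse4Ops,
// Avx2Ops, etc., mapping each operation to microarch intrinsics.
struct GenericOps {
  static uint64_t popcnt64(uint64_t x) {
    uint64_t c = 0;
    for (; x; x &= x - 1) ++c;  // clear lowest set bit per iteration
    return c;
  }
  // Return the index of the rank'th (0-based) set bit, or 64 if
  // rank is out of range.
  static uint64_t select64(uint64_t x, uint64_t rank) {
    for (uint64_t i = 0; i < 64; ++i)
      if (((x >> i) & 1) && rank-- == 0) return i;
    return 64;
  }
};

// Illustrative dispatcher: real code would query CPUID once and pick
// the best available Ops bundle before forwarding the arguments.
template <class Fn, class... Args>
auto SelectCPU(Fn fn, Args... args) {
  return fn(GenericOps{}, args...);
}
```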
So now we have a single version of the algorithm, BroadWordSearch, while the
microarch-specific bits are extracted out into a library. More importantly, we
can choose to put the microarch dispatch at BroadWordSearch, or at any
higher-level function we want (to amortize the cost of dispatch). We can even
put the dispatch inside a constructor and instantiate N different subclasses of
a base class, effectively turning the table-based dispatch into an indirect
function call at will. We can also choose how many copies of the code we
generate by changing the number of CPUs handled in SelectCPU().
Unfortunately, the compiler neither inlines our ops nor compiles our kernel
under the assumption that it will really run on the target we specify. With an
_allow_cpu_features intrinsic inside each case of the dispatch code, this
would become possible.
Would it be enough for your purposes if we extended __attribute__((target)) to
allow stuff like this:
template <class Ops>
uint64_t __attribute__((target(Ops::cpu)))
BroadWordSearch(Ops, uint64_t* words, uint64_t rank) {
...
}
SelectCPU(BroadWordSearch, &words, 123);
If Clang were smart enough to accept constexpr C strings as targets in addition
to string literals, would that be workable?
It would be enough, but the library would become less ergonomic.
Instead of only adding a template parameter, we would also have to add the attribute. If the dispatch point is high enough, this means annotating the transitive closure of all called functions.
I am guessing that a function-level switch of features is easier to implement than a basic-block-level switch. If that's the case, something like this would work:
__invoke_with_attribute((xxx), fn, args);
The semantics are:
call fn(args...) as if fn had __attribute__((xxx))
Function-level attributes are much easier to implement. The other isn't impossible, but the ramifications inside the compiler are unknown (and Intel's documentation really isn't clear here either).
(In reply to comment #9)
> Function level attributes are amazingly easier to implement. The other isn't
> impossible, but the ramifications inside the compiler are unknown (and
> Intel's documentation really isn't clear here either).
The function-level attributes are easier to implement, but we either need to:
1. Forbid inlining when doing so would drop target capabilities
2. Allow inlining, but drop target capabilities when doing so
or we end up back with this more-general problem. Unfortunately, properly
implementing this is going to end up touching a large fraction of the common
passes (specifically, lots of MI-level passes -- there are now registers and
instructions that can only be used in part of the CFG -- and IR users of TTI
will need to make their queries CFG-dependent). Not a small project.
We already do #1.
(In reply to comment #11)
> We already do the #1.
For reference, that TTI hook is:
bool areInlineCompatible(const Function *Caller,
                         const Function *Callee) const {
  return (Caller->getFnAttribute("target-cpu") ==
          Callee->getFnAttribute("target-cpu")) &&
         (Caller->getFnAttribute("target-features") ==
          Callee->getFnAttribute("target-features"));
}
which the inliner gets to from:
functionsHaveCompatibleAttributes()
(In reply to comment #12)
> (In reply to comment #11)
> > We already do the #1.
>
> For reference, that TTI hook is:
> bool areInlineCompatible(const Function *Caller,
> const Function *Callee) const {
> return (Caller->getFnAttribute("target-cpu") ==
> Callee->getFnAttribute("target-cpu")) &&
> (Caller->getFnAttribute("target-features") ==
> Callee->getFnAttribute("target-features"));
> }
>
> which the inliner gets to from:
> functionsHaveCompatibleAttributes()
Indeed. This could be better (we could allow inlining when the caller has a
superset of the capabilities of the callee).
lib/Target/X86/X86TargetTransformInfo.cpp:bool X86TTIImpl::areInlineCompatible(const Function *Caller,
:)
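A subset-based check of that kind could look like the following sketch (illustrative only, not LLVM's actual implementation; it assumes the comma-separated "+feat,-feat" encoding of the target-features attribute string):

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Collect the features enabled ("+"-prefixed) in a target-features
// string such as "+bmi,+bmi2,-avx512f".
static std::set<std::string> EnabledFeatures(const std::string& s) {
  std::set<std::string> out;
  std::stringstream ss(s);
  std::string tok;
  while (std::getline(ss, tok, ','))
    if (!tok.empty() && tok[0] == '+') out.insert(tok.substr(1));
  return out;
}

// Allow inlining when the caller enables a superset of the callee's
// features: everything the callee was compiled for is still legal.
static bool CallerCoversCallee(const std::string& caller,
                               const std::string& callee) {
  auto have = EnabledFeatures(caller);
  for (const auto& f : EnabledFeatures(callee))
    if (!have.count(f)) return false;
  return true;
}
```

A real implementation would also have to account for features that change ABI or code generation globally, which is why a plain subset check is only a starting point.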
Apologies if my compiler foo is rusty. Is it possible to do the following:
given a function f that is called under __invoke_with_attribute, clone it, allow all functions to be inlined into it, and if one is chosen for inlining, clone it as well (in case its address is taken). Do this transitively.
Having parameterized target attributes as suggested in #7 would help a lot. AFAIK the only other alternative is to implement the 'portable' SIMD algorithms as a macro expanded inside two noinline functions with the SSE4 and AVX2 target attributes.
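That macro-based workaround might look like the following sketch (the kernel and all names here are illustrative, not from the thread):

```cpp
#include <cstdint>

// The shared kernel body, stamped out once per target attribute.
// Sums the population counts of n words.
#define POPCNT_SUM_BODY(words, n)                   \
  uint64_t total = 0;                               \
  for (uint64_t i = 0; i < (n); ++i)                \
    total += __builtin_popcountll((words)[i]);      \
  return total;

__attribute__((noinline, target("sse4.2")))
uint64_t PopcntSumSse4(const uint64_t* words, uint64_t n) {
  POPCNT_SUM_BODY(words, n)
}

__attribute__((noinline, target("avx2")))
uint64_t PopcntSumAvx2(const uint64_t* words, uint64_t n) {
  POPCNT_SUM_BODY(words, n)
}
```

The caller would then select between the two noinline entry points at runtime based on CPUID, which is exactly the per-kernel annotation burden the comment complains about.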