4creators opened this issue 4 years ago
Just a note: in Mono we use binary search for the S.R.I. intrinsics (though we have a small subset); an O(1) lookup would be better, I guess.
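For illustration, a minimal sketch of that binary-search approach, assuming the name table is kept pre-sorted (the entry type, table contents, and LookupIntrinsicId are hypothetical stand-ins, not Mono's actual code):

```cpp
#include <cstring>

// Hypothetical entry: an intrinsic method name mapped to a numeric id.
struct IntrinsicEntry
{
    const char* name;
    int         id;
};

// The table must be pre-sorted in strcmp order for binary search to work.
static const IntrinsicEntry kIntrinsicTable[] = {
    {"Abs", 1}, {"Add", 2}, {"Multiply", 3}, {"Xor", 4},
};

// O(log n) lookup over the sorted name table; returns -1 for a miss.
int LookupIntrinsicId(const char* name)
{
    int lo = 0;
    int hi = (int)(sizeof(kIntrinsicTable) / sizeof(kIntrinsicTable[0])) - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, kIntrinsicTable[mid].name);
        if (cmp == 0)
        {
            return kIntrinsicTable[mid].id;
        }
        else if (cmp < 0)
        {
            hi = mid - 1;
        }
        else
        {
            lo = mid + 1;
        }
    }
    return -1;
}
```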
@jkotas @CarolEidt
Do you think there is a chance to have it implemented and shipped with v3.1 and eventually backported to v3.0? The current performance of the JIT with HW intrinsics is ... a bit disappointing - one can feel it when using cloud functions for image processing. The alternative would be to have crossgen support intrinsics for R2R compilation, with preset CPU architecture targets, shipped with v3.1.
So the next steps would be:
It is definitely worth improving the lookup for the intrinsics. That said, I'm not sure how that prioritizes against the other work that's being done in the JIT.
Supporting intrinsics in crossgen is also a reasonable thing to do (and has already been done to a limited extent for SPC.dll in dotnet/coreclr#24689). dotnet/runtime#11689 captures the remaining issue(s).
I'm not sure how that prioritizes against the other work that's being done in the JIT.
That's where data about the impact would come in handy.
Looks like we still don't have data measuring the actual impact of this lookup on jit throughput. I'll see if I can come up with something.
I would suspect the worst case is AVX2.Xor, which is going to end up doing 371 checks for if isa == hwIntrinsicInfoArray[i].isa: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.cpp#L285-L335

It will then do 66 more checks where the ISA matches but the name doesn't, each of which involves a strcmp call.
We could simplify a good bit of this by removing the isa checks, which could simply be a small table that returns the first/last NamedIntrinsic for a given ISA. That would take it down to just the strcmp calls, at least.
We could then probably optimize the string checks a bit, but that would be more complex.
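For concreteness, here is a rough sketch of that range-table idea. All names and enum values below are hypothetical stand-ins for the JIT's real types, and the real range table would be generated from the intrinsic list rather than written by hand:

```cpp
#include <cstring>

// Stub enums standing in for the JIT's real ones (values are placeholders).
enum NamedIntrinsic : int { NI_Illegal = 0, NI_AVX2_Abs, NI_AVX2_Add, NI_AVX2_Xor };
enum InstructionSet : int { InstructionSet_AVX2 = 0, InstructionSet_Count };

// Per-ISA [first, last] range into the intrinsic table; this replaces
// the linear "isa == hwIntrinsicInfoArray[i].isa" scan.
struct IsaRange
{
    NamedIntrinsic first;
    NamedIntrinsic last;
};

static const IsaRange isaRanges[InstructionSet_Count] = {
    /* InstructionSet_AVX2 */ { NI_AVX2_Abs, NI_AVX2_Xor },
};

// Stand-in for the name data the JIT keeps in hwIntrinsicInfoArray.
static const char* const intrinsicNames[] = { "", "Abs", "Add", "Xor" };

const char* lookupName(NamedIntrinsic id)
{
    return intrinsicNames[id];
}

NamedIntrinsic lookupId(InstructionSet isa, const char* methodName)
{
    const IsaRange& range = isaRanges[isa];

    // Only this ISA's candidates are scanned; each check is still one
    // strcmp, but the hundreds of "isa ==" comparisons are gone.
    for (int id = range.first; id <= range.last; id++)
    {
        if (strcmp(methodName, lookupName((NamedIntrinsic)id)) == 0)
        {
            return (NamedIntrinsic)id;
        }
    }
    return NI_Illegal;
}
```

Under that scheme, the AVX2.Xor worst case would drop from 371 ISA checks plus 66 strcmp calls to just the strcmp calls for the matching ISA.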
Probably not a great test, but on Avx2_ro, we spend 2232 ms jitting, and somewhere around 2 ms in lookupId.
We also have the src/coreclr/tests/src/JIT/Performance/CodeQuality/HWIntrinsic/X86/PacketTracer benchmark if we want to test a slightly more "real world" example of intrinsic usage.
However, it sounds like there isn't a huge penalty for it overall and while we could speed it up, it isn't likely to make a noticeable impact.
Was there anything else from the HWIntrinsic-specific code paths that was taking a large amount of time? (I think import, codegen, and emit are the biggest; everything else is largely shared.)
Nothing really jumps out by exclusive profile (under jitNativeCode).
Inclusive: codegen is 17%, importer 14%, morph 10%, linear scan 9%, ...
I'm going to move this to future as it doesn't seem urgent to address now.
During the design phase of the JIT hardware intrinsics work, an inefficient, temporary algorithm was allowed in for looking up intrinsic function names/IDs. The naive search currently used runs in O(n) time for each intrinsic imported, while it is possible to do this lookup in O(1) with very low constant overhead. The problem hits particularly hard in functions that use multiple intrinsic instructions, since the total search time is t = number_of_intrinsics_used * O(n). It is especially exacerbated in applications that require fast startup time: cloud lambda functions and command-line tools, e.g. PowerShell Core.

The function implementing the algorithm which slipped into the release is the following:
https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L186
and the algorithm used:
https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L210
Previously discussed solutions included using a hashtable and/or a fast binary preselection, exploiting the fixed nature of the search terms.
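For illustration only, a minimal sketch of the hashtable direction, using std::unordered_map for brevity; all names and ids here are placeholders, and a production version would more likely use a generated perfect hash over the fixed set of names to get O(1) with low constant overhead:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical map from "ISA.MethodName" to intrinsic id, built once.
// A real implementation would hash (isa, name) pairs directly, or use a
// compile-time-generated perfect hash, avoiding string concatenation
// and the one-time construction cost.
static const std::unordered_map<std::string, int>& IntrinsicMap()
{
    static const std::unordered_map<std::string, int> map = {
        {"AVX2.Xor", 165},  // ids are placeholders
        {"SSE2.Add", 42},
    };
    return map;
}

// Average-case O(1) lookup, replacing the O(n) linear scan.
int LookupIntrinsicId(const char* isaName, const char* methodName)
{
    std::string key = std::string(isaName) + "." + methodName;
    auto it = IntrinsicMap().find(key);
    return (it != IntrinsicMap().end()) ? it->second : -1;
}
```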
@AndyAyersMS @CarolEidt @fiigii @tannergooding
category:implementation theme:vector-codegen skill-level:beginner cost:small impact:medium