dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

JIT Hardware Intrinsics low compiler throughput due to inefficient intrinsic identification algorithm during import phase #13617

Open 4creators opened 4 years ago

4creators commented 4 years ago

During the design phase of the JIT hardware intrinsics compiler bits, an inefficient, temporary algorithm was allowed in for searching for intrinsic function names/ids. The naive search currently used takes O(n) time for each intrinsic imported, while it is possible to do this lookup in O(1) with very low constant overhead. The problem hits particularly hard in functions that use many intrinsic instructions, since the total search time is t = number_of_intrinsics_used * O(n). It is further exacerbated in applications that require fast startup time: cloud lambda functions and command-line tools, e.g. PowerShell Core.

The function implementing the algorithm that slipped into the release is the following:

https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L186

and the algorithm used:

https://github.com/dotnet/coreclr/blob/6de88d4f5d291269f82e3dd1aa39cee026725dfe/src/jit/hwintrinsic.cpp#L210

Previously discussed solutions included using a hashtable and/or a fast binary preselection, exploiting the fixed nature of the search terms.
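The hashtable idea from the discussion above could be sketched roughly as follows. This is a hypothetical illustration, not the JIT's actual code: the enum values and the two-entry table are made up, and the real lookup table would be generated from the JIT's intrinsic list headers rather than hand-written.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical intrinsic ids; the real JIT uses the NamedIntrinsic enum.
enum NamedIntrinsic { NI_Illegal = 0, NI_SSE_Add, NI_AVX2_Xor };

// Build the table once; the set of intrinsic names is fixed at compile
// time, so every subsequent lookup is O(1) on average instead of a
// linear scan over the whole intrinsic array.
static const std::unordered_map<std::string, NamedIntrinsic>& intrinsicTable()
{
    static const std::unordered_map<std::string, NamedIntrinsic> table = {
        { "SSE.Add",  NI_SSE_Add  },
        { "AVX2.Xor", NI_AVX2_Xor },
    };
    return table;
}

// Hypothetical stand-in for the lookupId function linked above.
NamedIntrinsic lookupId(const char* className, const char* methodName)
{
    std::string key = std::string(className) + "." + methodName;
    auto it = intrinsicTable().find(key);
    return (it == intrinsicTable().end()) ? NI_Illegal : it->second;
}
```

Since the key set is fixed, a generated perfect hash (e.g. via a tool like gperf) could push the constant overhead even lower than a general-purpose hashtable.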

@AndyAyersMS @CarolEidt @fiigii @tannergooding

category:implementation theme:vector-codegen skill-level:beginner cost:small impact:medium

EgorBo commented 4 years ago

Just a note: in Mono we use binary search for S.R.I. intrinsics (but we have a small subset), O(1) lookup would be better I guess.
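For comparison, a binary search like the one Mono uses amounts to keeping the descriptor table sorted by name and bisecting with strcmp; O(log n) with no extra data structure. A minimal sketch (the three-entry table is hypothetical, not Mono's actual descriptor list):

```cpp
#include <cassert>
#include <cstring>

// Hypothetical descriptor table, kept sorted by name so binary search works.
struct IntrinsicDesc { const char* name; int id; };

static const IntrinsicDesc sortedIntrinsics[] = {
    { "Add",      1 },
    { "Multiply", 2 },
    { "Xor",      3 },
};

// Returns the id for `name`, or 0 if it is not an intrinsic.
int lookupByBinarySearch(const char* name)
{
    int lo = 0;
    int hi = (int)(sizeof(sortedIntrinsics) / sizeof(sortedIntrinsics[0])) - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, sortedIntrinsics[mid].name);
        if (cmp == 0)
            return sortedIntrinsics[mid].id;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return 0;
}
```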

4creators commented 4 years ago

@jkotas @CarolEidt

Do you think there is a chance to have it implemented and shipped with v3.1 and eventually backported to v3.0? The current performance of JIT with HW intrinsics is ... a bit disappointing - one can feel it when using cloud functions for image processing. The alternative would be to have crossgen support intrinsics for R2R compilation with preset CPU architecture targets, shipped with v3.1.

jkotas commented 4 years ago

So the next steps would be:

CarolEidt commented 4 years ago

It is definitely worth improving the lookup for the intrinsics. That said, I'm not sure how that prioritizes against the other work that's being done in the JIT.

Supporting intrinsics in crossgen is also a reasonable thing to do (and has already been done to a limited extent for SPC.dll in dotnet/coreclr#24689). dotnet/runtime#11689 captures the remaining issue(s).

jkotas commented 4 years ago

I'm not sure how that prioritizes against the other work that's being done in the JIT.

That's where data about the impact would come in handy.

AndyAyersMS commented 4 years ago

Looks like we still don't have data measuring the actual impact of this lookup on jit throughput. I'll see if I can come up with something.

tannergooding commented 4 years ago

I would suspect the worst case is AVX2.Xor which is going to end up doing 371 checks for if isa == hwIntrinsicInfoArray[i].isa: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.cpp#L285-L335 It will then do 66 more checks where the ISA matches but the name doesn't, each of which involves a strcmp call.

We could simplify a good bit of this by removing the isa checks, which could simply be a small table that returns the first/last NamedIntrinsic for a given ISA. That would take it down to just the strcmp calls at least.
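The per-ISA range table described above could look something like this sketch. The enum values, ranges, and four-entry name table are hypothetical placeholders, not the JIT's real generated tables:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical ISA ids; the real JIT has an InstructionSet enum per target.
enum InstructionSet { ISA_SSE = 0, ISA_AVX2 = 1, ISA_COUNT };

// Inclusive [first, last] indices into the intrinsic name table for each
// ISA. One array read replaces the long run of `isa == ...` comparisons,
// leaving only the strcmp calls over that ISA's own entries.
struct Range { int first; int last; };

static const Range isaRange[ISA_COUNT] = {
    { 0, 1 }, // SSE entries
    { 2, 3 }, // AVX2 entries
};

static const char* const intrinsicNames[] = { "Add", "Multiply", "And", "Xor" };

// Returns the intrinsic's index, or -1 if this ISA has no such method.
int lookupInIsaRange(InstructionSet isa, const char* methodName)
{
    Range r = isaRange[isa];
    for (int i = r.first; i <= r.last; i++)
    {
        if (strcmp(methodName, intrinsicNames[i]) == 0)
            return i;
    }
    return -1;
}
```

This keeps the table generation trivial (the entries are already grouped by ISA) while cutting the worst case from "every entry in the array" down to "entries belonging to one ISA".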

We could then probably optimize the string checks a bit, but that would be more complex.

AndyAyersMS commented 4 years ago

Probably not a great test, but on Avx2_ro, we spend 2232 ms jitting, and somewhere around 2 ms in lookupId.

tannergooding commented 4 years ago

We also have the src/coreclr/tests/src/JIT/Performance/CodeQuality/HWIntrinsic/X86/PacketTracer benchmark if we want to test a slightly more "real world" example of intrinsic usage.

However, it sounds like there isn't a huge penalty for it overall and while we could speed it up, it isn't likely to make a noticeable impact.

Was there anything else from the HWIntrinsic-specific code paths that was taking a large amount of time? (I think import, codegen, and emit are the biggest; everything else is largely shared.)

AndyAyersMS commented 4 years ago

Nothing really jumps out by exclusive profile (under jitNativeCode). [profiler screenshot omitted] Inclusive: codegen is 17%, importer 14%, morph 10%, linear scan 9%, ...

AndyAyersMS commented 4 years ago

I'm going to move this to future as it doesn't seem urgent to address now.