WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

LLVM Cost Tuning #404

Open omnisip opened 3 years ago

omnisip commented 3 years ago

Hi everyone,

I'm creating this ticket to keep compiler-level Common Subexpression Elimination (#403) separate from LLVM cost tuning -- a common annoyance, but one that is quite different from how we optimize the implementation of each individual instruction.

@tlively is suggesting we come up with a principled methodology for how we do this, so here's what I'm thinking.

Let's create a Google Sheet with every instruction we have listed on the left-hand side. At the top of each column, we can put everyone's name and/or their GitHub handle. Then in each row, they can put the reciprocal-throughput cost they think should apply to each instruction.

As an example:

| Instruction | Dan Weber (@omnisip) |
| --- | --- |
| v128.load | 0.25 |
| v128.load8_splat | 0.75 |
| v128.load16_splat | 0.75 |
| ... | ... |

Then if anyone has specific comments about why they scored or costed something a certain way, they can highlight the cell and add a sheet comment. For instance, I'm going to calculate the cost of my load operation based on what I would expect it to cost on x64 and ARM64 with a weighted blend. Since x64 affects me more because I use it more than ARM for WASM SIMD, I'll weight x64 at 75% and ARM64 at 25% and take the weighted average.
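
As a rough sketch of what that blend could look like (the instruction names are real, but the per-platform numbers are placeholders, not measured values):

```python
# Weighted blend of per-platform reciprocal-throughput costs.
# All numbers are illustrative placeholders, not measurements.

X64_WEIGHT = 0.75
ARM64_WEIGHT = 0.25

per_platform_cost = {
    "v128.load":   {"x64": 0.25, "arm64": 0.50},
    "i8x16.shr_s": {"x64": 2.00, "arm64": 0.50},
}

def blended_cost(instr: str) -> float:
    costs = per_platform_cost[instr]
    return X64_WEIGHT * costs["x64"] + ARM64_WEIGHT * costs["arm64"]

for instr in per_platform_cost:
    print(f"{instr}: {blended_cost(instr):.2f}")
```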

As a side effect, I'll put a higher cost on i8x16.shr_s and i8x16.shr_u because no native instruction exists on x64 and emulating their functionality requires a long sequence of instructions.
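
One illustrative way to account for that in a score (the lowering steps and numbers below are placeholders, not LLVM's actual i8x16.shr_s lowering) is to cost an emulated operation as the sum of the costs of the instructions in its emulation sequence:

```python
# Cost an emulated operation as the sum of its lowering steps.
# The steps and numbers are placeholders, not the real lowering.

native_shift_cost = 0.5  # hypothetical cost of a single native shift

emulated_lowering = [
    ("widen low bytes to 16-bit lanes", 1.0),
    ("widen high bytes to 16-bit lanes", 1.0),
    ("arithmetic shift of 16-bit lanes", 0.5),
    ("narrow back to bytes", 1.0),
]

emulated_cost = sum(cost for _, cost in emulated_lowering)
print(f"native ~{native_shift_cost}, emulated ~{emulated_cost}")
```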

If people are okay with this idea -- or have better ideas -- by all means, put them here. It might even be possible for us to have more than one cost model for Wasm in clang if we know that our audience skews toward a certain device or architecture type.

Updated:

Here is the Google Sheet for posting your proposed model: https://docs.google.com/spreadsheets/d/1dYXjm40a2XzhjLGchlcMSdibSRvCvxUEBLHVyUc-Ivo/edit?usp=sharing

Focus first on scoring strategy and methodology, and then on individual instructions. If your proposed methodology is easy enough to calculate, it's possible someone else can help fill in the instruction costs.

omnisip commented 3 years ago

Also -- if this task becomes a little too tedious -- it might be worth spending 15 minutes on each call (#402) until we've covered the entire instruction set. Unlike CSE, this work is much more subject to personal interests and workloads -- LLVM seems to do a pretty good job right up until it does a bad one. It seems reasonable to expect that we'll agree on weightings for all of the shuffle/swizzle variants that are natively supported on our target platforms -- if only so LLVM doesn't mess them up anymore.

tlively commented 3 years ago

> For instance, I'm going to calculate the cost of my load operation based on what I would expect it to cost on x64 and ARM64 with a weighted blend.

Rather than having everyone supply the costs they think would be reasonable, we should start out by agreeing on how this blending should work. Then we can apply the same blending principle to each instruction without having to separately achieve consensus on a cost for each one.

I propose that LLVM should use the maximum of the x64 and ARM64 costs for each instruction. That way it will prefer code sequences that are ok on both platforms over code sequences that are better on one platform but worse on the other. If that sounds reasonable, the remaining question would be how to generate costs for x64 and ARM64 lowerings of each instruction.
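
A minimal sketch of that max rule, again with placeholder numbers rather than measured costs:

```python
# Platform-agnostic cost = max of the per-platform costs, so a sequence that
# is bad on either platform gets penalized. Numbers are placeholders.

per_platform_cost = {
    "v128.load":   {"x64": 0.25, "arm64": 0.50},
    "i8x16.shr_s": {"x64": 2.00, "arm64": 0.50},
}

def agnostic_cost(instr: str) -> float:
    return max(per_platform_cost[instr].values())

assert agnostic_cost("v128.load") == 0.50
assert agnostic_cost("i8x16.shr_s") == 2.00
```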

omnisip commented 3 years ago

@tlively your comments aren't lost on me -- if we get stuck deliberating each individual instruction -- this will never get done.

On the flip side, forcing a specific approach to scoring on everyone (e.g. blending -- I was using it as an example, not a requirement) may not yield good results. Lastly, and perhaps most importantly, we're not stuck with having just one cost model -- just one default. We can implement more than one cost-model and likely have them incorporated into LLVM. These would simply be the equivalent of different -mcpu/-mtune settings based on someone's submission. The only gotcha is that everyone who wants their cost model submitted will have to understand that they're responsible for maintaining it.

In the meantime, it's probably not unrealistic to ask the LLVM folks to let us submit more than one merely for testing and evaluation purposes. They understand that we don't know what the underlying architecture is, and they may have good insight into how we should select a result.

To clarify my earlier suggestion and address some of your concerns, I've put together a Google Sheet that still goes instruction by instruction but focuses first on scoring strategy rather than individual instruction performance. It seems most appropriate to debate the strategy first, before the individual instructions, and to allow people to highlight specific instructions that are worrisome across models. This should keep the debate on the cost models more focused.

The link to the Google Sheet for additions is here:

https://docs.google.com/spreadsheets/d/1dYXjm40a2XzhjLGchlcMSdibSRvCvxUEBLHVyUc-Ivo/edit?usp=sharing

tlively commented 3 years ago

We certainly need a default cost model that tries to be agnostic to the underlying platform, so that's what we should think about first.

I'm not yet convinced that it is a good idea to have alternative platform-specific cost models as well. Having those would run counter to our philosophy of trying to present a portable abstraction over native SIMD instruction sets. If I become convinced that platform-specific cost models are a good idea, I would be happier if we had them for both ARM and Intel instruction sets rather than just one or the other.

I would certainly accept any experimental cost models in upstream LLVM as long as they have a clear experimental purpose and as long as the patch author intends to remove them once the experiments are done. For non-experimental cost models, I would want to get the blessing of the full SIMD group before committing to anything.

omnisip commented 3 years ago

Do you think we'll be able to arrive at a new replacement default model without testing and benchmarking?

tlively commented 3 years ago

Testing and benchmarking will probably be useful for coming up with the relative costs of instructions on individual platforms, but I think we should be able to come up with a strategy for combining platform-specific costs into platform-agnostic costs without any benchmarking. That strategy is based more on our high-level goals for the cost model.

aardappel commented 3 years ago

Why not just work on estimating two columns, one for x64 and one for arm64, then average them later?

tlively commented 3 years ago

We should definitely get x64 and arm64 costs and combine them later. It would be good to choose how we combine them in a principled manner, though, rather than arbitrarily choosing to take the average or min or max or whatever.
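
A small sketch of why the choice matters: with placeholder numbers, the combiner can flip which of two alternative lowerings the compiler prefers.

```python
# Two hypothetical lowerings for the same operation; costs are placeholders.
# Under "avg" the specialized sequence wins; under "max" the balanced one does.

candidates = {
    "balanced":    {"x64": 1.0, "arm64": 1.0},  # ok everywhere
    "specialized": {"x64": 0.2, "arm64": 1.6},  # great on x64, poor on arm64
}

def combine(costs: dict, how: str) -> float:
    vals = list(costs.values())
    return sum(vals) / len(vals) if how == "avg" else max(vals)

for how in ("avg", "max"):
    pick = min(candidates, key=lambda name: combine(candidates[name], how))
    print(how, "->", pick)  # avg -> specialized, max -> balanced
```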

Maratyszcza commented 3 years ago

I would like to note that there is no such thing as "the x86-64 instruction cost" or "the ARM64 instruction cost": each microarchitecture has its own throughput and latency. So if the plan is to blend instruction characteristics across several processors, we should start by selecting the relevant microarchitectures first.
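
A sketch of what that could look like: per-microarchitecture numbers reduced to a single per-ISA cost before any cross-ISA blending (the microarchitecture names are real families, but every number is a placeholder, not a measurement):

```python
# Per-microarchitecture reciprocal throughputs for one hypothetical
# instruction; all numbers are placeholders, not measurements.

uarch_cost = {
    "x64":   {"haswell": 1.0, "skylake": 1.0, "zen2": 1.0},
    "arm64": {"cortex-a53": 2.0, "cortex-a76": 1.0, "apple-m1": 0.5},
}

def per_isa_cost(isa: str, reduce_fn=max) -> float:
    # How to collapse microarchitectures within one ISA is itself a policy
    # choice: max is pessimistic, min is optimistic, a weighted mean would
    # need device-share data.
    return reduce_fn(uarch_cost[isa].values())

print(per_isa_cost("x64"), per_isa_cost("arm64"))
```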

penzn commented 3 years ago

> I propose that LLVM should use the maximum of the x64 and ARM64 costs for each instruction. That way it will prefer code sequences that are ok on both platforms over code sequences that are better on one platform but worse on the other. If that sounds reasonable, the remaining question would be how to generate costs for x64 and ARM64 lowerings of each instruction.

I second that - the "least common denominator" approach makes sense, since we don't know what the code will be running on.

The main wrinkle is that a cost model is not just a mapping from instructions to numeric values - there are layers of logic in various parts of the compiler (ABI handling, the vectorizer, instruction selection), including fairly fine-grained relative cost calculations in some cases. This logic is organized differently for different targets, with different "questions" being asked and answered. For example, compare llvm/lib/Target/X86/X86TargetTransformInfo.h and llvm/lib/Target/ARM/ARMTargetTransformInfo.h.
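
To illustrate the point with a toy sketch (this only loosely mirrors the shape of a query-style cost model; the method names and numbers are invented, not LLVM's API): the compiler asks the cost model structured questions whose answers depend on the arguments, not just a flat per-instruction number.

```python
class ToyCostModel:
    """Illustrative only: the hooks and numbers here are invented."""

    def arithmetic_cost(self, opcode: str) -> float:
        # A flat table is enough for simple cases...
        return {"add": 1.0, "mul": 2.0}.get(opcode, 1.0)

    def shuffle_cost(self, mask: tuple) -> float:
        # ...but other queries depend on their arguments: an identity shuffle
        # is free, a broadcast is cheap, an arbitrary permutation is not.
        if mask == tuple(range(len(mask))):
            return 0.0
        if len(set(mask)) == 1:
            return 1.0
        return 3.0

model = ToyCostModel()
print(model.shuffle_cost((0, 1, 2, 3)))  # identity -> 0.0
print(model.shuffle_cost((0, 0, 0, 0)))  # broadcast -> 1.0
print(model.shuffle_cost((3, 1, 2, 0)))  # arbitrary -> 3.0
```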

I don't think we would be able to simply merge X86 and Arm cost models - there are lots of moving parts, but we can probably start with trying to address the things we know about - shuffle masks, comparisons, and so on.

> We can implement more than one cost-model and likely have them incorporated into LLVM. These would simply be the equivalent of different -mcpu/-mtune settings based on someone's submission.

Wouldn't this mean that somebody could post a module online that works well only on a particular device (or class of devices)?

omnisip commented 3 years ago

> We can implement more than one cost-model and likely have them incorporated into LLVM. These would simply be the equivalent of different -mcpu/-mtune settings based on someone's submission.

> Wouldn't this mean that somebody could post a module online that works well only on a particular device (or class of devices)?

My comment was primarily about allowing competitive evaluation of the models and about making sure we don't end up with something that yields worse optimizations than what LLVM currently produces.