0xADE1A1DE / CryptOpt

CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives
https://0xade1a1de.github.io/CryptOpt/
Apache License 2.0

optimizing register allocation #199

Open andres-erbsen opened 1 year ago

andres-erbsen commented 1 year ago

CryptOpt code contains more spills than comparable handwritten code, and sometimes more than compiler-generated code. It is not clear whether this matters for performance, but I do have an implementation where CryptOpt does not beat naive handwritten assembly, and it exhibits this pattern.

What if register allocation first identified uses that can be unspilled using x86 memory operands, "allocated" temporaries for which all uses are of this form, and then allocated the remaining variables into registers based on the remaining uses? In particular, I believe this generic heuristic would do a decent job on the adx-multiplication pattern of mulx a0 b0; mulx a0 b1; mulx a0 b2; ...; mulx a1 b0; ...: all b can stay in memory and only one a needs to be live at a time. This suggestion is based on the hypothesis that a memory-operand access is cheaper than a separate load, even if that load is amortized over multiple instructions; I am not sure whether this is always true.
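To make the two-pass idea concrete, here is a sketch in TypeScript (since that is what CryptOpt is written in); all names here are hypothetical illustrations, not CryptOpt's actual API. The first pass keeps every temporary whose uses can all be folded into memory operands spilled, so only the rest compete for registers:

```typescript
// Hypothetical model: each use of a temporary either can or cannot be
// encoded as an x86 memory operand (e.g. mulx hi, lo, [rsp + 0x10]).
type Use = { op: string; canBeMemOperand: boolean };
type Temp = { name: string; uses: Use[] };

// First pass: temporaries whose every use accepts a memory operand stay
// in memory; only the remainder need registers in the second pass.
function partitionForAllocation(temps: Temp[]) {
  const stayInMemory: string[] = [];
  const needRegister: string[] = [];
  for (const t of temps) {
    if (t.uses.length > 0 && t.uses.every((u) => u.canBeMemOperand)) {
      stayInMemory.push(t.name);
    } else {
      needRegister.push(t.name);
    }
  }
  return { stayInMemory, needRegister };
}
```

In the adx-multiplication pattern above, every b would land in stayInMemory (all its uses are mulx source operands), while each a needs a register because it must sit in rdx.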

Another tweak to consider would be spilling and restoring callee-saved registers using push/pop at the very beginning and end of the function.

Have you tried something like these options already?

dderjoel commented 1 year ago

I like the ideas. It is implemented the way it is now because it has the fewest special cases (I believe), especially for the push/pop part. The virtual CPU inside the Register Allocator initialises the callee-saved registers with values like 'calSv-rbx'. Then, if spilling needs to occur, those are just handled like any other values to spill (i.e. they can be spilled with a normal mov). Note that those don't always need to be spilled, e.g. for poly1305-sq, a very short method that can be implemented without using any callee-saved regs.

I've used push/pop in earlier versions, where I just pushed everything at the start and popped everything just before ret. I've not seen any performance difference between push/pop and movs... In theory a mov should be cheaper, because rsp is untouched. Looking at https://uops.info/table.html, that does not seem to matter too much, as pop only issues one µ-op. However, I think there may still be benefits to not changing rsp (I'm thinking of out-of-order loads, store-to-load forwarding, cache predictions, better register renaming, ...), but that's all just guesses at this point. If one wanted to re-introduce push/pop behaviour, that shouldn't be too much work; one would just need to make sure that rsp is tracked internally with respect to the red zone and the spilling of other variables.

The spilling happens here. When spilling, calSv-rbx is preferred over anything else; then we try to spill constants (i.e. immediate values like 0x...); then arg1[n], because those can be re-read from memory at any time; and only then do we choose the spill value based on the order in which the RA called into Model, where Model chooses the value with the highest index.
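Rendered as code, that preference order might look like the following (a sketch only, not the actual implementation; chooseSpill, its input shape, and the name prefixes are assumptions based on the comments in the generated asm):

```typescript
// A spill candidate: its symbolic name and the distance (in operations)
// to the next time it is read.
type Candidate = { name: string; firstReadDistance: number };

// Preference order as described: callee-saved placeholders first, then
// immediates, then arg1[n] (all three are cheap to restore), and among
// ordinary values the one whose next read is furthest away.
function chooseSpill(cands: Candidate[]): string {
  const cls = (n: string) =>
    n.startsWith("calSv-") ? 0 :
    n.startsWith("0x") ? 1 :
    /^arg1\[\d+\]$/.test(n) ? 2 : 3;
  const sorted = [...cands].sort(
    (a, b) => cls(a.name) - cls(b.name) || b.firstReadDistance - a.firstReadDistance,
  );
  return sorted[0].name;
}
```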

So let me rephrase what I understand from your earlier paragraph: the heuristic would be to prefer spilling variables that are used by instructions that can read from memory.

Example:

Choose one of x1, x2, x3, x4 to spill, where the rest of the program looks like this:

x10 = x2 * x1;
x11 = x3 + x4;

In the current situation, CryptOpt would choose x3 or x4, because they are read later. And with the heuristic that you propose, it should spill x1, because it can be read from memory with mulx -,-,[x1]?

So essentially, beyond only checking when a spillable value is used, add information on how a value is used. I think it gets a bit more complicated than this, because which effect should be weighted how? Say x1 is read in the next 3 operations as a factor in a product (raising the likelihood of spilling), but also past that, in the next 4 operations, as a summand in a sum (lowering the likelihood of spilling). The example is a bit abstract, but I just wonder if there is an easy approach to this.

Maybe you can give your code as an example.

andres-erbsen commented 1 year ago

I can believe push/pop and mov saves are pretty much interchangeable given how common both are.

I am not sure whether the other idea I proposed can be reasonably approximated as a spill-selection model, but perhaps we can do decently well if the model has access to the current register allocation right before the spill. Specifically, I wonder whether it would work to rank all spillable registers by a handwavy estimate of how many instructions of reloads they will require later.

Here is a sketch of a heuristic, recursing forward on the remaining program and considering every operation:

The specific behavior I am looking to achieve is that a0 would be kept in rdx throughout the computation (and chained accumulation) of a0*b0, a0*b1, a0*b2, ... whereas b0 and so on would be accessed as memory operands to mulx. This pattern also appears in the Curve25519 code we compared against; CryptOpt champion implementations can be less clever.

I am not sure whether considering this heuristic during spilling would be enough, though. There's also the question of how a0 would end up in rdx in the first place. And of course instruction scheduling is random by design, but I hope that if register allocation were more often fast on decently scheduled code, we'd get better results anyway.
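One possible reading of that reload-count estimate, purely illustrative (the Op shape and both function names are made up, and the real heuristic the comment sketches may differ): walk forward over the remaining operations and count, per spillable value, the future reads that could not be folded into a memory operand, since each of those would cost a separate load.

```typescript
// Hypothetical IR: each upcoming operation lists the values it reads and
// whether that read position could take an x86 memory operand.
type Op = { reads: { value: string; memOperandOk: boolean }[] };

// Estimate how many reload instructions spilling `value` would cost:
// every future read that cannot be a memory operand needs its own load.
function estimatedReloads(value: string, remaining: Op[]): number {
  let reloads = 0;
  for (const op of remaining) {
    for (const r of op.reads) {
      if (r.value === value && !r.memOperandOk) reloads++;
    }
  }
  return reloads;
}

// Spill the value that is cheapest to keep in memory.
function pickSpill(spillable: string[], remaining: Op[]): string {
  return [...spillable].sort(
    (a, b) => estimatedReloads(a, remaining) - estimatedReloads(b, remaining),
  )[0];
}
```

On the mulx pattern, the b-values score zero (all their reads are mulx memory operands) while a0 scores once per row, so the b-values get spilled and a0 stays in rdx, which is the behavior described above.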

dderjoel commented 1 year ago

I see. I think there are two things that we're mixing up here: first, which value to put into rdx in those situations, and second, which value to spill if need be. Those are two distinct sections in CryptOpt.

First, which value to put into rdx:

Did you change the code in the cryptopt-generated file 'accordingly' and see if it actually is faster? I get your point and I agree that there are some movs that can be eliminated, but they are very close together, and because of this mulx specialty with rdx I feel they don't matter too much. In the end, looking at this stretch, we have four mem * mem operations, which (on a µ-op level) need to load the data from memory four times, regardless of which architectural registers they are in (they'll get renamed anyway).

That being said, which value to load into rdx is already part of the optimizer. So it has the potential to load a0 into rdx for the first mulx and then keep it there for the next mulx. Yes, it could potentially be done 'more statically', but I feel we should only try that if we have evidence that it is worth it, at least on several architectures... (btw, I don't have access to any of the machines anymore, so I couldn't test it)

Re your spilling heuristic: I'm not sure I understand. The model has access to the current register allocation. I look at all values currently held in gp registers, then filter out those that I need in the current operation (because I should not spill values I need at this moment). Then I look forward into the program and sort the remaining registers by when they are first read in any upcoming operation. The value that is spilled is the one with the biggest distance to the current position.

Now I could think of a different ranking system: sort as sketched above, but then change the ranking slightly: if a value is used by an operation that can be implemented with an instruction that can encode a memory operand, it gets +1 added to its position, making it more likely to be spilled.

However, that is easier said than done, because CryptOpt chooses instructions based on where the data lives (mem or reg), and those instructions are selected at the moment an operation is being implemented. So the look-ahead can only 'see' the operations, not the instructions. It makes sense for the mulx example, but I'm not so sure about the other operations.

andres-erbsen commented 1 year ago

Thank you for explaining this circular dependency between instruction selection and register allocation. The model I proposed was substantially different from what you have been describing, and is substantially thwarted by this pass ordering.

That being said, it is already part of the optimizer, which value to load into rdx. So it has the potential to load a0 into rdx in the first mulx and then it would keep it in there for the next mulx.

Hmm, indeed. And I think I have seen cases where this happens for two or three instructions, but never for an entire row.

Here's the code I have been working with: p256.zip

I started from a 56x~1M cryptopt champion and wrangled mulx/adx usage back into the straightforward alternating pattern, tidying up registers on the fly. The simplified code is 5-10% faster depending on usage pattern. And I am pretty sure I am still using one register too many, needlessly spilling rdi, and missing some more clever optimizations.

I agree with your uop-level analysis, but I wonder whether we are overlooking some bottleneck in the instruction chunking or decode phase. I sometimes got the sense that leaving in a redundant instruction made the implementation faster.

andreser@andreser ~ % env CC=clang taskset -c 0 ~/CryptOpt/bins/node/bin/node ~/CryptOpt/dist/CountCycle.js seed3479777052241195_ratio17480.asm ~/p256.json ~/p256.c
mmap (for performance counter) failed: Bad file descriptor
18805.5 21814.5
andreser@andreser ~ % env CC=clang taskset -c 0 ~/CryptOpt/bins/node/bin/node ~/CryptOpt/dist/CountCycle.js seed3479777052241195_ratio99.asm ~/p256.json ~/p256.c
mmap (for performance counter) failed: Bad file descriptor
17705 21810
andreser@andreser ~ % clang -O3 -march=native bench_p256_indep.c p256cryptopt.s && time taskset -c 1 ./a.out
/usr/bin/ld: warning: /tmp/p256cryptopt-c22b24.o: missing .note.GNU-stack section implies executable stack
/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
taskset -c 1 ./a.out  0.57s user 0.00s system 99% cpu 0.571 total
andreser@andreser ~ % clang -O3 -march=native bench_p256_indep.c p256touchup.s && time taskset -c 1 ./a.out
/usr/bin/ld: warning: /tmp/p256touchup-390954.o: missing .note.GNU-stack section implies executable stack
/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
taskset -c 1 ./a.out  0.52s user 0.00s system 99% cpu 0.521 total
andreser@andreser ~ % python
Python 3.11.5 (main, Aug 25 2023, 07:43:52) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> 18805.5/17705
1.0621575826037843
>>> 0.57/0.52
1.096153846153846

No proofs yet, just passes tests.

andres-erbsen commented 1 year ago

Correction: the above cycle counts were not measured on Skylake (on which the cryptopt optimization ran); here are actual Skylake numbers:

19958.75 24510.5 results/manual/p256_mul/seed3479777052241195_ratio99.asm
21725 24752 results/manual/p256_mul/seed3479777052241195_ratio17480.asm

dderjoel commented 1 year ago

I mean, this register allocation trick would be rather easy to implement, if we'd only focus on mulx for now.

Let me pseudocode this real quick.

implementing operation `o`:
    // now choose o.a1 or o.a2 to load into rdx.
    // initialize counters for both of those args:
    a1 = 0;
    a2 = 0;
    check all upcoming `mulx` operations `o'` for:
        if `o'` also has `o.a1` as an argument, increment `a1`
        if `o'` also has `o.a2` as an argument, increment `a2`
    If a1 == a2, choose randomly; otherwise choose the argument with the higher count.

Do you think that could work? I wonder if the distance to the current instruction should be taken into account. I've sketched it in branch feature/reg199. If you build with DEBUG=1 make and run with --verbose (probably use --eval 100 or something smaller, too) you should see comments in the asm file that start with ; chooseMulxLoadValue upcoming mulxs:
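In TypeScript, the pseudocode comes out to roughly the following (MulxOp is a stand-in for the real operation type; per the generated comments, a tie returns null so the randomized search keeps the choice):

```typescript
type MulxOp = { name: string[]; arguments: string[] };

// Decide which of the two arguments of the current mulx to load into rdx:
// the one that more of the upcoming mulx operations also use.
function chooseMulxLoadValue(a1: string, a2: string, upcoming: MulxOp[]): string | null {
  let c1 = 0;
  let c2 = 0;
  for (const op of upcoming) {
    if (op.arguments.includes(a1)) c1++;
    if (op.arguments.includes(a2)) c2++;
  }
  if (c1 === c2) return null; // tie: caller chooses randomly
  return c1 > c2 ? a1 : a2;
}
```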

The implementation seems to do the job; see below for a snippet from my current file, after just a handful of evaluations ($ ./CryptOpt --bridge manual --cFile /tmp/p256.c --jsonFile /tmp/p256.json --seed 2 --verbose --evals 100 --single)

SECTION .text
    GLOBAL p256_mul
p256_mul:
sub rsp, 264
; rdi contains out1
; rsi contains arg1
; rdx contains arg2

; Operation: x1--x2<-mulx(arg1[0], arg2[0]), 
; chooseMulxLoadValue upcoming mulxs: x1-x2, x41-x42, x99-x100, x32-x33, x7-x8, x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[0]":4,"arg2[0]":4} only two candidates, and they both have the same count value of 4. returning null.
;-- allocation: , calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg2 rdx, arg2[0] rdx, arg1 rsi
; fr: rax,r10,r11,rcx,r8,r9,rcx,r8,r9
mov rax, rdx; preserving value of arg2 into a new reg
mov rdx, [ rdx + 0x0 ]; saving arg2[0] in rdx.
; currently considered reusable: 
;-- allocation: , calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg2[0] rdx, arg1 rsi
; fr: r10,r11,rcx,r8,r9,rcx,r8,r9
;-- allocation: , x1 r10, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg2[0] rdx, arg1 rsi
; fr: r11,rcx,r8,r9,rcx,r8,r9
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg2[0] rdx, arg1 rsi
mulx r11, r10, [ rsi + 0x0 ]; hix2, lox1<- arg1[0] * arg2[0]

; Operation: x41--x42<-mulx(0x100000000, x1), 
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg1 rsi
; fr: rdx,rcx,r8,r9,rcx,r8,r9
mov rdx, 0x100000000 ; moving imm to reg
; currently considered reusable: 
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, 0x100000000 rdx, arg1 rsi
; fr: rcx,r8,r9,rcx,r8,r9
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, x41 rcx, out1 rdi, 0x100000000 rdx, arg1 rsi
; fr: r8,r9,r8,r9
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, x41 rcx, out1 rdi, 0x100000000 rdx, arg1 rsi
mulx r8, rcx, r10; hix42, lox41<- 0x100000000 * x1

; Operation: x99--x100<-mulx(arg1[3], arg2[1]), 
; chooseMulxLoadValue upcoming mulxs: x99-x100, x32-x33, x7-x8, x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[3]":4,"arg2[1]":4} only two candidates, and they both have the same count value of 4. returning null.
mov rdx, [ rax + 0x8 ]; arg2[1] to rdx
; currently considered reusable: 
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, x41 rcx, out1 rdi, arg2[1] rdx, arg1 rsi
; fr: r9,r9
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, x41 rcx, out1 rdi, arg2[1] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-rbx, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-rbx(rbx) ,calSv-rbp(rbp) ,calSv-r12(r12) ,calSv-r13(r13) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,arg2[1](rdx) ,x99(r9)
; clobs "x99, x100, arg1[3], arg1, arg2[1], arg2, x99_0, x99_1, x100_0, x100_1"; will spare: calSv-rbx 
; stacking up for calSv-rbx
mov [ rsp - 0x80 ], rbx; spilling calSv-rbx to mem
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, calSv-rbp rbp, x100 rbx, x41 rcx, out1 rdi, arg2[1] rdx, arg1 rsi
mulx rbx, r9, [ rsi + 0x18 ]; hix100, lox99<- arg1[3] * arg2[1]

; Operation: x32--x33<-mulx(arg1[1], arg2[3]), 
; chooseMulxLoadValue upcoming mulxs: x32-x33, x7-x8, x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[1]":4,"arg2[3]":4} only two candidates, and they both have the same count value of 4. returning null.
mov rdx, [ rax + 0x18 ]; arg2[3] to rdx
; currently considered reusable: 
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, calSv-rbp rbp, x100 rbx, x41 rcx, out1 rdi, arg2[3] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-rbp, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-rbp(rbp) ,calSv-r12(r12) ,calSv-r13(r13) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,arg2[3](rdx)
; clobs "x32, x33, arg1[1], arg1, arg2[3], arg2, x32_0, x32_1, x33_0, x33_1"; will spare: calSv-rbp 
; stacking up for calSv-rbp
mov [ rsp - 0x78 ], rbp; spilling calSv-rbp to mem
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[3] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-r12, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-r12(r12) ,calSv-r13(r13) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,arg2[3](rdx) ,x32(rbp)
; clobs "x32, x33, arg1[1], arg1, arg2[3], arg2, x32_0, x32_1, x33_0, x33_1"; will spare: calSv-r12 
; stacking up for calSv-r12
mov [ rsp - 0x70 ], r12; spilling calSv-r12 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[3] rdx, arg1 rsi
mulx r12, rbp, [ rsi + 0x8 ]; hix33, lox32<- arg1[1] * arg2[3]

; Operation: x7--x8<-mulx(arg1[0], arg2[2]), 
; chooseMulxLoadValue upcoming mulxs: x7-x8, x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[0]":3,"arg2[2]":4} choosing arg2[2] because of its higher count value of 4
mov rdx, [ rax + 0x10 ]; arg2[2] to rdx
; currently considered reusable: 
;-- allocation: , x1 r10, x2 r11, x33 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-r13, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-r13(r13) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,arg2[2](rdx)
; clobs "x7, x8, arg1[0], arg1, arg2[2], arg2, x7_0, x7_1, x8_0, x8_1"; will spare: calSv-r13 
; stacking up for calSv-r13
mov [ rsp - 0x68 ], r13; spilling calSv-r13 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-r14, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,arg2[2](rdx) ,x7(r13)
; clobs "x7, x8, arg1[0], arg1, arg2[2], arg2, x7_0, x7_1, x8_0, x8_1"; will spare: calSv-r14 
; stacking up for calSv-r14
mov [ rsp - 0x60 ], r14; spilling calSv-r14 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[2] rdx, arg1 rsi
mulx r14, r13, [ rsi + 0x0 ]; hix8, lox7<- arg1[0] * arg2[2]

; Operation: x68--x69<-mulx(arg1[2], arg2[0]), 
; chooseMulxLoadValue upcoming mulxs: x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[2]":4,"arg2[0]":3} choosing arg1[2] because of its higher count value of 4
mov rdx, [ rsi + 0x10 ]; arg1[2] to rdx
; currently considered reusable: 
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg1[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-r15, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,x7(r13) ,x8(r14) ,arg1[2](rdx)
; clobs "x68, x69, arg1[2], arg1, arg2[0], arg2, x68_0, x68_1, x69_0, x69_1"; will spare: calSv-r15 
; stacking up for calSv-r15
mov [ rsp - 0x58 ], r15; spilling calSv-r15 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg1[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling out1, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,x7(r13) ,x8(r14) ,arg1[2](rdx) ,x68(r15)
; clobs "x68, x69, arg1[2], arg1, arg2[0], arg2, x68_0, x68_1, x69_0, x69_1"; will spare: out1 
; stacking up for out1
mov [ rsp - 0x50 ], rdi; spilling out1 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, x69 rdi, arg1[2] rdx, arg1 rsi
mulx rdi, r15, [ rax + 0x0 ]; hix69, lox68<- arg1[2] * arg2[0]

; Operation: x3--x4<-mulx(arg1[0], arg2[1]), 
; chooseMulxLoadValue upcoming mulxs: x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[0]":2,"arg2[1]":3} choosing arg2[1] because of its higher count value of 3
mov rdx, [ rax + 0x8 ]; arg2[1] to rdx
; currently considered reusable: 
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, x69 rdi, arg2[1] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x100, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,x7(r13) ,x8(r14) ,x68(r15) ,x69(rdi) ,arg2[1](rdx)
; clobs "x3, x4, arg1[0], arg1, arg2[1], arg2, x3_0, x3_1, x4_0, x4_1"; will spare: x100 
; stacking up for x100
mov [ rsp - 0x48 ], rbx; spilling x100 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x3 rbx, x41 rcx, x69 rdi, arg2[1] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x99, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x32(rbp) ,x33(r12) ,x7(r13) ,x8(r14) ,x68(r15) ,x69(rdi) ,arg2[1](rdx) ,x3(rbx)
; clobs "x3, x4, arg1[0], arg1, arg2[1], arg2, x3_0, x3_1, x4_0, x4_1"; will spare: x99 
; stacking up for x99
mov [ rsp - 0x40 ], r9; spilling x99 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x4 r9, arg2 rax, x32 rbp, x3 rbx, x41 rcx, x69 rdi, arg2[1] rdx, arg1 rsi
mulx r9, rbx, [ rsi + 0x0 ]; hix4, lox3<- arg1[0] * arg2[1]
andres-erbsen commented 1 year ago

Thanks, looks good at first sight. I kicked off a new run.

andres-erbsen commented 1 year ago

This new optimization closed most of the gap. I tried to refine it further, but hit a snag.

commit dbc4593a4dba42b6e2bf86a52b24f0b29a8025b8 (HEAD)
Author: Andres Erbsen <andres-github@andres.systems>
Date:   Fri Sep 15 18:01:24 2023 +0000

    tweak mulx load value selection

diff --git a/src/model/model.class.ts b/src/model/model.class.ts
index b9e1d7b..f723158 100644
--- a/src/model/model.class.ts
+++ b/src/model/model.class.ts
@@ -219,7 +219,7 @@ export class Model {
     }

     // Idea is:
-    // for each candidate, count how many times it is used as an argument in the upcoming mulx operations
+    // for each candidate, count how many consecutive upcoming mulx operations use it

     // initalise counters:
     const counters: { [candidateName: string]: number } = candidates.reduce(
@@ -231,11 +231,11 @@ export class Model {
       .filter((op) => op.operation == "mulx");
     msg += ` upcoming mulxs: ${upcomingMulxOps.map((m) => m.name.join("-")).join(", ")}`;

-    upcomingMulxOps.forEach((op) => {
-      op.arguments.forEach((arg) => {
-        if (arg in counters) {
-          counters[arg]++;
-        }
+    candidates.forEach((arg) => {
+      upcomingMulxOps.every((op) => {
+       if (!op.arguments.includes(arg)) { return false; }
+       counters[arg]++;
+       return true;
       });
     });
     msg += ` counters ${JSON.stringify(counters)}`;
diff --git a/src/registerAllocator/RegisterAllocator.class.ts b/src/registerAllocator/RegisterAllocator.class.ts
index 9d604a2..ca784ba 100644
--- a/src/registerAllocator/RegisterAllocator.class.ts
+++ b/src/registerAllocator/RegisterAllocator.class.ts
@@ -593,7 +593,7 @@ export class RegisterAllocator {
       }
     }

-    if (caf(AllocationFlags.ONE_IN_MUST_BE_IN_RDX) && !inAllocations.includes(Register.rdx)) {
+    if (caf(AllocationFlags.ONE_IN_MUST_BE_IN_RDX) && this.getVarnameFromStore({ store: Register.rdx }, false) != "" && !inAllocations.includes(this.getVarnameFromStore({ store: Register.rdx }, true))) {
       // since in inAllocations there is no rdx (otherwise we wont be in this branch)
       // we need to move one of inAllocations to rdx

@@ -751,6 +751,8 @@ export class RegisterAllocator {
         (inA, i) =>
           isRegister(inA) &&
           !this._clobbers.has(this.getVarnameFromStore({ store: inA })) &&
+         // Error: Tried to find the name of data being stored at rdx, but found 0: [] instead of ONE in the current Allocations. TSNH. Abort.
+         // ignoring the error generates incorrect assembly
           !Model.hasDependants(allocationReq.in[i]),
       ) as Register[];
       this.addToPreInstructions(Logger.log(`; currently considered reusable: ${reusable.join(", ")}`) ?? "");

Without the middle change, !inAllocations.includes(Register.rdx) is always true even if inAllocations includes "arg1[0]" and the store of "arg1[0]" is "rdx". Changing this unearthed issues that I do not understand in the following code.

I also encountered the following beauty which is likely also responsible for some missed optimization of mulx/adx chains:

; Operation: x41--x42<-mulx(0x100000000, x1), 
;-- allocation: , x14 OF, x16 r10, x17 r11, x68 r12, x5 r13, x4 r14, x9 r15, x2 r8, x12 r9, arg2 rax, x13 rbp, 0x0 rbx, x51 rcx, x8 rdi, x1 rdx, arg1 rsi
; fr: none.
; freeing x2 (r8) no dependants anymore
mov r8, 0x100000000 ; moving imm to reg
; chooseMulxLoadValue upcoming mulxs: x41-x42, x78-x79, x99-x100, x26-x27, x45-x46, x57-x58, x122-x123, x132-x133, x138-x139, x126-x127 counters {"0x100000000":1,"x1":1} only two candidates, and they both have the same count value of 1. returning null.
mov rdx, rdx; x1 to rdx
; currently considered reusable: rdx    
;-- allocation: , x14 OF, x16 r10, x17 r11, x68 r12, x5 r13, x4 r14, x9 r15, 0x100000000 r8, x12 r9, arg2 rax, x13 rbp, 0x0 rbx, x51 rcx, x8 rdi, x41 rdx, arg1 rsi
; fr: none.
; freeing x8 (rdi) no dependants anymore
;-- allocation: , x14 OF, x16 r10, x17 r11, x68 r12, x5 r13, x4 r14, x9 r15, 0x100000000 r8, x12 r9, arg2 rax, x13 rbp, 0x0 rbx, x51 rcx, x42 rdi, x41 rdx, arg1 rsi
mulx rdi, rdx, r8; hix42, lox41<- 0x100000000 * x1
dderjoel commented 1 year ago

I can't understand what you're trying to do.

andres-erbsen commented 1 year ago

1) I changed the heuristic to only count consecutive mulx operations. This is to ensure that the very first mulx gets the right rdx-load choice even though both lhs and rhs are used by mulx 4 times in total. every is there just to get a forEach loop with a break in JavaScript.

2) I observed that sometimes rdx gets reloaded even though it already holds the right value. I believe that is a bug, specifically in inAllocations.includes(Register.rdx). inAllocations seems to contain variable names, not register names, so checking it for "rdx" is effectively a constant operation. I changed the conditional to what I thought it should be, but then I got crashes (the "Error" comment), and if I made that line not throw, I got bad assembly code.
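The consecutive-count variant from point 1 can be pulled out as a standalone function (an illustrative rendering of the diff, with a made-up name); note how it differs from a total count, since counting stops at the first non-matching mulx:

```typescript
// Count how many *consecutive* upcoming mulx operations use `arg`,
// stopping at the first one that does not. Array.prototype.every acts
// as a forEach with break: returning false ends the iteration.
function consecutiveMulxUses(arg: string, upcoming: { arguments: string[] }[]): number {
  let count = 0;
  upcoming.every((op) => {
    if (!op.arguments.includes(arg)) return false;
    count++;
    return true;
  });
  return count;
}
```

For a0 used in mulx rows 1, 2, and 4 this returns 2, whereas a plain per-occurrence count would return 3, which is exactly what lets the very first mulx of a row win.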

dderjoel commented 1 year ago
  1. I see. Very clever with the every, I must say. I've added some explanation so I can understand it myself next week :)

  2. inAllocations is not the same as allocationReq.in. (See comments here; I admit the naming is not optimal...) If I read my code right, then inAllocations is set here to essentially inAllocationsTemp, which itself is set here. I've double-checked with a console.warn('inAllocations: ' + inAllocations.join("-")); after this, and it actually holds either registers or memory references. E.g.:

    inAllocations: [ rsi + 0x0 ]-[ rax + 0x18 ]
    inAllocations: rbx-r12
    inAllocations: [ rsi + 0x10 ]-[ rax + 0x18 ]
    inAllocations: r9-r11
    inAllocations: [ rsi + 0x0 ]-[ rdx + 0x0 ]
    inAllocations: [ rsi + 0x18 ]-[ rax + 0x8 ]

So if I understand you correctly, with 2. you're trying to avoid the mov rdx, rdx instructions, right? I can't reproduce that one. I've put a

      if (output.some((o) => o.match(/mov rdx, rdx.*/))) {
        throw Error("suspicious mov detected");
      }

after output.push(...ins); here and ran with --bridge manual --cFile /tmp/p256.c --jsonFile /tmp/p256.json --seed 2 --evals 100k --single, but it did not throw at all...

On a very different note: could x41--x42<-mulx(0x100000000, x1) be a shift (shrd) as well?

andres-erbsen commented 1 year ago

with 2. you're trying to avoid the mov rdx, rdx instructions

No, I hadn't started on that yet. The attempt in my change was to avoid redundant loads of in-memory inputs already cached in rdx.

As for the redundant move after the constant loading, I am not sure what code generates it.

Yes, that mulx could be done using shifts. I have not investigated relative performance.
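Concretely: since 0x100000000 is 2^32, the 128-bit product this mulx computes is just x1 shifted left by 32, so the hi/lo halves fall out of two plain shifts on copies of x1 (hi = x1 >> 32, lo = x1 << 32 mod 2^64). A BigInt check of that identity (mulxByTwoPow32 is a made-up name, not part of CryptOpt):

```typescript
// 0x100000000 * x, split into the hi/lo 64-bit halves mulx would produce.
function mulxByTwoPow32(x: bigint): { hi: bigint; lo: bigint } {
  const mask = (1n << 64n) - 1n; // keep the low 64 bits
  return { hi: x >> 32n, lo: (x << 32n) & mask };
}
```

Whether the shift form is actually faster than mulx plus the constant load is a separate question, as noted above.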

andres-erbsen commented 1 year ago

This is the scenario that I was trying to optimize:

env CC=clang ./CryptOpt --bridge manual --cFile ~/p256.c --jsonFile ~/p256.json --verbose --evals 1 --single --seed 34

```asm
; Operation: x99--x100<-mulx(arg1[3], arg2[1]), 
; chooseMulxLoadValue upcoming mulxs: x99-x100, x105-x106, x51-x52, x84-x85, x11-x12, x78-x79, x41-x42, x45-x46, x57-x58, x132-x133, x122-x123, x138-x139, x126-x127 counters {"arg1[3]":2,"arg2[1]":1} choosing arg1[3] because of its higher count value of 2
mov rdx, [ rsi + 0x18 ]; arg1[3] to rdx
; currently considered reusable: 
;-- allocation: , x1 r10, x2 r11, x68 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x69 rbp, x4 rbx, x7 rcx, x16 rdi, arg1[3] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x69, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x7(rcx) ,x8(r8) ,x3(r9) ,x4(rbx) ,x20(r14) ,x21(r13) ,x1(r10) ,x2(r11) ,x16(rdi) ,x17(r15) ,x68(r12) ,x69(rbp) ,arg1[3](rdx)
; clobs "x99, x100, arg1[3], arg1, arg2[1], arg2, x99_0, x99_1, x100_0, x100_1"; will spare: x69 
; stacking up for x69
mov [ rsp + 0x8 ], rbp; spilling x69 to mem
;-- allocation: , x1 r10, x2 r11, x68 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x99 rbp, x4 rbx, x7 rcx, x16 rdi, arg1[3] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x68, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x7(rcx) ,x8(r8) ,x3(r9) ,x4(rbx) ,x20(r14) ,x21(r13) ,x1(r10) ,x2(r11) ,x16(rdi) ,x17(r15) ,x68(r12) ,arg1[3](rdx) ,x99(rbp)
; clobs "x99, x100, arg1[3], arg1, arg2[1], arg2, x99_0, x99_1, x100_0, x100_1"; will spare: x68 
; stacking up for x68
mov [ rsp + 0x10 ], r12; spilling x68 to mem
;-- allocation: , x1 r10, x2 r11, x100 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x99 rbp, x4 rbx, x7 rcx, x16 rdi, arg1[3] rdx, arg1 rsi
mulx r12, rbp, [ rax + 0x8 ]; hix100, lox99<- arg1[3] * arg2[1]

; Operation: x105--x106<-mulx(arg1[3], arg2[2]), 
; chooseMulxLoadValue upcoming mulxs: x105-x106, x51-x52, x84-x85, x11-x12, x78-x79, x41-x42, x45-x46, x57-x58, x132-x133, x122-x123, x138-x139, x126-x127 counters {"arg1[3]":1,"arg2[2]":1} only two candidates, and they both have the same count value of 1. returning null.
mov rdx, [ rax + 0x10 ]; arg2[2] to rdx
; currently considered reusable: 
;-- allocation: , x1 r10, x2 r11, x100 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x99 rbp, x4 rbx, x7 rcx, x16 rdi, arg2[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x100, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x7(rcx) ,x8(r8) ,x3(r9) ,x4(rbx) ,x20(r14) ,x21(r13) ,x1(r10) ,x2(r11) ,x16(rdi) ,x17(r15) ,x99(rbp) ,x100(r12) ,arg2[2](rdx)
; clobs "x105, x106, arg1[3], arg1, arg2[2], arg2, x105_0, x105_1, x106_0, x106_1"; will spare: x100 
; stacking up for x100
mov [ rsp + 0x18 ], r12; spilling x100 to mem
;-- allocation: , x1 r10, x2 r11, x105 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x99 rbp, x4 rbx, x7 rcx, x16 rdi, arg2[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x99, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x7(rcx) ,x8(r8) ,x3(r9) ,x4(rbx) ,x20(r14) ,x21(r13) ,x1(r10) ,x2(r11) ,x16(rdi) ,x17(r15) ,x99(rbp) ,arg2[2](rdx) ,x105(r12)
; clobs "x105, x106, arg1[3], arg1, arg2[2], arg2, x105_0, x105_1, x106_0, x106_1"; will spare: x99 
; stacking up for x99
mov [ rsp + 0x20 ], rbp; spilling x99 to mem
;-- allocation: , x1 r10, x2 r11, x105 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x106 rbp, x4 rbx, x7 rcx, x16 rdi, arg2[2] rdx, arg1 rsi
mulx rbp, r12, [ rsi + 0x18 ]; hix106, lox105<- arg1[3] * arg2[2]
```
version info

```diff
andreser@andreser ~/CryptOpt % git log -1 HEAD~
commit b9608a038c9c6b92d81a4fefcc8e57add6cac778
Merge: b92a58c 31e2494
Author: Joel
Date:   Fri Sep 15 10:02:39 2023 +0930

    Merge branch 'dev' into feature/reg199

andreser@andreser ~/CryptOpt % git diff HEAD~
diff --git a/src/model/model.class.ts b/src/model/model.class.ts
index b9e1d7b..f723158 100644
--- a/src/model/model.class.ts
+++ b/src/model/model.class.ts
@@ -219,7 +219,7 @@ export class Model {
   }

   // Idea is:
-  // for each candidate, count how many times it is used as an argument in the upcoming mulx operations
+  // for each candidate, count how many consecutive upcoming mulx operations use it

   // initalise counters:
   const counters: { [candidateName: string]: number } = candidates.reduce(
@@ -231,11 +231,11 @@
     .filter((op) => op.operation == "mulx");
   msg += ` upcoming mulxs: ${upcomingMulxOps.map((m) => m.name.join("-")).join(", ")}`;

-  upcomingMulxOps.forEach((op) => {
-    op.arguments.forEach((arg) => {
-      if (arg in counters) {
-        counters[arg]++;
-      }
+  candidates.forEach((arg) => {
+    upcomingMulxOps.every((op) => {
+      if (!op.arguments.includes(arg)) { return false; }
+      counters[arg]++;
+      return true;
     });
   });
   msg += ` counters ${JSON.stringify(counters)}`;
```
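A self-contained sketch of the streak-counting heuristic from the version-info diff (the standalone function signature is hypothetical; in CryptOpt this logic lives inside `Model`): for each candidate, count how many *consecutive* upcoming mulx operations use it, pick the longest streak, and give up on a tie.

```typescript
interface MulxOp {
  operation: string;
  arguments: string[];
}

// For each candidate, count the consecutive run of upcoming mulx ops that
// use it; return the candidate with the longest run, or null on a tie.
function chooseMulxLoadValue(candidates: string[], upcoming: MulxOp[]): string | null {
  const mulxOps = upcoming.filter((op) => op.operation === "mulx");
  const counters = new Map<string, number>();
  for (const arg of candidates) {
    let streak = 0;
    for (const op of mulxOps) {
      if (!op.arguments.includes(arg)) break; // streak ends at first non-use
      streak++;
    }
    counters.set(arg, streak);
  }
  const ranked = [...counters.entries()].sort((a, b) => b[1] - a[1]);
  if (ranked.length >= 2 && ranked[0][1] === ranked[1][1]) return null; // tie: no preference
  return ranked[0][0];
}

// Mirrors the first log line above: arg1[3] is used by the next two mulx
// ops (count 2) while arg2[1] only by the first (count 1).
const pick = chooseMulxLoadValue(
  ["arg1[3]", "arg2[1]"],
  [
    { operation: "mulx", arguments: ["arg1[3]", "arg2[1]"] },
    { operation: "mulx", arguments: ["arg1[3]", "arg2[2]"] },
  ],
);
```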
andres-erbsen commented 1 year ago

The only examples with the constant-loading silliness come from CryptOpt with my change attempting to fix the issue in my last comment, so it's likely that I just made matters worse there and that baseline CryptOpt does not have this issue.

andres-erbsen commented 1 year ago

It looks like this line forgets that rdx already contains the correct input, thus triggering the code that picks which input should be loaded again.

andres-erbsen commented 1 year ago

With the linked PR, I encountered the following silliness:

```asm
mov r8, 0xffffffff00000001 ; moving imm to reg
mov rdx, r8; 0xffffffff00000001 to rdx
mulx r11, r8, rcx; hix52, lox51<- 0xffffffff00000001 * x1
```
dderjoel commented 1 year ago

I'll look at this and the PR at the start of next week (04 October or so).

andres-erbsen commented 1 year ago

With the linked PR, I encountered the following silliness:

```asm
mov r8, 0xffffffff00000001 ; moving imm to reg
mov rdx, r8; 0xffffffff00000001 to rdx
mulx r11, r8, rcx; hix52, lox51<- 0xffffffff00000001 * x1
```

Ironically, manually editing this out can make the program faster or slower.