tkoenig1 closed this issue 1 year ago
It could be further combined into (unless I get this wrong)
lduw r2,[r1,r2,800]
saving one instruction and one register (r3), but not code size.
Yes, I think you are right. I have a version (which needs more testing) that generates:

_nettle_aes_decrypt: ; @_nettle_aes_decrypt
ldd r1,[ip,_nettle_aes_decrypt_T]
lduw r2,[ip,_nettle_aes_decrypt_w1]
srl r2,r2,<0:4>
and r2,r2,#20
lduw r2,[r1,r2,800]
stw r2,[r1,800]
ret
Try again with commit af7ee91ac91db0ac489c686eb35daf15ceb1f32f
Code generated now is
ldd r1,[ip,_nettle_aes_decrypt_T]
lduw r2,[ip,_nettle_aes_decrypt_w1]
srl r2,r2,<0:4>
and r2,r2,#20
lduw r2,[r1,r2,800]
stw r2,[r1,800]
mov r1,#0
ret
I don't see anything to optimize there any more.
Thanks!
Mitch mentioned on comp.arch that reducing code size for embench (especially compared to RISC-V) is of interest, so I thought I'd take a look at the code generated from that benchmark.
I noticed code in nettle-aes.c which could be improved for instruction count and code size. Here is a reduced test case:
gives me
where the
add r2,r3,r2
and the
lduw r2,[r2]
could be combined into a single indexed load. The reduction was done with a little Perl script, with
Compiler version was clang version 15.0.0 (https://github.com/bagel99/llvm-my66000.git 5a26c794e6a7bbe1692c2c4c4399b95ef8124c16), Release build.