tkoenig1 closed this issue 1 year ago
It could be further combined into (unless I get this wrong)
lduw r2,[r1,r2,800]
saving one instruction and one register (r3), but not code size.
Yes, I think you are right. I have a version (which needs more testing) that generates:

_nettle_aes_decrypt: ; @_nettle_aes_decrypt
ldd r1,[ip,_nettle_aes_decrypt_T]
lduw r2,[ip,_nettle_aes_decrypt_w1]
srl r2,r2,<0:4>
and r2,r2,#20
lduw r2,[r1,r2,800]
stw r2,[r1,800]
ret
Try again with commit af7ee91ac91db0ac489c686eb35daf15ceb1f32f
Code generated now is
ldd r1,[ip,_nettle_aes_decrypt_T]
lduw r2,[ip,_nettle_aes_decrypt_w1]
srl r2,r2,<0:4>
and r2,r2,#20
lduw r2,[r1,r2,800]
stw r2,[r1,800]
mov r1,#0
ret
I don't see anything to optimize there any more.
Thanks!
Mitch mentioned on comp.arch that reducing code size for embench (especially compared to RISC-V) is of interest, so I thought I'd take a look at the code generated from that benchmark.
I noticed code in nettle-aes.c which could be improved for instruction count and code size. Here is a reduced test case:
gives me
where the
add r2,r3,r2
and the
lduw r2,[r2]
could be combined into a single indexed load. The reduction was done with a little Perl script, with
Compiler version was clang version 15.0.0 (https://github.com/bagel99/llvm-my66000.git 5a26c794e6a7bbe1692c2c4c4399b95ef8124c16), Release build.