TrenchBoot / landing-zone

An open source implementation of an AMD-V Secure Loader.
GNU General Public License v2.0

sha256: Roll iterations inside sha256_transform into loop #24

Closed by krystian-hebel 4 years ago

krystian-hebel commented 4 years ago

64:
add/remove: 1/0 grow/shrink: 0/1 up/down: 256/-4490 (-4234)
Function                                     old     new   delta
K                                              -     256    +256
sha256_update                               7527    3037   -4490
Total: Before=60371, After=56137, chg -7.01%

32:
add/remove: 1/0 grow/shrink: 0/1 up/down: 256/-4621 (-4365)
Function                                     old     new   delta
K                                              -     256    +256
sha256_update                               7646    3025   -4621
Total: Before=31566, After=27201, chg -13.83%

Deltas are the same with and without LTO.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
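For readers without the diff at hand, the kind of change described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `sha256_transform_rolled`, the `ror32` helper and the array-based working state are assumptions made for the example. It shows the 64 rounds written as one loop indexing a constant table `K`, which is where the new 256-byte `K` symbol in the size report would come from (64 constants of 4 bytes each).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* SHA-256 round constants kept as data: 64 x 4 bytes = 256 bytes. */
static const uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
};
static_assert(sizeof(K) == 256, "K should occupy 256 bytes");

static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* Hypothetical name: compresses one 64-byte block into state[8]. */
static void sha256_transform_rolled(uint32_t state[8], const uint8_t block[64])
{
    uint32_t W[64], s[8]; /* s[0..7] play the roles of a..h */
    int i;

    /* Message schedule: 16 big-endian input words, then 48 derived words. */
    for (i = 0; i < 16; i++)
        W[i] = (uint32_t)block[4 * i] << 24 | (uint32_t)block[4 * i + 1] << 16 |
               (uint32_t)block[4 * i + 2] << 8 | (uint32_t)block[4 * i + 3];
    for (i = 16; i < 64; i++) {
        uint32_t s0 = ror32(W[i - 15], 7) ^ ror32(W[i - 15], 18) ^ (W[i - 15] >> 3);
        uint32_t s1 = ror32(W[i - 2], 17) ^ ror32(W[i - 2], 19) ^ (W[i - 2] >> 10);
        W[i] = W[i - 16] + s0 + W[i - 7] + s1;
    }

    memcpy(s, state, sizeof(s));

    /* One rolled loop instead of 64 unrolled rounds with inline constants. */
    for (i = 0; i < 64; i++) {
        uint32_t a = s[0], e = s[4];
        uint32_t S1 = ror32(e, 6) ^ ror32(e, 11) ^ ror32(e, 25);
        uint32_t ch = (e & s[5]) ^ (~e & s[6]);
        uint32_t t1 = s[7] + S1 + ch + K[i] + W[i];
        uint32_t S0 = ror32(a, 2) ^ ror32(a, 13) ^ ror32(a, 22);
        uint32_t maj = (a & s[1]) ^ (a & s[2]) ^ (s[1] & s[2]);
        uint32_t t2 = S0 + maj;

        memmove(&s[1], &s[0], 7 * sizeof(s[0])); /* h=g, g=f, ..., b=a */
        s[4] += t1;                              /* e = d + t1 */
        s[0] = t1 + t2;                          /* a = t1 + t2 */
    }

    for (i = 0; i < 8; i++)
        state[i] += s[i];
}
```

Feeding it the single padded block for the message "abc", with `state` initialised to the standard SHA-256 initial hash values, should yield the well-known digest beginning `ba7816bf`. The trade-off is the one the numbers above suggest: the rolled form gives up some straight-line code in exchange for a much smaller function plus a small constant table.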

andyhhp commented 4 years ago

With GCC 6.3.0 from Debian, I do actually see an LTO improvement:

64:
add/remove: 1/0 grow/shrink: 0/1 up/down: 256/-4670 (-4414)
Function                                     old     new   delta
K                                              -     256    +256
sha256_update                               7612    2942   -4670
Total: Before=60531, After=56117, chg -7.29%

32:
add/remove: 1/0 grow/shrink: 0/1 up/down: 256/-4561 (-4305)
Function                                     old     new   delta
K                                              -     256    +256
sha256_update                               7649    3088   -4561
Total: Before=31577, After=27272, chg -13.63%

lto.64:
add/remove: 3/2 grow/shrink: 0/0 up/down: 3300/-8004 (-4704)
Function                                     old     new   delta
sha256_update                                  -    2942   +2942
K                                              -     256    +256
BLEND_OP                                       -     102    +102
BLEND_OP.lto_priv                            102       -    -102
sha256_update.lto_priv                      7902       -   -7902
Total: Before=60817, After=56113, chg -7.73%

lto.32:
add/remove: 3/2 grow/shrink: 0/0 up/down: 3461/-7825 (-4364)
Function                                     old     new   delta
sha256_update                                  -    3088   +3088
K                                              -     256    +256
BLEND_OP                                       -     117    +117
BLEND_OP.lto_priv                            117       -    -117
sha256_update.lto_priv                      7708       -   -7708
Total: Before=31381, After=27017, chg -13.91%

Interestingly, an extra 3 stack slots are used (and the disassembly highlights one of the most corner-case optimisations I've ever seen a compiler make...)

1a89:       48 83 ec 68             sub    $0x68,%rsp

vs

1a89:       48 83 c4 80             add    $0xffffffffffffff80,%rsp

Either way, a massive improvement. Space is at a premium, and this will all fit in the L1 cache. (It's liable to be a little faster, as you're not causing the instruction decode to be the thing producing these constants.)
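The corner case in the disassembly above comes down to immediate ranges. A sign-extended 8-bit immediate covers -128..127, so subtracting 0x68 fits in the short form, while subtracting 0x80 (128) would not. Adding -128 does fit, which is presumably why the compiler emits `add $0xffffffffffffff80,%rsp` in the same 4 bytes. The sketch below models only that length rule; the helper names `fits_simm8` and `rsp_adjust_len` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 0x80 - 0x68 = 24 bytes, i.e. three extra 8-byte stack slots. */
static_assert((0x80 - 0x68) / 8 == 3, "three extra 8-byte stack slots");

/* Returns 1 if v is representable as a sign-extended 8-bit immediate. */
static int fits_simm8(int64_t v)
{
    return v >= -128 && v <= 127;
}

/*
 * Hypothetical helper: encoded length in bytes of a 64-bit
 * `add $imm, %rsp` or `sub $imm, %rsp`.
 * REX.W prefix + opcode + ModRM = 3 bytes, then a 1-byte immediate
 * (opcode 0x83) or a 4-byte immediate (opcode 0x81).
 */
static int rsp_adjust_len(int64_t imm)
{
    return 3 + (fits_simm8(imm) ? 1 : 4);
}
```

Under this model `rsp_adjust_len(0x68)` and `rsp_adjust_len(-128)` both give 4, matching the two encodings quoted above, whereas `rsp_adjust_len(128)` gives 7, the cost of a plain `sub $0x80,%rsp`.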