sha256: Roll iterations inside sha256_transform into loop

With GCC 6.3.0 from Debian, I do actually see an LTO improvement:

64:
add/remove: 1/0 grow/shrink: 0/1 up/down: 256/-4670 (-4414)
Function                                     old     new   delta
K                                              -     256    +256
sha256_update                               7612    2942   -4670
Total: Before=60531, After=56117, chg -7.29%

32:
add/remove: 1/0 grow/shrink: 0/1 up/down: 256/-4561 (-4305)
Function                                     old     new   delta
K                                              -     256    +256
sha256_update                               7649    3088   -4561
Total: Before=31577, After=27272, chg -13.63%

lto.64:
add/remove: 3/2 grow/shrink: 0/0 up/down: 3300/-8004 (-4704)
Function                                     old     new   delta
sha256_update                                  -    2942   +2942
K                                              -     256    +256
BLEND_OP                                       -     102    +102
BLEND_OP.lto_priv                            102       -    -102
sha256_update.lto_priv                      7902       -   -7902
Total: Before=60817, After=56113, chg -7.73%

lto.32:
add/remove: 3/2 grow/shrink: 0/0 up/down: 3461/-7825 (-4364)
Function                                     old     new   delta
sha256_update                                  -    3088   +3088
K                                              -     256    +256
BLEND_OP                                       -     117    +117
BLEND_OP.lto_priv                            117       -    -117
sha256_update.lto_priv                      7708       -   -7708
Total: Before=31381, After=27017, chg -13.91%

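Interestingly, the new code uses an extra 3 stack slots (the frame grows from 0x68 to 0x80 bytes), and the disassembly highlights one of the most corner-case optimisations I've ever seen a compiler make: 0x80 doesn't fit in a sign-extended 8-bit immediate but -0x80 does, so GCC emits an add of -0x80 instead of a longer sub of 0x80. Before, then after: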

1a89:       48 83 ec 68             sub    $0x68,%rsp

1a89:       48 83 c4 80             add    $0xffffffffffffff80,%rsp
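
To spell out the encoding arithmetic behind that, here is a minimal sketch. The helper names are made up for illustration and are not from GCC or this repo. It assumes only the two x86-64 forms of an `%rsp` adjustment: REX.W + opcode + ModRM + imm8 (4 bytes, as in both dumps above) and the same with an imm32 (7 bytes).

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v is representable as a sign-extended 8-bit immediate. */
static bool fits_imm8(int64_t v)
{
    return v >= INT8_MIN && v <= INT8_MAX;
}

/*
 * Encoded length in bytes of `sub $imm,%rsp` or `add $imm,%rsp` in
 * 64-bit mode: REX.W + opcode + ModRM, plus a 1-byte or 4-byte immediate.
 */
static int rsp_adjust_len(int64_t imm)
{
    return fits_imm8(imm) ? 4 : 7;
}

/*
 * Hypothetical helper: shortest encoding that reserves `frame` bytes,
 * trying both `sub $frame,%rsp` and `add $-frame,%rsp`.
 */
static int shortest_frame_setup(int64_t frame)
{
    int sub_len = rsp_adjust_len(frame);
    int add_len = rsp_adjust_len(-frame);

    return sub_len < add_len ? sub_len : add_len;
}
```

0x80 is the only frame size where the trick pays off in the imm8 range: +128 is one past the largest positive imm8, while -128 is exactly the smallest negative one.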

Either way, a massive improvement. Space is at a premium, and this will all fit in the L1 cache. (It's liable to be a little faster too, as the round constants now come from a table in memory rather than from immediates that the instruction decoder has to produce.)
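
For anyone reading along without the diff open, this is roughly the shape of a rolled transform. It is a sketch of the standard SHA-256 compression function with the round constants in a 64-entry table, not the code from this PR: `sha256_transform_rolled` and `ror32` are names I've made up here, and the real code goes through `BLEND_OP`. The table is 64 x 4 = 256 bytes, which matches the new `K` symbol in the size output.

```c
#include <stdint.h>

/* SHA-256 round constants, kept as data rather than as immediates. */
static const uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
    0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
    0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
    0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
    0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
    0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
    0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
    0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
    0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
};

/* Rotate right; n is always in 1..31 here. */
static uint32_t ror32(uint32_t x, unsigned int n)
{
    return (x >> n) | (x << (32 - n));
}

/* Hypothetical rolled compression function over one 64-byte block. */
static void sha256_transform_rolled(uint32_t state[8], const uint8_t block[64])
{
    uint32_t W[64], a, b, c, d, e, f, g, h;
    int i;

    /* Load the block as 16 big-endian words. */
    for (i = 0; i < 16; i++)
        W[i] = (uint32_t)block[4 * i] << 24 |
               (uint32_t)block[4 * i + 1] << 16 |
               (uint32_t)block[4 * i + 2] << 8 |
               (uint32_t)block[4 * i + 3];

    /* Message schedule expansion, also as a loop. */
    for (i = 16; i < 64; i++) {
        uint32_t s0 = ror32(W[i - 15], 7) ^ ror32(W[i - 15], 18) ^ (W[i - 15] >> 3);
        uint32_t s1 = ror32(W[i - 2], 17) ^ ror32(W[i - 2], 19) ^ (W[i - 2] >> 10);

        W[i] = W[i - 16] + s0 + W[i - 7] + s1;
    }

    a = state[0]; b = state[1]; c = state[2]; d = state[3];
    e = state[4]; f = state[5]; g = state[6]; h = state[7];

    /* 64 rounds in one loop; K[i] is a table load, not an immediate. */
    for (i = 0; i < 64; i++) {
        uint32_t t1 = h + (ror32(e, 6) ^ ror32(e, 11) ^ ror32(e, 25)) +
                      ((e & f) ^ (~e & g)) + K[i] + W[i];
        uint32_t t2 = (ror32(a, 2) ^ ror32(a, 13) ^ ror32(a, 22)) +
                      ((a & b) ^ (a & c) ^ (b & c));

        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }

    state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
```

The unrolled form bakes each of the 64 constants into the instruction stream as an immediate, which is where most of the ~4.5k of text went. Here they cost 256 bytes of read-only data plus one small loop body.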
