HLRichardson-Git / Gestalt

A user-friendly cryptography library
https://gestaltcrypto.github.io
MIT License

Generalized templates for Block Cipher Modes of Operations #25

Open HLRichardson-Git opened 1 month ago

HLRichardson-Git commented 1 month ago

Summary

As of the latest version of Gestalt (version 0.6, released July 30th, 2024), we introduced our second block cipher, DES (and TDES as well, if you count that as a separate cipher). In doing so, we ran into the not-at-all-unexpected problem of trying to reuse the current template functions for the modes of operation (ECB, CBC, etc.) that were previously used only for AES.

The AES and DES implementations are fundamentally different in how blocks are represented:

  1. AES uses an unsigned char array pointer
  2. DES uses uint64_t

This mismatch means it isn't easy to make the template work for both block ciphers. I propose we refactor the AES implementation to better match the implementation of DES.
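To make the mismatch concrete, here is a minimal sketch with hypothetical stand-ins (these names are illustrative, not Gestalt's actual internals): an AES-style block function mutates a 16-byte buffer through a pointer, while a DES-style one takes and returns a uint64_t by value, so a single mode-of-operation template can't call both uniformly.

```cpp
#include <cstdint>

// AES-style (illustrative): the block is a pointer to a 16-byte
// unsigned char buffer, transformed in place.
void aesStyleEncryptBlock(unsigned char* block) {
    for (int i = 0; i < 16; i++) block[i] ^= 0xAA;  // placeholder round
}

// DES-style (illustrative): the block is a uint64_t, passed and
// returned by value.
uint64_t desStyleEncryptBlock(uint64_t block) {
    return block ^ 0xAAAAAAAAAAAAAAAAULL;           // placeholder round
}

// A generic ECB/CBC template can't unify these: the block types
// (unsigned char* vs uint64_t) differ, and so does the calling
// convention (mutate-through-pointer vs return-a-new-value).
```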

Motivation

The solution we settled on for version 0.6 was to create independent implementations of ECB and CBC for DES and TDES. This is not ideal: as we add more block ciphers, maintaining a separate implementation of each mode for each cipher will become more and more of a headache.

Additional context

The way I plan on refactoring AES is to actually largely leave the implementation as it is. The main change is to make the implementation always expect an unsigned char block[16] instead of a pointer to an unsigned char array (who knows, maybe this will help performance). Then we rework the modes-of-operation template so we can create a vector of these unsigned char block[16] arrays, similar to how we make a vector of uint64_t for DES's modes of operation.
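
A rough sketch of that direction, under the assumption that AES is reworked to take a fixed 16-byte block by value (the names Block16, splitIntoBlocks, and ecbEncrypt are illustrative, not Gestalt's API):

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// A fixed 16-byte AES block, mirroring std::vector<uint64_t> for DES.
using Block16 = std::array<unsigned char, 16>;

// Split plaintext into a vector of 16-byte blocks, zero-filling the
// final partial block (real code would apply proper padding instead).
std::vector<Block16> splitIntoBlocks(const std::string& in) {
    std::vector<Block16> blocks((in.size() + 15) / 16, Block16{});
    for (size_t i = 0; i < in.size(); i++)
        blocks[i / 16][i % 16] = static_cast<unsigned char>(in[i]);
    return blocks;
}

// With AES blocks in std::vector<Block16> and DES blocks in
// std::vector<uint64_t>, one ECB template can iterate over either:
// it only needs a block type and a per-block encrypt function.
template <typename Block, typename EncryptBlockFn>
std::vector<Block> ecbEncrypt(std::vector<Block> blocks, EncryptBlockFn f) {
    for (auto& b : blocks) b = f(b);
    return blocks;
}
```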

Additionally, I would like to change the current in-place encryption/decryption to instead return the computed value.
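
In other words, something like the following shift in style (both functions are illustrative placeholders, not Gestalt's real block functions):

```cpp
#include <array>

using Block16 = std::array<unsigned char, 16>;

// Current style (illustrative): encryption mutates the caller's buffer.
void encryptBlockInPlace(Block16& block) {
    for (auto& b : block) b ^= 0x42;  // placeholder transformation
}

// Proposed style: take the block by value and return the result. The
// caller's input stays untouched, and the call composes cleanly inside
// a mode-of-operation loop (b = encryptBlock(b)).
Block16 encryptBlock(Block16 block) {
    for (auto& b : block) b ^= 0x42;  // same placeholder transformation
    return block;
}
```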

HLRichardson-Git commented 1 month ago

Since this change could affect the performance of the implementation, I took a benchmark of the current implementation first. Here is that benchmark:

#include <benchmark/benchmark.h>
#include <string>
#include <gestalt/aes.h>

static void BM_AESEncryptECB(benchmark::State& state) {
    // 128-bit key as a hex string; plaintext is state.range(0) bytes of 'A'
    const std::string key = "10a58869d74be5a374cf867cfb473859";
    std::string pt;
    pt.resize(state.range(0), 'A');
    std::string ct;

    for (auto _ : state) {
        ct = encryptAESECB(pt, key);
        benchmark::DoNotOptimize(ct);  // keep the result from being optimized away
    }
}

BENCHMARK(BM_AESEncryptECB)->DenseRange(1024, 65536, 2048);
BENCHMARK_MAIN();

Running this benchmark I received these results:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_AESEncryptECB/1024       70809 ns        71498 ns         8960
BM_AESEncryptECB/3072      205943 ns       205078 ns         3200
BM_AESEncryptECB/5120      358817 ns       353021 ns         2036
BM_AESEncryptECB/7168      459763 ns       460482 ns         1493
BM_AESEncryptECB/9216      598512 ns       599888 ns         1120
BM_AESEncryptECB/11264     731372 ns       732422 ns          896
BM_AESEncryptECB/13312     869547 ns       878514 ns          747
BM_AESEncryptECB/15360     997502 ns      1004016 ns          747
BM_AESEncryptECB/17408    1120059 ns      1116071 ns          560
BM_AESEncryptECB/19456    1242621 ns      1223645 ns          498
BM_AESEncryptECB/21504    1388437 ns      1395089 ns          448
BM_AESEncryptECB/23552    1524054 ns      1534598 ns          448
BM_AESEncryptECB/25600    1637397 ns      1650799 ns          407
BM_AESEncryptECB/27648    1773527 ns      1759383 ns          373
BM_AESEncryptECB/29696    1940738 ns      1968834 ns          373
BM_AESEncryptECB/31744    2047902 ns      2038043 ns          345
BM_AESEncryptECB/33792    2192906 ns      2197266 ns          320
BM_AESEncryptECB/35840    2306850 ns      2299331 ns          299
BM_AESEncryptECB/37888    2432239 ns      2426610 ns          264
BM_AESEncryptECB/39936    2578160 ns      2544981 ns          264
BM_AESEncryptECB/41984    2707504 ns      2698293 ns          249
BM_AESEncryptECB/44032    2834147 ns      2846928 ns          236
BM_AESEncryptECB/46080    3039373 ns      2999442 ns          224
BM_AESEncryptECB/48128    3215499 ns      3208705 ns          224
BM_AESEncryptECB/50176    3531040 ns      3525641 ns          195
BM_AESEncryptECB/52224    3716222 ns      3753492 ns          179
BM_AESEncryptECB/54272    3618258 ns      3592914 ns          187
BM_AESEncryptECB/56320    3650675 ns      3578911 ns          179
BM_AESEncryptECB/58368    3970371 ns      4010695 ns          187
BM_AESEncryptECB/60416    4124791 ns      4178779 ns          172
BM_AESEncryptECB/62464    4050373 ns      4087936 ns          172
BM_AESEncryptECB/64512    4213611 ns      4199219 ns          160

And here is that data graphed, where the x axis is the input size and the y axis is the time in ns:

[Graph: time (ns) vs. input size]

Encouragingly, the current implementation scales very close to linearly as the input size increases. The goal is for this refactor to not affect that performance.

HLRichardson-Git commented 1 month ago

I was sick for the past few days, so I didn't work on this as much as I'd have liked. Since the last update, though, I was able to rework AES and the ECB template into a form that I believe could be reused for both AES and DES. But after committing it, I took a look at the new benchmark. Here are the raw results:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_AESEncryptECB/1024      226801 ns       224609 ns         3200
BM_AESEncryptECB/3072      642614 ns       641741 ns         1120
BM_AESEncryptECB/5120     1047743 ns      1049805 ns          640
BM_AESEncryptECB/7168     1436925 ns      1443273 ns          498
BM_AESEncryptECB/9216     1990894 ns      1926944 ns          373
BM_AESEncryptECB/11264    2264091 ns      2232143 ns          280
BM_AESEncryptECB/13312    2645481 ns      2635542 ns          249
BM_AESEncryptECB/15360    3077630 ns      3138951 ns          224
BM_AESEncryptECB/17408    3591912 ns      3599877 ns          204
BM_AESEncryptECB/19456    4522488 ns      4518072 ns          166
BM_AESEncryptECB/21504    5531717 ns      4882812 ns          112
BM_AESEncryptECB/23552    5067285 ns      5000000 ns          100
BM_AESEncryptECB/25600    5213305 ns      5161830 ns          112
BM_AESEncryptECB/27648    5650490 ns      5468750 ns          100
BM_AESEncryptECB/29696    6030426 ns      5859375 ns          112
BM_AESEncryptECB/31744    6477921 ns      6277902 ns          112
BM_AESEncryptECB/33792    6760553 ns      6835938 ns          112
BM_AESEncryptECB/35840    7162940 ns      7118056 ns           90
BM_AESEncryptECB/37888    7611707 ns      7638889 ns           90
BM_AESEncryptECB/39936    7654735 ns      7500000 ns           75
BM_AESEncryptECB/41984    8348188 ns      8333333 ns           75
BM_AESEncryptECB/44032    8673837 ns      8750000 ns           75
BM_AESEncryptECB/46080    8995873 ns      9166667 ns           75
BM_AESEncryptECB/48128    9332145 ns      9277344 ns           64
BM_AESEncryptECB/50176    9967181 ns     10000000 ns           75
BM_AESEncryptECB/52224   10224565 ns     10208333 ns           75
BM_AESEncryptECB/54272   10723784 ns     10742188 ns           64
BM_AESEncryptECB/56320   10826266 ns     10986328 ns           64
BM_AESEncryptECB/58368   11799203 ns     11718750 ns           64
BM_AESEncryptECB/60416   11498950 ns     11439732 ns           56
BM_AESEncryptECB/62464   11472671 ns     11439732 ns           56
BM_AESEncryptECB/64512   11942468 ns     11997768 ns           56

The difference probably isn't obvious from the raw numbers, so I also plotted it on the graph, with the new benchmark in purple and the last benchmark in green:

[Graph: time (ns) vs. input size, new vs. previous benchmark]

Not ideal...

I thought the initial attempt might decrease performance, but not by this much. I'll have to look into it some more and see if I can improve it. But I'm worried it will be too much hassle, when I could just be implementing the modes of operation independently for each block cipher.
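
One guess worth checking (purely a hypothesis, since I haven't profiled yet): if the refactored path copies the std::string input into a vector of blocks and then copies the result back out, that adds two full passes over the data that the in-place version never paid for. A minimal sketch of that round trip, with illustrative names:

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

using Block16 = std::array<unsigned char, 16>;

// Hypothetical shape of the new path: two extra copy passes bracket
// the actual encryption work.
std::string roundTripThroughBlocks(const std::string& in) {
    // Copy #1: string -> vector of 16-byte blocks (zero-filled tail)
    std::vector<Block16> blocks((in.size() + 15) / 16, Block16{});
    for (size_t i = 0; i < in.size(); i++)
        blocks[i / 16][i % 16] = static_cast<unsigned char>(in[i]);

    // (per-block encryption would happen here)

    // Copy #2: vector of blocks -> output string
    std::string out(blocks.size() * 16, '\0');
    for (size_t i = 0; i < out.size(); i++)
        out[i] = static_cast<char>(blocks[i / 16][i % 16]);
    return out;
}
```

If profiling confirms this, encrypting blocks directly over the string's buffer (or reserving and writing output once) might claw the time back.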

HLRichardson-Git commented 1 month ago

I'm starting to feel better, so hopefully I can get back to coding more regularly. In the meantime, I went back and forth a bunch in my head trying to decide whether we want to keep trying to make these modes-of-operation templates work; who knows, maybe I'll learn more in the future and come back to this problem. But for now, we are going with separate implementations of each mode of operation for each block cipher.

My Rationale:

It's not all bad news...

(Let's be honest, that wasn't really bad news... right?)

While trying to optimize the templates, I had the chance to look back over my AES implementation and found a few places where I could optimize. Those optimizations can be seen in the latest commit, but for those who don't care to dig through it, here are the new benchmarks:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_AESEncryptECB/1024       47601 ns        47433 ns        11200
BM_AESEncryptECB/3072      136129 ns       134969 ns         4978
BM_AESEncryptECB/5120      225360 ns       224609 ns         3200
BM_AESEncryptECB/7168      311890 ns       313895 ns         2240
BM_AESEncryptECB/9216      401819 ns       401088 ns         1792
BM_AESEncryptECB/11264     490531 ns       500000 ns         1000
BM_AESEncryptECB/13312     579231 ns       571987 ns         1120
BM_AESEncryptECB/15360     676535 ns       683594 ns         1120
BM_AESEncryptECB/17408     810716 ns       819615 ns          896
BM_AESEncryptECB/19456     872029 ns       871931 ns          896
BM_AESEncryptECB/21504     951488 ns       962182 ns          747
BM_AESEncryptECB/23552    1023829 ns      1025391 ns          640
BM_AESEncryptECB/25600    1123748 ns      1123047 ns          640
BM_AESEncryptECB/27648    1179485 ns      1171875 ns          560
BM_AESEncryptECB/29696    1232684 ns      1227679 ns          560
BM_AESEncryptECB/31744    1316355 ns      1311384 ns          560
BM_AESEncryptECB/33792    1544196 ns      1537400 ns          498
BM_AESEncryptECB/35840    1508360 ns      1499721 ns          448
BM_AESEncryptECB/37888    1670428 ns      1650799 ns          407
BM_AESEncryptECB/39936    1913689 ns      1881143 ns          407
BM_AESEncryptECB/41984    1855895 ns      1843164 ns          373
BM_AESEncryptECB/44032    1930365 ns      1902174 ns          345
BM_AESEncryptECB/46080    2029674 ns      2038043 ns          345
BM_AESEncryptECB/48128    2314392 ns      2299331 ns          299
BM_AESEncryptECB/50176    2870023 ns      2790179 ns          280
BM_AESEncryptECB/52224    2487778 ns      2485795 ns          264
BM_AESEncryptECB/54272    2477391 ns      2455357 ns          280
BM_AESEncryptECB/56320    2539343 ns      2511161 ns          280
BM_AESEncryptECB/58368    2632226 ns      2604167 ns          264
BM_AESEncryptECB/60416    2627194 ns      2566964 ns          280
BM_AESEncryptECB/62464    2656403 ns      2663352 ns          264
BM_AESEncryptECB/64512    2751655 ns      2781723 ns          264

To compare the difference, I added these results to the graph again; the new implementation is in yellow this time:

[Graph: time (ns) vs. input size, all three benchmarks]

Feels good to not have regressed the implementation :)

Notably, the new curve is more "bumpy," which is kind of interesting. But I am decently proud of it, and I know there are a lot of other optimizations we can make, which are discussed in this issue if you're interested.