HLRichardson-Git opened this issue 1 month ago
As this change could affect the performance of the implementation, I took a benchmark of the current implementation. Here is that benchmark:
```cpp
#include <benchmark/benchmark.h>
#include <string>

#include <gestalt/aes.h>

static void BM_AESEncryptECB(benchmark::State& state) {
    const std::string key = "10a58869d74be5a374cf867cfb473859";
    std::string pt;
    pt.resize(state.range(0), 'A');
    std::string ct;
    for (auto _ : state) {
        ct = encryptAESECB(pt, key);
        benchmark::DoNotOptimize(ct);
    }
}

BENCHMARK(BM_AESEncryptECB)->DenseRange(1024, 65536, 2048);
BENCHMARK_MAIN();
```
Running this benchmark, I received these results:
```
-----------------------------------------------------------------
Benchmark                     Time        CPU       Iterations
-----------------------------------------------------------------
BM_AESEncryptECB/1024 70809 ns 71498 ns 8960
BM_AESEncryptECB/3072 205943 ns 205078 ns 3200
BM_AESEncryptECB/5120 358817 ns 353021 ns 2036
BM_AESEncryptECB/7168 459763 ns 460482 ns 1493
BM_AESEncryptECB/9216 598512 ns 599888 ns 1120
BM_AESEncryptECB/11264 731372 ns 732422 ns 896
BM_AESEncryptECB/13312 869547 ns 878514 ns 747
BM_AESEncryptECB/15360 997502 ns 1004016 ns 747
BM_AESEncryptECB/17408 1120059 ns 1116071 ns 560
BM_AESEncryptECB/19456 1242621 ns 1223645 ns 498
BM_AESEncryptECB/21504 1388437 ns 1395089 ns 448
BM_AESEncryptECB/23552 1524054 ns 1534598 ns 448
BM_AESEncryptECB/25600 1637397 ns 1650799 ns 407
BM_AESEncryptECB/27648 1773527 ns 1759383 ns 373
BM_AESEncryptECB/29696 1940738 ns 1968834 ns 373
BM_AESEncryptECB/31744 2047902 ns 2038043 ns 345
BM_AESEncryptECB/33792 2192906 ns 2197266 ns 320
BM_AESEncryptECB/35840 2306850 ns 2299331 ns 299
BM_AESEncryptECB/37888 2432239 ns 2426610 ns 264
BM_AESEncryptECB/39936 2578160 ns 2544981 ns 264
BM_AESEncryptECB/41984 2707504 ns 2698293 ns 249
BM_AESEncryptECB/44032 2834147 ns 2846928 ns 236
BM_AESEncryptECB/46080 3039373 ns 2999442 ns 224
BM_AESEncryptECB/48128 3215499 ns 3208705 ns 224
BM_AESEncryptECB/50176 3531040 ns 3525641 ns 195
BM_AESEncryptECB/52224 3716222 ns 3753492 ns 179
BM_AESEncryptECB/54272 3618258 ns 3592914 ns 187
BM_AESEncryptECB/56320 3650675 ns 3578911 ns 179
BM_AESEncryptECB/58368 3970371 ns 4010695 ns 187
BM_AESEncryptECB/60416 4124791 ns 4178779 ns 172
BM_AESEncryptECB/62464 4050373 ns 4087936 ns 172
BM_AESEncryptECB/64512 4213611 ns 4199219 ns 160
```
And here is that data graphed, with the input size on the x-axis and the time in ns on the y-axis:
Encouragingly, the current implementation scales almost linearly as the input size increases. The goal is for this refactor not to affect that performance.
I was sick for the past few days, so I didn't work on this as much as I'd have liked. Since the last update, though, I was able to reshape AES and the ECB template into the form I believe could be reused for both AES and DES. But after committing it, I took a look at the new benchmark. Here are the raw results:
```
-----------------------------------------------------------------
Benchmark                     Time        CPU       Iterations
-----------------------------------------------------------------
BM_AESEncryptECB/1024 226801 ns 224609 ns 3200
BM_AESEncryptECB/3072 642614 ns 641741 ns 1120
BM_AESEncryptECB/5120 1047743 ns 1049805 ns 640
BM_AESEncryptECB/7168 1436925 ns 1443273 ns 498
BM_AESEncryptECB/9216 1990894 ns 1926944 ns 373
BM_AESEncryptECB/11264 2264091 ns 2232143 ns 280
BM_AESEncryptECB/13312 2645481 ns 2635542 ns 249
BM_AESEncryptECB/15360 3077630 ns 3138951 ns 224
BM_AESEncryptECB/17408 3591912 ns 3599877 ns 204
BM_AESEncryptECB/19456 4522488 ns 4518072 ns 166
BM_AESEncryptECB/21504 5531717 ns 4882812 ns 112
BM_AESEncryptECB/23552 5067285 ns 5000000 ns 100
BM_AESEncryptECB/25600 5213305 ns 5161830 ns 112
BM_AESEncryptECB/27648 5650490 ns 5468750 ns 100
BM_AESEncryptECB/29696 6030426 ns 5859375 ns 112
BM_AESEncryptECB/31744 6477921 ns 6277902 ns 112
BM_AESEncryptECB/33792 6760553 ns 6835938 ns 112
BM_AESEncryptECB/35840 7162940 ns 7118056 ns 90
BM_AESEncryptECB/37888 7611707 ns 7638889 ns 90
BM_AESEncryptECB/39936 7654735 ns 7500000 ns 75
BM_AESEncryptECB/41984 8348188 ns 8333333 ns 75
BM_AESEncryptECB/44032 8673837 ns 8750000 ns 75
BM_AESEncryptECB/46080 8995873 ns 9166667 ns 75
BM_AESEncryptECB/48128 9332145 ns 9277344 ns 64
BM_AESEncryptECB/50176 9967181 ns 10000000 ns 75
BM_AESEncryptECB/52224 10224565 ns 10208333 ns 75
BM_AESEncryptECB/54272 10723784 ns 10742188 ns 64
BM_AESEncryptECB/56320 10826266 ns 10986328 ns 64
BM_AESEncryptECB/58368 11799203 ns 11718750 ns 64
BM_AESEncryptECB/60416 11498950 ns 11439732 ns 56
BM_AESEncryptECB/62464 11472671 ns 11439732 ns 56
BM_AESEncryptECB/64512 11942468 ns 11997768 ns 56
```
It probably isn't obvious from the raw numbers, so I plotted it on the graph again, with the new benchmark in purple and the last benchmark in green:
Not ideal...
I expected the initial attempt might decrease performance, but not by this much. I'll have to look into it some more and see if I can improve it, but I'm worried it will be too much hassle when I should just be implementing the modes of operation independently for each block cipher.
I'm starting to feel better, so hopefully I can get back to coding more regularly. In the meantime, I went back and forth a bunch in my head trying to decide whether to keep trying to make these mode-of-operation templates work. Who knows, maybe I'll learn more in the future and come back to this problem. But for now, we are going with separate implementations of each mode of operation for each block cipher.
(Let's be honest, that wasn't really bad news... right?)
While trying to optimize the templates, I had the chance to look back over my AES implementation and found a few places I could optimize. Those optimizations can be seen in the latest commit, but for those who don't want to dig through it, here are the new benchmarks:
```
-----------------------------------------------------------------
Benchmark                     Time        CPU       Iterations
-----------------------------------------------------------------
BM_AESEncryptECB/1024 47601 ns 47433 ns 11200
BM_AESEncryptECB/3072 136129 ns 134969 ns 4978
BM_AESEncryptECB/5120 225360 ns 224609 ns 3200
BM_AESEncryptECB/7168 311890 ns 313895 ns 2240
BM_AESEncryptECB/9216 401819 ns 401088 ns 1792
BM_AESEncryptECB/11264 490531 ns 500000 ns 1000
BM_AESEncryptECB/13312 579231 ns 571987 ns 1120
BM_AESEncryptECB/15360 676535 ns 683594 ns 1120
BM_AESEncryptECB/17408 810716 ns 819615 ns 896
BM_AESEncryptECB/19456 872029 ns 871931 ns 896
BM_AESEncryptECB/21504 951488 ns 962182 ns 747
BM_AESEncryptECB/23552 1023829 ns 1025391 ns 640
BM_AESEncryptECB/25600 1123748 ns 1123047 ns 640
BM_AESEncryptECB/27648 1179485 ns 1171875 ns 560
BM_AESEncryptECB/29696 1232684 ns 1227679 ns 560
BM_AESEncryptECB/31744 1316355 ns 1311384 ns 560
BM_AESEncryptECB/33792 1544196 ns 1537400 ns 498
BM_AESEncryptECB/35840 1508360 ns 1499721 ns 448
BM_AESEncryptECB/37888 1670428 ns 1650799 ns 407
BM_AESEncryptECB/39936 1913689 ns 1881143 ns 407
BM_AESEncryptECB/41984 1855895 ns 1843164 ns 373
BM_AESEncryptECB/44032 1930365 ns 1902174 ns 345
BM_AESEncryptECB/46080 2029674 ns 2038043 ns 345
BM_AESEncryptECB/48128 2314392 ns 2299331 ns 299
BM_AESEncryptECB/50176 2870023 ns 2790179 ns 280
BM_AESEncryptECB/52224 2487778 ns 2485795 ns 264
BM_AESEncryptECB/54272 2477391 ns 2455357 ns 280
BM_AESEncryptECB/56320 2539343 ns 2511161 ns 280
BM_AESEncryptECB/58368 2632226 ns 2604167 ns 264
BM_AESEncryptECB/60416 2627194 ns 2566964 ns 280
BM_AESEncryptECB/62464 2656403 ns 2663352 ns 264
BM_AESEncryptECB/64512 2751655 ns 2781723 ns 264
```
To compare the difference, I added these results to the graph again; the new implementation is in yellow this time:
Feels good to not have regressed the implementation :)
Notably, it is more "bumpy," which is kind of interesting. But I am decently proud of it, and I know there are a lot of other optimizations we can still make, which are discussed in this issue if you're interested.
## Summary
In the latest version of Gestalt as of July 30th, 2024 (version 0.6), we introduced a second block cipher, DES (and TDES as well, if you count that as a separate cipher). In doing so, we ran into the not-at-all-unexpected problem of trying to reuse the current template functions for the modes of operation (ECB, CBC, etc.) that were previously only used for AES.
The AES and DES implementations are fundamentally different in how the input is "divided" into blocks: AES works on 16-byte `unsigned char` arrays, while DES works on `uint64_t` values.
This means it isn't easy to make the template work for both block ciphers. I propose we refactor the AES implementation to better match the implementation of DES.
## Motivation
The solution we settled on for version 0.6 was to create independent implementations of ECB and CBC for DES and TDES. This is not ideal: as we add more block ciphers, it will become more and more of a headache to try and maintain them all.
## Additional context
The way I plan on refactoring AES is to actually largely leave the implementation as it is. The main change is to have it always expect an `unsigned char block[16]` instead of a pointer to an `unsigned char` array (who knows, maybe this will help performance). Then we rework the modes-of-operation template so we can create a vector of these `unsigned char block[16]` arrays, similar to how we make a vector of `uint64_t` for DES' modes of operation. Additionally, I would like to change the current in-place encryption/decryption to instead return the computed value.