dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.92k stars 4.64k forks source link

.NET 6 : Aes.EncryptEcb one-shot is slower than ICryptoTransform.TransformFinalBlock #58784

Open sdrapkin opened 3 years ago

sdrapkin commented 3 years ago

Description

void Main() => BenchmarkRunner.Run<Bench>();

public class Bench
{
    byte[] _key, _plaintext, _ciphertext;

    [GlobalSetup]
    public void Setup()
    {
        _key = RandomNumberGenerator.GetBytes(32);
        _plaintext = Encoding.UTF8.GetBytes("This is a test. This is a test. ");

        using var tempAes = Aes.Create();
        tempAes.Key = _key;
        _ciphertext = tempAes.EncryptEcb(_plaintext, PaddingMode.None);
    }

    [Benchmark(Baseline = true)]
    public byte[] Old_Enc()
    {
        using Aes aes = Aes.Create();
        aes.Padding = PaddingMode.None;
        aes.Mode = CipherMode.ECB;

        using var transform = aes.CreateEncryptor(_key, null);
        return transform.TransformFinalBlock(_plaintext, 0, _plaintext.Length);
    }

    [Benchmark]
    public byte[] Old_Dec()
    {
        using Aes aes = Aes.Create();
        aes.Padding = PaddingMode.None;
        aes.Mode = CipherMode.ECB;

        using var transform = aes.CreateDecryptor(_key, null);
        return transform.TransformFinalBlock(_ciphertext, 0, _ciphertext.Length);
    }

    [Benchmark]
    public byte[] New_Enc()
    {
        using Aes aes = Aes.Create();
        aes.Key = _key;
        return aes.EncryptEcb((ReadOnlySpan<byte>)_plaintext, PaddingMode.None);
    }

    [Benchmark]
    public byte[] New_Dec()
    {
        using Aes aes = Aes.Create();
        aes.Key = _key;
        return aes.DecryptEcb((ReadOnlySpan<byte>)_ciphertext, PaddingMode.None);
    }
}

Configuration

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1202 (20H2/October2020Update)
Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.7.21379.14
  [Host] : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT

Results

Method Mean Error StdDev Ratio RatioSD
Old_Enc 725.8 ns 7.82 ns 6.53 ns 1.00 0.00
Old_Dec 805.9 ns 15.83 ns 16.25 ns 1.11 0.02
New_Enc 977.5 ns 7.90 ns 6.60 ns 1.35 0.02
New_Dec 1,016.8 ns 14.85 ns 13.16 ns 1.40 0.02

Analysis

EncryptEcb and DecryptEcb appear to be consistently 30~35% slower than the old approach of ICryptoTransform.TransformFinalBlock.

The bench code is a slightly modified version of @stephentoub example from his "Performance Improvements in .NET 6" post, which seems to suggest that these new one-shot methods are supposed to be faster - not slower.

@GrabYourPitchforks @bartonjs

ghost commented 3 years ago

Tagging subscribers to this area: @bartonjs, @vcsjones, @krwq, @GrabYourPitchForks See info in area-owners.md if you want to be subscribed.

Issue Details
### Description ```csharp void Main() => BenchmarkRunner.Run(); public class Bench { byte[] _key, _plaintext, _ciphertext; [GlobalSetup] public void Setup() { _key = RandomNumberGenerator.GetBytes(32); _plaintext = Encoding.UTF8.GetBytes("This is a test. This is a test. "); using var tempAes = Aes.Create(); tempAes.Key = _key; _ciphertext = tempAes.EncryptEcb(_plaintext, PaddingMode.None); } [Benchmark(Baseline = true)] public byte[] Old_Enc() { using Aes aes = Aes.Create(); aes.Padding = PaddingMode.None; aes.Mode = CipherMode.ECB; using var transform = aes.CreateEncryptor(_key, null); return transform.TransformFinalBlock(_plaintext, 0, _plaintext.Length); } [Benchmark] public byte[] Old_Dec() { using Aes aes = Aes.Create(); aes.Padding = PaddingMode.None; aes.Mode = CipherMode.ECB; using var transform = aes.CreateDecryptor(_key, null); return transform.TransformFinalBlock(_ciphertext, 0, _ciphertext.Length); } [Benchmark] public byte[] New_Enc() { using Aes aes = Aes.Create(); aes.Key = _key; return aes.EncryptEcb((ReadOnlySpan)_plaintext, PaddingMode.None); } [Benchmark] public byte[] New_Dec() { using Aes aes = Aes.Create(); aes.Key = _key; return aes.DecryptEcb((ReadOnlySpan)_ciphertext, PaddingMode.None); } } ``` ### Configuration ``` BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1202 (20H2/October2020Update) Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores .NET SDK=6.0.100-preview.7.21379.14 [Host] : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT ``` ### Results | Method | Mean | Error | StdDev | Ratio | RatioSD | |-------- |-----------:|---------:|---------:|------:|--------:| | Old_Enc | 725.8 ns | 7.82 ns | 6.53 ns | 1.00 | 0.00 | | Old_Dec | 805.9 ns | 15.83 ns | 16.25 ns | 1.11 | 0.02 | | New_Enc | 977.5 ns | 7.90 ns | 6.60 ns | 1.35 | 0.02 | | New_Dec | 1,016.8 ns | 14.85 ns | 13.16 ns | 1.40 | 0.02 | ### Analysis `EncryptEcb` and `DecryptEcb` appear to be consistently 30~35% slower than the old approach of `ICryptoTransform.TransformFinalBlock`. The bench code is a slightly modified version of @stephentoub example from his "[Performance Improvements in .NET 6](https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-6/)" post, which seems to suggest that these new one-shot methods are supposed to be faster - not slower. @GrabYourPitchforks @bartonjs
Author: sdrapkin
Assignees: -
Labels: `area-System.Security`, `tenet-performance`, `untriaged`
Milestone: -
vcsjones commented 3 years ago

Performance improvements for the one shots are being tracked in #55601.

GrabYourPitchforks commented 3 years ago

Yeah, I'm not too worried about the perf differences here. It's a fixed overhead per method invocation, not 30\~35% slower as a whole. And even with the perf measurements here, the fixed overhead for small inputs as measured is only 4\~11%. And as Kevin said, the overhead can be further reduced once we find some extra time to tackle it.

sdrapkin commented 3 years ago

Of course it's a fixed overhead per method invocation - if the actual ECB encryption/decryption call to BCryptDecrypt was somehow 30% slower to transform destination buffers, that would've been quite a feat (it's not).

Let's do another bench - this time without new-array creation (ie. trying to be as lean as possible):

void Main() => BenchmarkRunner.Run<Bench>();

public class Bench
{
    byte[] _key, _plaintext, _ciphertext;
    Aes _aes;

    [GlobalSetup]
    public void Setup()
    {
        _key = RandomNumberGenerator.GetBytes(32);
        _plaintext = Encoding.UTF8.GetBytes("This is a test. This is a test. ");
        _ciphertext = new byte[_plaintext.Length];

        using var aes = Aes.Create();
        aes.Padding = PaddingMode.None;
        aes.Mode = CipherMode.ECB;
        _aes = aes;
    }

    [Benchmark(Baseline = true)]
    public void Old_Enc()
    {
        using var encryptor = _aes.CreateEncryptor(rgbKey: _key, rgbIV: null);
        encryptor.TransformBlock(inputBuffer: _plaintext, inputOffset: 0, inputCount: _plaintext.Length, outputBuffer: _ciphertext, outputOffset: 0);
    }

    [Benchmark]
    public void New_Enc()
    {
        _aes.Key = _key;
        _aes.TryEncryptEcb(plaintext: _plaintext, destination: _ciphertext, PaddingMode.None, out int _);
    }
}
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1202 (20H2/October2020Update)
Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.7.21379.14
  [Host] : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT
Method Mean Error StdDev Ratio RatioSD
Old_Enc 567.5 ns 8.59 ns 7.18 ns 1.00 0.00
New_Enc 817.6 ns 14.65 ns 15.68 ns 1.45 0.03

If it is not performance, what is the purpose of all these new one-shot APIs? Is it merely call-convenience and support for Span<byte> which is missing from ICryptoTransform? The .NET 6 docs have not yet been updated for these new APIs...

Hopefully the .NET team will find some time for perf improvements (performance is a feature).

GrabYourPitchforks commented 3 years ago

If it is not performance, what is the purpose of all these new one-shot APIs?

I never said performance was a non-goal. But the primary goal was ease of use.

Let's do another bench - this time without new-array creation (ie. trying to be as lean as possible)

Your benchmarks are still measuring different things. For example, your New_Enc method measures the cost of setting the Key property (which has non-trivial logic), whereas your Old_Enc method does not measure this cost. Additionally, TransformBlock is not guaranteed to perform equivalent work to TryEncryptEcb.

Your original benchmark code was more representative of comparing old-style code to an equivalent one-shot call.

sdrapkin commented 3 years ago

Your benchmarks are still measuring different things.

I would respectfully disagree and argue that my benchmark is measuring the same things: a way to achieve a specific task with old-vs-new APIs. The number/type of methods called are due to how the APIs are designed by Microsoft (both old and new). @stephentoub benchmark did the same thing by comparing the CryptoStream creation with old API versus no-CryptoStream EncryptECB API, and claiming the new API is faster (ie. he is measuring the perf of accomplishing a task).

But let me explain why the Key property needs to be set. The task is to benchmark the following per-call:

The way the old API is designed, one has to set ECB-mode and Padding on the Aes object, but the Key is passed to the CreateEncryptor call.

The way the new API is designed, one sets ECB-mode by calling the correspondingly-named (Try)EncryptECB method, and the Padding is parameter-passed. But for some reason the Key is not part of the one-shot methods, and needs to be set on the Aes object.

Is there a better, more equivalent/idiomatic/"fair" way of calling the new APIs with a new Key-per-call? If not, the current design also makes one-shot methods not really one-shot if the Key must first be set/reset on Aes prior to every call. Perhaps I missed something, since the new APIs are not yet documented - please share a better benchmark to accomplish the above task with new one-shot APIs...

bartonjs commented 3 years ago

Is there a better, more equivalent/idiomatic/"fair" way of calling the new APIs with a new Key-per-call?

No. If your use case is really just using the Aes instance as an algorithm-factory, then you've done the best you can do. We made the new methods in the style of the methods on AesGcm/AesCcm/ChaCha20Poly1305, where we made the instance represent a key and the methods represent an operation with the key.

The way the old API is designed, ... the Key is passed to the CreateEncryptor call.

That is one of the two paradigms it supported. That paradigm doesn't work for CNG's persisted symmetric keys, and (IIRC) has significantly lower usage. The more common paradigm is using the Key and IV properties and calling the zero-argument CreateEncryptor/CreateDecryptor, which also works for AesCng and TripleDESCng when they are wrapped over a persisted key. (That is, the zero-argument CreateEncryptor works for persisted keys, setting the Key property would obviously not be the right thing to do in that situation)

deeprobin commented 2 years ago

I could not find a notable difference in benchmarking between the .NET 6 and .NET 7 implementations.

I think in TransformFinal/TransformOneShot is bottle neck

Profiled New_Enc: grafik

Profiled Old_Enc: grafik

Specs: Win x64

deeprobin commented 2 years ago

In TryEncryptEcbCore we call BCryptImportKey every time we call TryEncryptEcbCore (see profiled New_Enc). Is it not possible to cache something like this?

Also, these fixed constructs worry me. Couldn't this be done better with Unsafe methods or Spans? https://github.com/dotnet/runtime/blob/ef0e1a9b7374339b0d6fff18c2404399c6574f7c/src/libraries/Common/src/Interop/Windows/BCrypt/Interop.BCryptImportKey.cs#L14-L51

We also often call CopyTo on Spans which is actually slower than using CopyTo on array level (https://github.com/dotnet/runtime/issues/29093).

Also, using a CryptoPool (which is just a reference to ArrayPool<byte>.Shared - AFAIK no special security mechanism behind it) may not make sense in every case.

For example, if the output span falls below a certain length you might consider using a stackalloc.

vcsjones commented 2 years ago

Is it not possible to cache something like this?

It was not until very recently, and it was only recently determined that breaking thread safety of the one-shots was an acceptable thing to do.

I have a few proof of concepts ongoing to improve performance of the one shots and some recent refactoring to make those possible have been only very recently merged.

vcsjones commented 2 years ago

Also, these fixed constructs worry me.

What specifically is your concern?

vcsjones commented 2 years ago

To summarize though - work is ongoing for .NET 7 that should improve the situation.

bartonjs commented 2 years ago

Also, using a CryptoPool (which is just a reference to ArrayPool.Shared - AFAIK no special security mechanism behind it)

It was originally separate, but ended up being separate-per-assembly. Now that we've condensed 6 assemblies down to 1 it'll probably go back to being a completely separate pool.

deeprobin commented 2 years ago

Is it not possible to cache something like this?

It was not until very recently, and it was only recently determined that breaking thread safety of the one-shots was an acceptable thing to do.

I have a few proof of concepts ongoing to improve performance of the one shots and some recent refactoring to make those possible have been only very recently merged.

* [Pass IV through cipher resets for OpenSSL and Android #64641](https://github.com/dotnet/runtime/pull/64641) was a refactoring to make cipher instances cachable

* [One-shot decrypt directly in to destination with padding #65084](https://github.com/dotnet/runtime/pull/65084) was a proof-of-concept to avoid a pool rental in some cirumstances. I will revive this soon, but it needed to be closed because of some complexities with the CNG implementation that were resolved in [Use lite ciphers for CNG one shots #65151](https://github.com/dotnet/runtime/pull/65151)

* https://github.com/vcsjones/runtime/tree/cache-cipher-oneshots is a branch I spiked out to cache cipher instances which will make multiple implementations of the one shot faster.

To summarize though - work is ongoing for .NET 7 that should improve the situation.

Thats very nice 👍🏼

Also, these fixed constructs worry me.

What specifically is your concern?

Nothing special but somehow I have the feeling that something can be improved. But I am not an interop expert. Codegen is on x86-64 pretty big though [^1].

It was originally separate, but ended up being separate-per-assembly. Now that we've condensed 6 assemblies down to 1 it'll > probably go back to being a completely separate pool.

Okay. As I said, the use of a pool might not always useful. At least not if we allocate something on the stack.

[^1]: sharplab - Godbolt Compiler Explorer

magole commented 2 years ago

@vcsjones does each call to Aes.EncryptEcb result in a key import/expansion and a single encrypt call?

Is it safe to create and set key on AES once and then call EncryptEcb from multiple threads?

Reason I'm asking is for deterministic encryption with the same key from multiple threads. AES key expansion is pretty slow.