Closed by henricj 1 year ago
@henricj, that's true: the current implementation produces biased random values. `Random.Next` uses an unbiased version; we might do the same for the cryptographic RNG. Feel free to make a PR. The primary idea was to touch the RNG once, outside of the loop, to keep the perf overhead minimal.
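As a tiny illustration of the bias being discussed (a self-contained sketch, not library code): reducing a uniform source into a range via plain modulo over-represents the low values whenever the range size does not evenly divide the source's period.

```csharp
using System;

// Reduce a uniform 3-bit source (0..7) into 3 buckets via plain modulo
// and count how often each bucket is hit.
var counts = new int[3];
for (uint source = 0; source < 8; source++)
    counts[source % 3]++;

// Buckets 0 and 1 receive 3 hits each, bucket 2 only 2: modulo is biased.
Console.WriteLine(string.Join(",", counts)); // prints "3,3,2"
```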
There is an unbiased version of `NextUInt32` that can be introduced to generate indexes:
```csharp
static uint NextUInt32(this RandomNumberGenerator random, uint maxValue)
{
    Debug.Assert(random is not null);

    Unsafe.SkipInit(out uint source);
    random.GetBytes(Span.AsBytes(ref source));

    // Lemire's multiply-shift reduction; reject the biased low part.
    var m = (ulong)source * maxValue;
    var low = (uint)m;
    if (low < maxValue)
    {
        for (var t = unchecked(~maxValue + 1U) % maxValue; low < t; low = (uint)m)
        {
            random.GetBytes(Span.AsBytes(ref source));
            m = (ulong)source * maxValue;
        }
    }

    return (uint)(m >> 32);
}
```
The main disadvantage of this approach: `NextBytes` must be called N times, where N is the length of the output string.
Yeah, the CPU time is going to be dominated by calls to `random.GetBytes()` (this will likely be something along the lines of NIST's CTR_DRBG, which will get re-seeded for every call and produce output in 128-bit blocks). Generating a batch of random values and passing them instead of a `RandomNumberGenerator` instance would be faster, but messy since the number of needed random values is not known ahead of time (C# 11's new `scoped` keyword might even come into play).
The simplest solution is to call `RandomNumberGenerator.GetInt32(Int32)`, at least where requiring .NET Standard 2.1 or .NET Core is okay.
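For illustration, a minimal sketch of that approach (the alphabet and length below are arbitrary choices, not anything from the library); `RandomNumberGenerator.GetInt32(Int32)` performs unbiased rejection sampling internally:

```csharp
using System;
using System.Security.Cryptography;

const string allowedChars = "0123456789abcdefghijklmnopqrstuvwxyz";
Span<char> buffer = stackalloc char[16];

for (var i = 0; i < buffer.Length; i++)
{
    // GetInt32 returns an unbiased value in [0, allowedChars.Length).
    buffer[i] = allowedChars[RandomNumberGenerator.GetInt32(allowedChars.Length)];
}

Console.WriteLine(new string(buffer));
```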
According to this paper, the nearly divisionless method is very suitable for our case. The random buffer can be 4 × the output string length in bytes, because the `if` branch in this listing is very unlikely to be taken. The probability of using no division is (2^L - s)/2^L with L = 32, which is very close to 1 in our case for smaller s (the max value in the range). Typically, the user doesn't want random strings of hundreds of thousands or even millions of characters.
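To make that probability concrete, a throwaway computation (using an alphabet size of s = 64 as an example value, not anything mandated by the library):

```csharp
using System;

const uint s = 64;      // max value in the range (alphabet size), an example
var total = 1UL << 32;  // 2^L with L = 32
var pNoDivision = (double)(total - s) / total;

// 1 - 64/2^32 ≈ 0.9999999851: the divisionless fast path is taken almost always.
Console.WriteLine(pNoDivision);
```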
As a result, we can have two branches of execution:
- `allowedChars.Length` is a power of 2. This is the most optimized case.
- The general case pre-generates a vector of 4 × `buffer.Length` random bytes and just advances the pointer inside each iteration of the loop. If the pointer overflows (which is very unlikely according to the previous analysis), it generates a new vector and sets the pointer to its initial position (zero).

I did a quick profiler run with the above `NextUInt32()` suggestion, and over 90% of the time is spent generating those cryptographic random numbers. That suggests minimizing the total number of calls to `RandomNumberGenerator.GetBytes()` (or `.Fill()`).
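The power-of-2 branch mentioned above can be sketched like this (identifiers are illustrative, not the library's code): when the alphabet length is 2^k, masking with `length - 1` maps every random byte uniformly onto the alphabet, so no rejection is ever needed.

```csharp
using System;
using System.Security.Cryptography;

const string allowedChars = "0123456789abcdef"; // length 16, a power of 2
Span<byte> randomBytes = stackalloc byte[16];
RandomNumberGenerator.Fill(randomBytes);

Span<char> result = stackalloc char[randomBytes.Length];
var mask = allowedChars.Length - 1; // 0b1111

for (var i = 0; i < randomBytes.Length; i++)
{
    // Uniform because 256 is a multiple of 16; no rejection loop required.
    result[i] = allowedChars[randomBytes[i] & mask];
}

Console.WriteLine(new string(result));
```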
A couple of observations:
1. […] the `uint` path.
2. […] `RandomNumberGenerator.GetBytes()` will be needed.
3. […] `Random`. There might be a reasonable patch there...

All that being said, Knuth's warning about premature optimization being the root of all evil comes to mind. (At least, the above stuff is for fun; I wouldn't charge a paying client for such unless there was good indication that it would address a known performance bottleneck.)
Oh, and a point 6:
6. A security review would probably insist on any buffers or variables used to handle secrets being zeroized after use. With optimizing compilers and (in this case) managed memory, I'm not really sure how effective that is in reality, but security audits can insist on there being a nominal attempt in the code anyway (however amusingly inadequate the optimizer might find it). C23 is getting `memset_explicit()` to try to deal with it.
It is the responsibility of the caller to clean up the buffer of characters. But if the caller prefers to use the `string`-returning version, we have no chance to clean up anything.
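For callers that do use the span-based API, a minimal cleanup pattern could look like the sketch below; `CryptographicOperations.ZeroMemory` (available since .NET Core 3.0) exists precisely so the wipe cannot be elided by the optimizer:

```csharp
using System;
using System.Security.Cryptography;

Span<byte> secret = stackalloc byte[32];
RandomNumberGenerator.Fill(secret);
try
{
    // ... use the secret material ...
}
finally
{
    // A guaranteed, non-elidable wipe of the buffer.
    CryptographicOperations.ZeroMemory(secret);
}

Console.WriteLine(secret[0]); // prints 0
```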
Check out commit d6c798cf6dd5d9d6b805791944cc87509d312e23: `GetBytes` is outside of the loop, and it is very unlikely to be called again during generation of the random numbers.
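The idea behind that commit can be sketched as follows (a simplified illustration with made-up names, not the actual commit): pre-generate 4 × length bytes, consume one 32-bit word per character through Lemire's reduction, and refill only in the unlikely case that rejections exhaust the buffer.

```csharp
using System;
using System.Security.Cryptography;

string RandomString(string allowedChars, int length)
{
    var result = new char[length];
    // 4 bytes of randomness per character; normally filled exactly once.
    var buffer = new byte[length * sizeof(uint)];
    RandomNumberGenerator.Fill(buffer);
    var offset = 0;

    for (var i = 0; i < length; i++)
        result[i] = allowedChars[(int)NextIndex((uint)allowedChars.Length)];

    return new string(result);

    uint NextIndex(uint bound)
    {
        while (true)
        {
            if (offset > buffer.Length - sizeof(uint))
            {
                // Very unlikely: rejections exhausted the buffer, refill it.
                RandomNumberGenerator.Fill(buffer);
                offset = 0;
            }

            var source = BitConverter.ToUInt32(buffer, offset);
            offset += sizeof(uint);

            var m = (ulong)source * bound;
            var low = (uint)m;
            if (low >= bound)
                return (uint)(m >> 32); // fast path: no division at all
            if (low >= unchecked(~bound + 1U) % bound)
                return (uint)(m >> 32); // slow path: one division, still unbiased
        }
    }
}

Console.WriteLine(RandomString("abcdefghij", 16));
```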
The pre-populated vector of random values gives a significant performance gain for the pseudo-random generator as well:
Before optimization:
Method | Mean | Error | StdDev |
---|---|---|---|
RandomString | 234.2 ns | 1.86 ns | 1.65 ns |
GuidString | 1,496.2 ns | 12.01 ns | 11.24 ns |
CryptoRngString | 1,854.8 ns | 14.30 ns | 12.68 ns |
After:
Method | Mean | Error | StdDev |
---|---|---|---|
RandomString | 122.3 ns | 2.06 ns | 1.92 ns |
GuidString | 1,489.3 ns | 8.69 ns | 8.13 ns |
CryptoRngString | 1,830.1 ns | 22.41 ns | 20.96 ns |
The fast branch, taken when `allowedChars.Length` is a power of 2, is very efficient as well:
Method | Length | AllowedChars | Mean | Error | StdDev | Median |
---|---|---|---|---|---|---|
RandomString | 16 | 1234567890abcdef | 71.76 ns | 0.841 ns | 0.746 ns | 71.56 ns |
RandomString | 16 | 1234567890abcde | 177.23 ns | 1.792 ns | 1.676 ns | 176.94 ns |
Internal buffer cleanup is added (enabled by default), but it can be disabled using the `DotNext.Security.DisableRandomStringInternalBufferCleanup` app switch from the `AppContext` class.
You've been busy; it looks nice.
Only one question springs to mind: how would using an array of `uint` perform compared to the current `byte` array approach? That would remove the need for `Unsafe.ReadUnaligned()`. (It might matter more on ARM64.)
Ok, another question: Which runtime was used for the above performance testing? .NET 7 changed a lot of stuff.
.NET 6.0.11 (6.0.1122.52304), X64 RyuJIT AVX2
I'm not using .NET 7 on my dev machine, nor on DevOps hosted agents. I think that most people (including the company where I'm working) prefer to use LTS releases in production environments.
> You've been busy; it looks nice. Only one question springs to mind: how would using an array of `uint` perform compared to the current `byte` array approach? That would remove the need for `Unsafe.ReadUnaligned()`. (It might matter more on ARM64.)
Done. After all optimizations, including the removal of the `nuint` <-> `uint` roundtrip inside of the loop:
Method | Mean | Error | StdDev |
---|---|---|---|
RandomString | 90.97 ns | 0.407 ns | 0.361 ns |
GuidString | 1,504.66 ns | 29.490 ns | 33.961 ns |
CryptoRngString | 1,825.52 ns | 9.935 ns | 11.042 ns |
Added extra optimizations dictated by JIT behavior: RyuJIT prevents inlining of methods containing loops. In our case, the following loop:
```csharp
if (low < maxValue)
{
    for (var t = unchecked(~maxValue + 1U) % maxValue; low < t; low = (uint)m)
    {
        random.GetBytes(Span.AsBytes(ref source));
        m = (ulong)source * maxValue;
    }
}
```
will probably never be executed. However, it prevents inlining of the entire method, which is called inside another loop by the caller. Thus, every iteration carries the overhead needed to call the method (saving registers, prologue, etc.).
The first idea is to force inlining with the `[MethodImpl(MethodImplOptions.AggressiveInlining)]` attribute applied to the `NextOffset` method. However, that is also bad, because the loop will be inlined too. Instead, the better approach is to move the rejection code containing that loop into a separate method and forbid its inlining using `[MethodImpl(MethodImplOptions.NoInlining)]`. In that case, the JIT is able to recognize `NextOffset` as inlineable without any special hint, and the inlined code remains small.
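The shape described above can be sketched like this (names and structure are illustrative, not the exact library code): the loop-free hot path stays small enough for RyuJIT to inline on its own, while the cold rejection loop sits behind `NoInlining`.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Security.Cryptography;

using var rng = RandomNumberGenerator.Create();
Console.WriteLine(UnbiasedSketch.NextOffset(rng, 10)); // a value in [0, 10)

static class UnbiasedSketch
{
    // Loop-free and small: eligible for inlining without any hint.
    internal static uint NextOffset(RandomNumberGenerator random, uint maxValue)
    {
        Span<byte> bytes = stackalloc byte[sizeof(uint)];
        random.GetBytes(bytes);
        var m = (ulong)BitConverter.ToUInt32(bytes) * maxValue;
        return (uint)m < maxValue ? Reject(random, maxValue, m) : (uint)(m >> 32);
    }

    // Cold path containing the loop, fenced off so it cannot block
    // inlining of the hot path above.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static uint Reject(RandomNumberGenerator random, uint maxValue, ulong m)
    {
        Span<byte> bytes = stackalloc byte[sizeof(uint)];
        for (var t = unchecked(~maxValue + 1U) % maxValue; (uint)m < t;)
        {
            random.GetBytes(bytes);
            m = (ulong)BitConverter.ToUInt32(bytes) * maxValue;
        }

        return (uint)(m >> 32);
    }
}
```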
> I'm not using .NET 7 on my dev machine, nor on DevOps hosted agents. I think that most people (including the company where I'm working) prefer to use LTS releases in production environments.
It depends very much on the company. If I were to guess, you would probably find at least as many that run ancient, random, unsupported versions of .NET as ones that favor LTS versions (nobody really cares or even knows until the duct tape and baling wire mess running on the Windows 2008 Server box under the receptionist's desk blows up). For those in more regulated industries or with systems that are difficult to upgrade (say, because it runs in an isolated box in the middle of nowhere), then yes: LTS builds. However, for the "every last CPU cycle" crowd, they often like to live on the bleeding edge. (Not that this is .NET framework specific in any way. It could just as well be PHP or Java or what-not.)
Did the inlined version make a noticeable impact on performance?
BTW, have you seen Lemire's SIMD JSON talk? Since then, they've added AVX-512 support.
https://simdjson.org/ There's a C# port, but it doesn't look like it's being maintained.
> Did the inlined version make a noticeable impact on performance?
Yes, 30% performance boost.
> BTW, have you seen Lemire's SIMD JSON talk? Since then, they've added AVX-512 support.
.NET doesn't have AVX-512 support, but it is planned for .NET 8. Anyway, I see no way we can use AVX to boost random string generation. Lemire's algorithm, as currently implemented for the general case (when `allowedInput.Length` is not a power of 2), cannot be vectorized due to the loop.
https://github.com/dotnet/dotNext/blob/24b6f07def13e008a91c4f6455ba2b686d93cf77/src/DotNext/RandomExtensions.cs#L45
Using the remainder modulo the length will generate biased results when `(uint)allowedChars.Length` is not a power of two. The version based on `Random` might not be terribly important, but if someone is deliberately using the cryptographic `RandomNumberGenerator` version, then one should probably make every effort to be as accurate as possible.
I haven't looked at what `System.Random` uses for its bounded `Next` implementations, but perhaps there is already an unbiased version there?
Here's Daniel Lemire's fast, unbiased version, albeit with an example implementation for 64-bit integers: Nearly Divisionless Random Integer Generation On Various Systems. The ACM paper linked in his blog post goes into more detail.
Granted, the execution time will almost certainly be dominated by calls to `RandomNumberGenerator`, which would be aggravated by an unbiased algorithm needing to discard some of the output from `rng.GetBytes(...)`. If that is a problem, one might want to handle the case where `allowedChars.Length` is less than 256, since that could be handled with 8-bit integers from the generator instead of the general case's 32-bit integers.
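That 8-bit idea, sketched concretely (illustrative names, not library code): for alphabets smaller than 256 characters, the same multiply-shift reduction works with L = 8, consuming one random byte per draw instead of four.

```csharp
using System;
using System.Security.Cryptography;

// Unbiased index in [0, bound) from single random bytes (bound < 256),
// using the same multiply-shift reduction with an 8-bit source.
static byte NextByteIndex(RandomNumberGenerator random, byte bound)
{
    Span<byte> b = stackalloc byte[1];
    random.GetBytes(b);
    var m = (uint)b[0] * bound;
    var low = (byte)m;
    if (low < bound)
    {
        // Rejection threshold: (2^8 - bound) % bound == 256 % bound.
        for (var t = 256 % bound; low < t; low = (byte)m)
        {
            random.GetBytes(b);
            m = (uint)b[0] * bound;
        }
    }

    return (byte)(m >> 8);
}

using var rng = RandomNumberGenerator.Create();
Console.WriteLine(NextByteIndex(rng, 62)); // an index into a 62-character alphabet
```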