dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License
1.62k stars 122 forks source link

Biased random ".NextString(...)" #138

Closed henricj closed 1 year ago

henricj commented 1 year ago

https://github.com/dotnet/dotNext/blob/24b6f07def13e008a91c4f6455ba2b686d93cf77/src/DotNext/RandomExtensions.cs#L45

Using the remainder modulo the length will generate biased results when (uint)allowedChars.Length is not a power of two. The version based on Random might not be terribly important, but if someone is deliberately using the cryptographic RandomNumberGenerator version, then one should probably make every effort to be as accurate as possible.

I haven't looked at what System.Random uses for its bounded Next implementations, but perhaps there is already an unbiased version there?

Here's Daniel Lemire's fast, unbiased version—albeit with an example implementation for 64-bit integers: Nearly Divisionless Random Integer Generation On Various Systems. The ACM paper linked in his blog post goes into more detail.

Granted, the execution time will almost certainly be dominated by calls to RandomNumberGenerator, which would be aggravated by an unbiased algorithm needing to discard some of the output from rng.GetBytes(...). If that is a problem, one might want to handle the case where allowedChars.Length is less than 256, since that could be handled with 8-bit integers from the generator instead of the general case's 32-bit integers.

sakno commented 1 year ago

@henricj , that's true, the current implementation produces biased random value. Random.Next uses unbiased version. We might to do the same for cryptographic RNG. Feel free to make a PR.

sakno commented 1 year ago

Primary idea was to touch RNG once outside of the loop to keep perf overhead minimal.

sakno commented 1 year ago

There is unbiased version of NextUInt32 that can be introduced to generate indexes:

static uint NextUInt32(this RandomNumberGenerator random, uint maxValue)
    {
        Debug.Assert(random is not null);

        Unsafe.SkipInit(out uint source);
        random.GetBytes(Span.AsBytes(ref source));
        var m = (ulong)source * maxValue;

        var low = (uint)m;
        if (low < maxValue)
        {
            for (var t = unchecked(~maxValue + 1U) % maxValue; low < t; low = (uint)m)
            {
                random.GetBytes(Span.AsBytes(ref source));
                m = (ulong)source * maxValue;
            }
        }

        return (uint)(m >> 32);
    }

Main disadvantage of this approach: NextBytes must be called N times where N is a length of output string.

henricj commented 1 year ago

Yeah, the CPU time is going to be dominated by calls to random.GetBytes() (this will likely be something along the lines of NIST's CTR_DRBG, which will get re-seeded for every call and produce output in 128-bit blocks). Generating a batch of random values and passing them instead of a RandomNumberGenerator instance would be faster, but messy since the number of needed random values is not known ahead of time (C# 11's new scoped keyword might even come into play).

The simplest solution is to call RandomNumberGenerator.GetInt32(Int32), at least where requiring .NET Standard 2.1 or .NET Core is okay.

sakno commented 1 year ago

According to this paper, nearly divisionless method is very suitable for our case. The length of random can be 4 X output string length bytes, because if branch in this listing is very unlikely.

The probability of using no division is (2^L - s)/2^L, which is very close to 1 in our case for smaller s (max value in the range). L is 32. Typically, the user don't want to have random strings with hundreds of thousands or even millions characters.

sakno commented 1 year ago

As a result, we can have two branches of execution:

  1. If allowedChars.Length is a power of 2. This is the most optimized case
  2. Otherwise, populates a random vector of size 4 X buffer.Length and just advances the pointer inside of each iteration of the loop. If pointer overflows (which is very unlikely according to prev analysis), generates a new vector in sets pointer to its initial position (zero).
henricj commented 1 year ago

I did a quick profiler run with the above NextUInt32() suggestion and over 90% of the time is spent generating those cryptographic random numbers. That suggests minimizing the total number of calls to RandomNumberGenerator.GetBytes() (or .Fill()).

A couple of observations:

  1. I would guess that most of the time this will be used for generating tokens of ASCII characters. As such, the length of the string will almost always be less than 256 (62 for [A-Za-z0-9] might not be unusual). As such, one might consider special-casing generating bounded 16-bit numbers; it might be faster than a more general uint path.
  2. At least the initial multiplication part of the algorithm might benefit from SIMD.
  3. A special case for a power of two might or might not save more time than the test for that power of two.
  4. The size of the pre-computed vector of random numbers should probably be rounded up to the nearest multiple of the generator's output size (likely 16-bytes). Asking for less data will not save many CPU cycles (if any, especially if the RNG implementation itself has a special case for handling the last output bit that doesn't fit a whole block of output) and the extra random values might turn out to be useful when bad luck triggers the rejection path, thereby further reducing the odds that another call to RandomNumberGenerator.GetBytes() will be needed.
  5. It might be interesting to compare the Lemier algorithm's performance to the current BCL's Random. There might be a reasonable patch there...

All that being said, Knuth's warning about premature optimization being the root of evil comes to mind. (At least, the above stuff is for fun; I wouldn't charge a paying client for such unless there was good indication that it would address a known performance bottleneck.)

henricj commented 1 year ago

Oh, and a point 6:

  1. A security review would probably insist on any buffers or variables used to handle secrets being zeroized after use. With optimizing compilers and (in this case) managed memory, I'm not really sure how effective that is in reality, but security audits can insist on there being a nominal attempt in the code anyway (however amusingly inadequate the optimizer might find it).

C23 is getting "memset_explicit()" to try to deal with it.

sakno commented 1 year ago

Oh, and a point 6:

6. A security review would probably insist on any buffers or variables used to handle secrets being zeroized after use.  With optimizing compilers and (in this case) managed memory, I'm not really sure how effective that is in reality, but security audits can insist on there being a nominal attempt in the code anyway (however amusingly inadequate the optimizer might find it).

C23 is getting "memset_explicit()" to try to deal with it.

This is the responsibility of the caller to cleanup the buffer of characters. But if the caller prefer to use string version, we have no chance to cleanup anything.

Check out d6c798cf6dd5d9d6b805791944cc87509d312e23 commit, GetBytes is outside of the loop and it is very low likelihood to call it again during generation of random numbers.

sakno commented 1 year ago

Pre-populated vector of random values gives significant performance gain for pseudo-random generator as well:

Before optimization:

Method Mean Error StdDev
RandomString 234.2 ns 1.86 ns 1.65 ns
GuidString 1,496.2 ns 12.01 ns 11.24 ns
CryptoRngString 1,854.8 ns 14.30 ns 12.68 ns

After:

Method Mean Error StdDev
RandomString 122.3 ns 2.06 ns 1.92 ns
GuidString 1,489.3 ns 8.69 ns 8.13 ns
CryptoRngString 1,830.1 ns 22.41 ns 20.96 ns
sakno commented 1 year ago

Fast branch when allowedChars.Length is a power of 2 is very efficient as well:

Method Length AllowedChars Mean Error StdDev Median
RandomString 16 1234567890abcdef 71.76 ns 0.841 ns 0.746 ns 71.56 ns
RandomString 16 1234567890abcde 177.23 ns 1.792 ns 1.676 ns 176.94 ns
sakno commented 1 year ago

Internal buffer cleanup is added (enabled by default), but can be disabled using DotNext.Security.DisableRandomStringInternalBufferCleanup app switch from AppContext class

henricj commented 1 year ago

You've been busy; it looks nice.

Only one question springs to mind: how would using an array of uint perform compared to current byte array approach? That would remove the need for Unsafe.ReadUnaligned(). (It might matter more on ARM64.)

henricj commented 1 year ago

Ok, another question: Which runtime was used for the above performance testing? .NET 7 changed a lot of stuff.

sakno commented 1 year ago

Ok, another question: Which runtime was used for the above performance testing? .NET 7 changed a lot of stuff.

.NET 6.0.11 (6.0.1122.52304), X64 RyuJIT AVX2

I'm not using .NET 7 on my dev machine nor DevOps hosted agents. I think that the most people (including the company where I'm working) prefer to use LTS in production environment.

sakno commented 1 year ago

You've been busy; it looks nice.

Only one question springs to mind: how would using an array of uint perform compared to current byte array approach? That would remove the need for Unsafe.ReadUnaligned(). (It might matter more on ARM64.)

Done. After all optimizations including removal of nuint <-> uint roundtrip inside of the loop:

Method Mean Error StdDev
RandomString 90.97 ns 0.407 ns 0.361 ns
GuidString 1,504.66 ns 29.490 ns 33.961 ns
CryptoRngString 1,825.52 ns 9.935 ns 11.042 ns
sakno commented 1 year ago

Added extra optimizations dictated by JIT behavior. RyuJIT prevents inlining of methods containing loops. In our case, the following loop:

if (low < maxValue)
        {
            for (var t = unchecked(~maxValue + 1U) % maxValue; low < t; low = (uint)m)
            {
                random.GetBytes(Span.AsBytes(ref source));
                m = (ulong)source * maxValue;
            }
        }

probably will never be executed. However, it prevents inlining of entire method which is called inside of another loop by the caller. Thus, every iteration has overhead needed to call the method (save registers, prologue, etc.).

The first idea is to force inlining by [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute applied to NextOffset method. However, it is also bad because the loop will be inlined too. Instead, the better approach is to place rejection code with that loop to the separated method and forbids its inlining using [MethodImpl(MethodImplOptions.NoInlining)]. In that case, JIT is able to recognize NextOffset as inlineable without any special hint and the inlined code remains small.

henricj commented 1 year ago

Ok, another question: Which runtime was used for the above performance testing? .NET 7 changed a lot of stuff.

.NET 6.0.11 (6.0.1122.52304), X64 RyuJIT AVX2

I'm not using .NET 7 on my dev machine nor DevOps hosted agents. I think that the most people (including the company where I'm working) prefer to use LTS in production environment.

It depends very much on the company. If I were to guess, you would probably find at least as many that run ancient, random, unsupported versions of .NET as ones that favor LTS versions (nobody really cares or even knows until the duct tape and baling wire mess running on the Windows 2008 Server box under the receptionist's desk blows up). For those in more regulated industries or with systems that are difficult to upgrade (say, because it runs in an isolated box in the middle of nowhere), then yes: LTS builds. However, for the "every last CPU cycle" crowd, they often like to live on the bleeding edge. (Not that this is .NET framework specific in any way. It could just as well be PHP or Java or what-not.)

henricj commented 1 year ago

Did the inlined version make a noticeable impact on performance?

BTW, have you seen Lemire's SIMD JSON talk? Since then, they've added AVX-512 support.

https://simdjson.org/ There's a C# port, but it doesn't look like it's being maintained.

sakno commented 1 year ago

Did the inlined version make a noticeable impact on performance?

Yes, 30% performance boost.

BTW, have you seen Lemire's SIMD JSON talk? Since then, they've added AVX-512 support.

.NET doesn't have AVX-512 support but it is planned in 8th version. Anyway, I see no way how we can use AVX to boost random string generation. Lemier's algorithm currently implemented for general case (when allowedInput.Length is not a power of 2) cannot be vectorized due to the loop.