Request for hardware based rng, and integration with vector struct.

Xyncgas commented 2 years ago

Instead of using Random.nextbytes, which has several down side : Array lives in the heap System.Random is slower than hardware noise.

I would like a hardware based rng that takes a vector of two bytes and mutate it. it would speed up my program.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-numerics See info in area-owners.md if you want to be subscribed.

Issue Details

Instead of using Random.nextbytes, which has several down side : Array lives in the heap System.Random is slower than hardware noise. I would like a hardware based rng that takes a vector of two bytes and mutate it. it would speed up my program.

Author:	Xyncgas
Assignees:	-
Labels:	`area-System.Numerics`, `untriaged`
Milestone:	-

GrabYourPitchforks commented 2 years ago

Are you absolutely sure this would speed up your program, especially if you're generating 2 bytes at a time? A quick web search shows that rdrand's performance is inferior to nearly every modern software based RNG out there.

What exactly are you trying to do?

danmoseley commented 2 years ago

Array lives in the heap

Random.NextBytes does not allocate on the heap. It writes to the array you pass in. Don't you need that same array whatever random implementation you use?

Xyncgas commented 2 years ago

Are you absolutely sure this would speed up your program, especially if you're generating 2 bytes at a time? A quick web search shows that rdrand's performance is inferior to nearly every modern software based RNG out there.

What exactly are you trying to do?

It's for research. training AI model

Xyncgas commented 2 years ago

Yes the program is taking too long to train model while we are just generating random number then comparing it, there's other operations but I doubt it comes anywhere to the System.Random itself.

Xyncgas commented 2 years ago

Array lives in the heap

Random.NextBytes does not allocate on the heap. It writes to the array you pass in. Don't you need that same array whatever random implementation you use?

If I can use Random.NextByte on a stack allocated array, then forget about generating a random vector of two bytes, otherwise using heap allocated array I don't think is faster than reading my 2bytes from stack even that array is already allocated in the heap and I am just trying to read it.

I am trying to optimize this part because it is absolutely what we are doing, as you can imagine because we are training AI model there's a lot computation and I am not we are not doing anything we don't need to do.

We are at the point of calling System.Random.Next() once, then take 2 bytes from this integer so we are avoiding calling it twitce. Still, we are focusing optimizations on this major part where we are needing 2 bytes and then mutate them using randomness.

that is why I wanted to make sure the bytes are in the stack, because it's faster than creating byte[2] or reading byte[2] that's in the heap and we might need to actually create byte[2] many times instead of reusing memory because we are parallelizing the operations, also that is why I wanted to have hardware based rng in hoping that it would be faster than System.Random while not loosing the quality of randomness which we believe would only make our model longer to train.

Xyncgas commented 2 years ago

I am also aware that there is GPU based rng, probably in Nvida's cuda that we can use to train our models, but we are working on a new way to train that's why we aren't using existing frameworks like tensor flow, the ai we are making is based on a new theory that we are currently researching, that's why we are looking for various options since we have to make our own framework from the scratch. Which includes the possibility of not using GPU but if we can get it running in .NET, we see a future where we can embed AI in browser or some specialized devices that has a lot of hardware threads to run it it's what I like about dotnet So right now I want to see, if we get this part right, how fast can the AI run.

Xyncgas commented 2 years ago

Are you absolutely sure this would speed up your program, especially if you're generating 2 bytes at a time? A quick web search shows that rdrand's performance is inferior to nearly every modern software based RNG out there.

What exactly are you trying to do?

I also get a hint that part of what you are trying to say is we are creating 2 bytes that might be sounding too underwhelming, yet we don't buy that generating 256 bytes at once in the stack using some magic function that gives us random bytes, is faster than generating it when we need it because it's just math, which is the same operations that needs to be executed to compute a random number, but I might buy that if someone tells me they know the algorithm in System.Random really well and running it 256 times together is faster due to branches or stuffs in the implementation of the algorithm, it's not like we are trying to allocate a block of memory, which is faster if you allocate a big block of memory than many small ones, since we are assuming that these bytes will live on the stack and die on stack.

I do need to do this many times, let's just say currently it takes ~1 days at 1 million cycles/sec of what we are doing (I actually ran our prototype, both the result from running it and statistics of our algorithm gives us this number, which can be faster if we run it on faster cpu but we get a feeling that this is at about where things are kinda taking a while but we haven't done all the optimizations and we can get it down even lower, at which we can get the model trained. I am not making things up we we've found an answer to Turing's Halting Problem we are actually working on a paper that's another kinds of AI that opens up many possibilities & use cases that needs someone to point to us about how to do these software sides of things so badly, and I am asking the questions to the dotnet people, based on my observations after analysing our implementations that that is what we need). which isn't bad in my opinion, we haven't made it parallel and there's other optimizations we haven't done, but this is one of the big parts that we need to make sure it's fast but we can't find any way to do it in .NET : because I am thinking System.Random is too slow for training AI hardware based random number generators might be faster, if we can use vector then there's a possibility we can use SIMD somewhere.

Xyncgas commented 2 years ago

I've hided my answer to your questions because they are out of topic of this feature, which're my answers to why I need this.

stephentoub commented 2 years ago

If I can use Random.NextByte on a stack allocated array

You can use NextBytes with a span, which can be stack allocated, alias some other existing memory, etc.

Random was also rewritten for .NET 6 and is much faster than it was in previous releases.

we don't buy that generating 256 bytes at once in the stack using some magic function that gives us random bytes, is faster than generating it when we need it because it's just math, which is the same operations that needs to be executed to compute a random number, but I might buy that if someone tells me they know the algorithm in System.Random really well and running it 256 times together is faster due to branches or stuffs in the implementation of the algorithm, it's not like we are trying to allocate a block of memory, which is faster if you allocate a big block of memory than many small ones, since we are assuming that these bytes will live on the stack and die on stack.

It's faster to do more at the same time because the underlying algorithm works on values larger than 2 bytes as the core primitive; a single cycle can end up processing more bytes at a time. If you only end up using two bytes of the result, you're throwing away pieces of the computation. This isn't theoretical: here's a simple benchmark you can try:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;

public partial class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private Random _r = new Random();
    private byte[] _buffer = new byte[256];

    [Benchmark]
    public void Bytes2() => _r.NextBytes(_buffer.AsSpan(0, 2));

    [Benchmark]
    public void Bytes256() => _r.NextBytes(_buffer.AsSpan(0, 256));
}

On my machine, I get:

Method	Runtime	Mean	Error	StdDev	Ratio
Bytes2	.NET 5.0	18.063 ns	0.0458 ns	0.0429 ns	1.00
Bytes2	.NET 6.0	4.575 ns	0.0135 ns	0.0120 ns	0.25

Bytes256	.NET 5.0	2,008.601 ns	1.6781 ns	1.4013 ns	1.00
Bytes256	.NET 6.0	34.089 ns	0.0414 ns	0.0367 ns	0.02

So when getting 256 bytes at a time, we're paying 0.13ns per byte, and when getting 2 bytes at a time, we're paying 2.29ns per byte, making it ~18x faster to get 256 bytes at a time. On top of that, getting 256 bytes on .NET 6 is ~60x faster than it was in .NET 5.

Xyncgas commented 2 years ago

Thanks for the insightful information. I had no idea .NET 6 used Xorshiro256** for System.Random, of course this would be fast, and of course since I know this algorithm it uses UINT64 internally which explains that it's just wasting bytes if I just throw it away after generating two bytes. I should close this issues for now, before that I am interested in knowing, what's your opinion for the reason System.Random changed its algorithm regarding the willingness to break compatibility (when calling seed() in different .net versions generates the different results of random number sequence) just for performance And since I can use stack allocated bytes with System.Random I am feeling so happy.

stephentoub commented 2 years ago

what's your opinion for the reason System.Random changed its algorithm regarding the willingness to break compatibility (when calling seed() in different .net versions generates the different results of random number sequence) just for performance

It doesn't. If a seed is provided, it falls back to using the same algorithm previously used.

dotnet / runtime

Request for hardware based rng, and integration with vector struct. #62110