dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

https://docs.microsoft.com/dotnet/core/

MIT License

15.25k stars 4.73k forks source link

Add 64 bits support to Array underlying storage #12221

Open GPSnoopy opened 5 years ago

GPSnoopy commented 5 years ago

While System.Array API supports LongLength and operator this[long i], the CLR does not allow arrays to be allocated with more than 2^31-1 elements (int.MaxValue).

This limitation has become a daily annoyance when working with HPC or big data. We frequently hit this limit.

Why this matters

.NET arrays is the de facto storage unit of many APIs & collections.
Currently one has to allocate native memory instead, which does not work with managed code (Span could have helped, but it's been limited to int32 as well).
HPC frameworks expects data to be contiguous in memory (i.e. the data cannot be split into separate arrays). E.g. BLAS libraries.
Requires applications to be coded and designed differently if they handle <2G elements or >2G elements. I.e. It does not scale.
It's an arbitrary limitation with little value to the user and application:
- With a desktop with 64GB, why can one only allocate 3.1% of its memory in one go?
- With a data centre machine with 1,536GB, why can one only allocate 0.1% of its memory in one go?

In C++ this is solved with std::size_t (whose typedef changes depending on the target platform). Ideally, .NET would have taken the same route when designing System.Array. Why they haven't is a mystery, given that AMD64 and .NET Framework appeared around the same time.

Proposal I suggest that when the CLR/JIT runs the .NET application in x64, it allows the array long constructor to allocate more than int.MaxValue items:

Indexing the array with operator this[long i] should work as expected and give access to the entire array.
Indexing the array with operator this[int i] should work as expected but implicitly limit the access to only the first int.MaxValue elements.
The LongLength property should return the total number of elements in the array.
The Length property should return the total number of elements in the array, or throw OverflowException of there are more than int.MaxValue elements (this matches the current behaviour on multi-dimensional arrays with more than int.MaxValue elements).

I naively believe that the above should not break any existing application.

Bonus points for extending 64-bit support to Span and ReadOnlySpan.

SupinePandora43 commented 2 years ago

I kinda support breaking backwards compatibility as it confuses writing modern code. But i think it's possible to keep binary compatibility while using native-sized integers for Length properties. Here's my idea: compiler support - automatically replace Length call with hidden UnsignedNativeLength . so, compiler will compile:

public ref struct Span<T> {
    private readonly nuint length;

    public nuint Length { get => length; }
}

public ref struct Span<T> {
    private readonly nuint length;

    // Compiler visible
    public nuint UnsignedNativeLength { get => length; }
    // Binary visible only
    public int Length { get => (int) Math.Min(int.MaxValue, UnsignedNativeLength); } // use some tricks here
}

And

nuint length = span.Length;

nuint length = span.UnsignedNativeLength;

Pros

Binary compatible
Length returns nuint

Cons

Performance

Cast performance (nuint -> int, int -> nuint)

Compatibility

All previously-written code won't be able to use more than int.MaxValue items, unless use nuints and being recompiled.
All previously-written code will have warnings "nuint Length->int cast"

IL

Strange nuint Length to nuint UnsignedNativeLength conversion
Compiler and IDE support to report Length as nuint type

(You can replace nuint in my code with nint, but i think we should use nuints because Length can't be less than 0.)

This can be implemented for researching on some roslyn branch.

hez2010 commented 2 years ago

I think this is becoming increasingly urgent. We even hit this issue in the compiler: https://github.com/dotnet/runtime/issues/66787

philjdf commented 1 year ago

Would be really great if something happened with this in 2023. A completely separate BigArray<T> implementation would seem to be a good way forward.

hez2010 commented 1 year ago

I recently hit this issue again and again when I tried to use dotnet for HPC and ML, and I had to give up and switched to python to do such thing as a result. This issue definitely has been a showstopper for users who want to use dotnet for scientific/ML/computation purpose.

/cc: @jkotas @tannergooding

houseofcat commented 1 year ago

I would like to hear an update on this if possible.

TheGuy920 commented 1 year ago

Damn, 3+ years later and this is still open... very sad. Let me at least have my Int64 arrays before ChatGPT takes over :(

tannergooding commented 1 year ago

Its not a trivial problem and one that may not see the benefits you'd think from adding support.

Having such huge allocations basically necessitates the data be immovable and be unmanaged. In which case having some NativeSpan would provide more overall benefit than allowing a T[] to be native sized. The main benefit from GC tracking would then simply be automatic allocation cleanup when the lifetime ends.

Once data gets that large, you start having to consider machine limitations (many machines won't have that much memory), whether having all that memory paged in at once is beneficial, and whether the entire algorithm/data layout should be refactored to better take advantage of such a large data set. At which point, you can often refactor the data to be streamed or chunked so that it can be more effectively (and often more efficiently) handled

neon-sunset commented 1 year ago

Although introducing changes to Array is likely impossible, NativeSpan or adjusting Memory and Spans to be nuint-sized (especially with spans, they are two registers either way) would really help! Specifically, if either option offers most if not all existing APIs available to regular spans (in case of NativeSpan variant). Plus, some of the span-based methods in CoreLib are just int->nuint change away from supporting these (since spans are often unpacked into ref + length).

Also, I would like to highlight a problematic sentiment expressed in this and other related discussions: It is unfortunate that the argument "users do not need this much memory" still lives to this day, because it seems like it is driven by the fact that arrays cannot be changed and appeals to a status quo as a reasonable justification, which is not correct.

A lot of consumer systems now have 32GB of RAM, reading array even multiple times large than 2GB is not uncommon for certain applications. For example LLaMA(2).cpp ports to C#, even those that use NativeMemory for the buffer of the model itself, are forced into code mostly relying on pointers, which leads to much worse developer experience where they could have stayed with idiomatic (and safe!) span-based code otherwise.

Arguably, this is a toy example, however, commercial code will face similar challenges in the domain of ML, which it seems .NET now cares about a lot (with TensorPrimitives, Microsoft itself heavily pivoting into the sector, etc.).

It would be great for the community to know if there is any work or at least discussions planned to address this for .NET 9 milestone. Thanks!

tannergooding commented 1 year ago

Also, I would like to highlight a problematic sentiment expressed in this and other related discussions: It is unfortunate that the argument "users do not need this much memory" still lives to this day, because it seems like it is driven by the fact that arrays cannot be changed and appeals to a status quo as a reasonable justification, which is not correct.

My own sentiment isn't that users don't need this much memory. I believe it is fairly common for various apps to have working sets well above 2GB in certain scenarios (games, imagine editing, machine learning, etc). Nor are .NET arrays themselves limited to 2GB.

We could go back and forth for days on different views around what "consumer systems" refers to and what the target audience is. Different domains (gamers vs casual users vs developers vs cloud providers vs laptop/tablet users) will all differ. At least for gamers, Steam Hardware Survey shows that the vast majority is still 16GB with 32GB being marginally ahead of 8GB.

For some code, cutting out these lower end machines will be acceptable; for others it won't. However, I don't think arguing around prevalence of that amount of memory is mainstream is a key factor here. Rather, instead, I think its worth considering what it means for the GC and what it means to work with such big data. -- Noting its not 2GB, but rather 2 billion elements. This is 2GB for byte/sbyte; but it is 8GB for float/int/uint. It's also 2 billion elements per dimension, so a mult-dimensional byte array can itself be more than 2GB of contiguous allocation already.

I think the most important factor is what it means for a typical algorithm to operate on the data; it's mostly simply put that working with small amounts of data and working with large amounts of data is very different. Your L1 is slower than accessing register, your L2 is slower than that, your L3 is slower than that and may have sharing considerations with other core complexes/dies, and your RAM is even slower than that. Each step is often at least 2 times (if not more) slower than the last.

When working with small amounts of data, these differences often don't matter. That is, it's not typically the bottleneck in your application and things work well without a lot of additional consideration. However, when you switch over to working with large datasets, the more naive algorithms start showing their weakness and can quickly become bottlenecked on memory. This is particularly the case if you're trying to load everything up front into one allocation (as one might due for the naive implementation).

Because of this, such systems often start to look at options such as buffering, performing non-temporal loads/stores, or even parallelizing the workload across many cores or other devices (such as the GPU). Once you start looking at these other options, you're typically no longer bound by requiring a single allocation and instead can spread the data across many different allocations and distribute the work.

Now, what is defined as "large" vs "small" is very dependent. Some examples are that pages are often 4kb and this then matches what a typical sector size is for many modern file systems (although some do go larger). Many GPUs have a required buffer alignment of 65kb, and many modern memcpy implementations start using non-temporal load/store around 256kb. The GC currently treats allocations more than 85kb as "large" and puts them on the large-object heap where they then get special considerations. Many file stream implementations default to a 2kb buffer for streaming data off the disk, in part due to the latency of spinning hard drives.

But the general point is that by looking towards these scenarios, you can restructure your code in a way that completely removes the requirement to support allocations that are more than 2 billion elements in length. That can in turn improve robustness and performance, can reduce the total working set of your application, and allow better scalability or distribution of your work.

So its not that I think users don't need to work with big data, but rather that providing support for very large allocations is providing a bad solution for the types of scenarios that might need to work with such data and it will ultimately hurt the ecosystem more than it will help it. I instead think we'd be much better off by looking at how to make it easier for devs to adjust their data and write their algorithms in a way that supports the features that are beneficial to large data. How to make it easier to work with things like SequenceReader, how to improve the ability to work with sparse data sets efficiently, how to more easily parallelize their computation across cores or devices, etc.

------ Noting again, this is my own sentiment on the topic. It may not be shared with the general .NET team and everyone may have their own views on this.

KalleOlaviNiemitalo commented 1 year ago

It's also 2 billion elements per dimension, so a mult-dimensional byte array can itself be more than 2GB of contiguous allocation already.

True, but a multidimensional array currently cannot have more than 2³²-1 = 0xFFFFFFFF elements total, e.g. new T[0x10001, 0xFFFF].

https://github.com/dotnet/runtime/blob/a6dbb800a47735bde43187350fd3aff4071c7f9c/src/coreclr/vm/gchelpers.cpp#L581 https://github.com/dotnet/runtime/blob/a6dbb800a47735bde43187350fd3aff4071c7f9c/src/coreclr/vm/gchelpers.cpp#L598-L600

MineCake147E commented 1 year ago

Your L1 is slower than accessing register, your L2 is slower than that, your L3 is slower than that and may have sharing considerations with other core complexes/dies, and your RAM is even slower than that.

Even in DDR3-1600, a single channel can transfer multiple GBs of data per second, or sometimes even faster with sequential access. Most computers have 2 memory channels, but sometimes even more. Even if the memory bandwidth were a main bottleneck, you can theoretically process billions of elements in a single second. At that speed, lower-end CPU would become a major bottleneck instead. Introducing some data-structure overhead makes it even worse.

En3Tho commented 1 year ago

Although I agree with @tannergooding on using optimized algorithms and that huge array allocations might be in fact not beneficial at all, as a user I still wish for it to just work. It's a cool feature of .Net: it has so much out of the box that just works.

If this is a valid reason for people to drop porting or implementing ML libraries then it's a thing that needs to be addressed.

The user at least should have an ability to just Google "how to allocate a huge array in .Net" and have a good solid answer in the top search results.

Obviously a dedicated type for this kind of thing would be a best choice I guess.

kasthack commented 1 year ago

However, when you switch over to working with large datasets, the more naive algorithms start showing their weakness and can quickly become bottlenecked on memory

This thread is dedicated to 64-bit arrays, but 64-bit length issue extends to other collections as well, and .NET doesn't provide 64-bit-compatible interfaces, so one can't make their own optimized implementation without breaking interoperability with the existing code. ICollection<T>.Count is 32-bit, as well as IList[int]. While modifying Array can be problematic, interfaces absolutely need change to give the users some option that doesn't require reimplementing half of the BCL just to work with large collections.
While naive algorithms may have performance issues with the larger datasets,
- they work and they are available now. Is outright denying the users to apply them instead of adding a remark to the documentation the right solution? Performance-conscious teams are going to benchmark the code and use custom implementations if needed, while the rest can have code that works instead of having to reinvent the wheel.
- they are still thoroughly tested for correctness and finely tuned which is a rather high baseline. An average developer that just needs things done isn't going to beat that.
- the cost of implementing tailored algorithms instead of using simple ones frequently greatly outweighs hardware costs, especially for one-off / infrequently run tasks. A reserved 2 TiB RAM VM costs $4/hour on Azure and AWS. That's 1-5 minutes of developer work in the first world.

providing support for very large allocations is providing a bad solution for the types of scenarios that might need to work with such data and it will ultimately hurt the ecosystem more than it will help it.

ML libraries need large arrays anyway, and currently the developers have to resort to using native memory and pointers, which makes things more complicated than they should be and actually creates tech debt.

tannergooding commented 1 year ago

Abstractly speaking, the problem presented in this thread is "dev's are having a hard time working with big data". There is then a statement being made that this is because System.Array (and most collection types) is limited to 2 billion elements.

It is ultimately the .NET teams responsibility to take a view of the problem space and to find the right solution; not simply to give users exactly what they asked for. Often times what's being asked for and what's provided do line up; other times they don't and what users thought they needed isn't quite right. Determining that requires investigation, discussion, and cooperation between both parties.

So what I'm presenting is that fixing the abstract problem by providing 64-bit arrays may indeed address things. However, providing exactly that has numerous complexities, likely doesn't solve the underlying issue, and may introduce several new problems that may make the ecosystem overall worse off.

I believe that instead, it would be more beneficial (both short and long term) to look at ways .NET can make it easier to work with big data in a "better" way. For example, what could be done to make it easy for users to chunk/stream their data while making it appear to the user as if it were contiguous (for both reading and writing)?

By looking at how we could address the abstract problem presented and in a way that ensures users good a good experience when working with large data. Not only do we solve the issue, but we do it in a way that benefits everyone and simply makes .NET that much better.

interfaces absolutely need change to give the users some option that doesn't require reimplementing half of the BCL just to work with large collections.

This is one of those "underlying issues" with exposing big arrays. If you extend System.Array to support 64-bit lengths, you now have to go and touch every other type in the BCL (and eventually across the ecosystem) to also support 64-bits (that's a lot of code to audit since every for (int i = 0; i < values.Length; i++) is now potentially incorrect). You still have binary compat to consider and so the existing 32-bit APIs will remain which means that users, especially the average user who doesn't need to work with big data now has to deal with the potential for 20 years off existing APIs and for loops to now completely change how they're being written/handled.

Because of back-compat, we can't simply change things to have the "new signature". At best, you get things like ICollection having a default interface method: long LongCount => Count; (or nint NativeCount => Count;). APIs like the indexer on IList would end up having to default to a checked conversion and thus throwing/having higher implicit cost until derived types are updated.

Developers creating large arrays then don't have a way to determine, at a glance, whether a given API that takes T[], Span<T>, List<T>, or any other collection type actually supports the long lengths or if it will throw. This will lead to a very large amount of friction in the ecosystem and pain for developers.

Most of these problems are addressed by providing a separate type NativeArray/LargeArray/LongArray/WhateverIsTheBestNameArray. This, however, also then requires corresponding NativeSpan<T>, NativeList<T> and so forth. APIs then need new overloads that take these types and while a NativeSpan<T> would simplify a lot of this since most things could defer to that implementation, it would still take time to migrate.

That being said, you're then still left with this not being an ideal way to work with large data. You're left with the fact that pushing users to work with data like this will lead to other bugs around perf that can't be fixed by simply implementing better SIMD based algorithms. You're left with bugs that will come in around the working set of the app being too high. You're left with enterprises raising concerns around the cost of training their models using .NET in the cloud and people raising concerns around power consumption/carbon footprint.

So to me:

Extending existing types is convenient, but the worst option for the ecosystem
Exposing new Native* types that mirror the existing types is a much better solution, but comes with its own problems

And so, investigating if we can solve the problem by making it easier to work with big data in a "better" way first is the right next step. It may be that we ultimately determine one of the other two is the "right" direction. But we can't simply do that without first trying to address the real problem that exists.

Xyncgas commented 1 year ago

Might as well remove array size limit, leave abstract array that behaves like something that can be enumerated for any amount of elements inside and can write to it like a stream and an array, and depending on how it's use .NET selects the optimized underlying mechanism for the array

MineCake147E commented 1 year ago

Extending existing types is convenient, but the worst option for the ecosystem

I agree with this.

That being said, you're then still left with this not being an ideal way to work with large data.

I don't agree with this.

It doesn't justify the lack of solution for native interop needs. Someone will end up needing Native(ReadOnly)?(Span|Memory)<T> anyway.
Perfect-looking solution isn't always good for everyone. Users should have the right to choose.
- The Fibonacci Heap is known to be efficient for large data, but it can't handle small data really efficiently.
Some data structures are built on top of a huge contiguous array.
Parallel processing doesn't always require the sources and destinations to be completely separated for each threads.
SIMD programs often benefit from contiguity. ReadOnlySequence-like solution could easily be a nightmare for me.
I personally think that the users should be responsible to optimize their algorithms anyway.

APIs then need new overloads that take these types and while a NativeSpan<T> would simplify a lot of this since most things could defer to that implementation, it would still take time to migrate.

I think that not all collections should migrate to NativeArray<T>. I need Native* to just exist. I could do my own job with Native* if they exist.

At least for gamers, Steam Hardware Survey shows that the vast majority is still 16GB with 32GB being marginally ahead of 8GB.

How about datacenters and supercomputers?

SupinePandora43 commented 1 year ago

And so, investigating if we can solve the problem by making it easier to work with big data in a "better" way first is the right next step. It may be that we ultimately determine one of the other two is the "right" direction. But we can't simply do that without first trying to address the real problem that exists.

To work with large, potentially unbounded data, we can replace Encoding.GetBytes(ReadOnlySpan<char> chars, Span<byte> bytes) with something like Encoding.GetBytes(Stream charInput, Stream byteOutput). This will reduce performance, but will allow users to work with large data sets.

Though it doesn't have to be in .NET standard library, it can be a separate nuget package with extensions.

KalleOlaviNiemitalo commented 1 year ago

Someone will end up needing Native(ReadOnly)?(Span|Memory)\ anyway.

As of https://github.com/dotnet/runtime/pull/71498, Span\ is defined using a ref field and no longer uses the internal ByReference\, so a third party could now define NativeSpan\<T> etc. and publish to NuGet.

KalleOlaviNiemitalo commented 1 year ago

Encoding.GetBytes(Stream charInput, Stream byteOutput)

That can already be implemented as an extension method on top of Encoding.GetEncoder or Encoding.CreateTranscodingStream.

tannergooding commented 1 year ago

Notably I don't think the conversation is going to go much further. I've stated my piece of mind and have context into how these things work at the extreme scale and the local scale both. As well as how things like SIMD and perf oriented algorithms are best able to take advantage of memory; what happens when data is non-linear, etc.

I expect there's nothing I could say at this point to convince the remaining people that large arrays aren't really what people need. I've responded to a couple key points called out above, but will likely not engage in the thread much further.

SIMD programs often benefit from contiguity. ReadOnlySequence-like solution could easily be a nightmare for me.

SIMD programs benefit from linear data due to SIMD loading and operating on multiple pieces of data per instruction. Something like ROSequence doesn't change or remove that as it's not like you'd end up with a ROSequence that is 1-16 bytes per.

You'd reasonably chunk the data, such as into 2MB buffers (but you could make it larger or smaller based on needs). This allows you to still take advantage of features like non-temporal loads/stores, SIMD, and cache coherency. Without requiring too much data to be in linear memory at one time. If you really wanted to, you could break it into 2 billion element chunks and have it managed as at minimum 2GB sequences.

Once you get past a certain point (roughly 256kb), the "overhead" of chunking (which is effectively an operation over uncached memory at each chunk boundary) essentially becomes noise. That noise can be entirely mitigated with selective use of pre-fetch instructions as well.

How about datacenters and supercomputers?

These are both examples of systems that explicitly require chunking of their data.

These aren't single CPU systems. They are many CPU (and often many distinct computers) that are interconnected and managed by higher level software. They explicitly have to take into account the fact that memory is "non-coherent" and "non-uniform" (NUMA) and explicitly have to distribute the work accordingly.

They don't simply have a petabyte (or more) of memory available. Rather, they have a total of that much memory across all racks. Each rack holds n blades (effectively individual motherboards). Each blade holds 1 or 2 physical CPUs. Each physical CPU has x amount of memory.

For example, the Frontier supercomputer is 74 racks. Each rack is 64 blades (4736 total blades). Each blade is 2 nodes (9472 total nodes). Each node is 1 CPU w/ 4TB of RAM and 4 GPUs w/ 128GB of RAM each (9472 total CPUs w/ 37PB of RAM, 37888 GPUs with 473TB of RAM).

Blades are connected by high speed network switches and to access something in the RAM on another blade, you have to transfer it across the local area network to the address space of the local system. The OS for supercomputers has specialized software to help manage this and programs running on such computers are specially written to take advantage of that.

You cannot and do not treat the data the same as you would on a single CPU system because such setups are explicitly distributed.

GPSnoopy commented 1 year ago

It has been 3 years and a half since I originally opened this issue, and I'm fascinated that some of the comments in here are hard at work trying to justify why 64-bits indexable arrays are not needed. As pointed out by others, all .NET collections are limited to 2^31 elements, not just arrays. In fact, most of these collections are internally implemented on top of arrays. An array is a fundamental data structure in computer science, and it is one of the most optimal for many scenarios.

The world population is around 8 billion. 2^31 elements is not an outlandish, out-of-this-world number. This is not a discussion in a vacuum, most languages got this right 20-30 years ago. .NET made the initial mistake of ignoring years of experience in other languages and platforms, sticking to int32 instead of nuint for its collections. It is a bit insulting to both the people who design/use other languages, and to people who use .NET, to try so hard to convince them 64-bit arrays are not needed.

Few random facts from my personal experience:

In 2004, I was using an AMD Opteron (Athlon 64) to rescale and tile a 64K x 64K Earth NASA image for a 3D engine. This is more than 4 billion pixels. This was almost 20 years ago.
As of today, my laptop has 64GB of RAM. My personal desktop has 128GB.
As pointed by others, AWS (and other cloud providers) instances have multiples of these capacities.
AI models typically do use large arrays, as this is the optimal representation for dense matrices of neural network parameters. A lot of AI models still struggle to fit inside GPU memory of 40-80GB. People who work on AI models always want more.

Looking forward to coming back to this thread in another 42 months to witness stranger and even more extreme contorted mental gymnastics in comments explaining why 64-bit arrays are really not needed. In the meantime, I'll go back to C++, Rust, Python, or any other language that allows me to efficiently and quickly solve simple problems on ~large~ normal data.

neon-sunset commented 1 year ago

Just to add a little more context:

Languages that use machine-word length for array/collection primitives:

C++ (nuint)
Rust (nuint)
Go (nint)
Swift (nint)
Python (nuint with caveats)

Languages that use 32bit length for array/collection primitives or are limited to int.MaxLength:

C#, F#
Java, Kotlin, other JVM-based languages
JavaScript, TypeScript
PHP

Arguably, C# ambitions now lie much more in line with the first group of languages (aside from Python).

With that said, the following points stated in this discussion do make sense:

T[] arrays where T is object longer than int.MaxValue are problematic and can be taxing on GC
T[].Length, ICollection<T>.Count and this[int index] cannot be changed away from int because of backwards compatibility and no good tradeoff has been proposed so far that could address this comprehensively

What does not make sense or seems inappropriate for a performance-oriented general-purpose language like C#:

ReadOnlySequence-like and Stream-based APIs as a good workaround for the int-edness of Spans (this is not true)
Various forms of justifications how addressing with lengths and indexes above int.MaxValue can be problematic (this is not true either and has never been an issue in the languages in the first group)

A sentiment raised multiple times in the past was that, going forward, this will continue becoming more of a problem and the constraint will start getting hit by domains that aren't limited to large amounts of data like ML. Therefore, for the issue to get addressed in time, it is likely to be more productive for a consideration to be taken into account when designing new or revamping existing APIs today and not doing so might inflict damage to C# long-term viability as a general-purpose programming language.

GPSnoopy commented 1 year ago

@neon-sunset I agree.

One comment though:

T[] arrays where T is object longer than int.MaxValue are problematic and can be taxing on GC

They are no more, or less, problematic than any other Large Object Heap allocation, which starts at 85,000 bytes. I think that the problems of allocating large arrays on the LOH have been overblown in prior comments. Go and try for yourself, allocate 2 billions float64 in a single array (16GB) and do some simple math on it. As long as you have enough RAM, it will just work as you'd expect it to.

tannergooding commented 1 year ago

They are no more, or less, problematic than any other Large Object Heap allocation, which starts at 85,000 bytes

For unmanaged data, yes. For anything that includes managed data, then it increases as the total number of items increases due to additional references.

Various forms of justifications how addressing with lengths and indexes above int.MaxValue can be problematic

Certainly. Using nuint to index or as a length isn't problematic. It actually can be more efficient because it removes the need to zero/sign-extend things and we sometimes do tricks in the BCL to allow similar to happen in our own code.

The problem space comes in when interacting with and considering the 20 years of existing code that only expect int based lengths. As an example, almost every for loop based iterator could introduce problems without explicit changes to source. This is because they are written as for (int i = 0; i < array.Length; i++). Therefore, if the actual length is more then you will now throw on the first iteration (because Length has to do a checked truncation down to int, throwing on overflow). This makes it non pay-for-play and introduces a perf penalty for the common case and for existing code.

What does not make sense or seems inappropriate for a performance-oriented general-purpose language like C#:

ReadOnlySequence-like and Stream-based APIs as a good workaround for the int-edness of Spans (this is not true).

The consideration isn't that its a workaround, but rather it gives users access to the data as if it were linear, while allowing it to be non-linear behind the scenes.

As a challenge. Someone should feel free to write a scenario doing the "naive" thing described here. That is, use unsafe code to allocate a multi-GB linear allocation, and do some processing on it. Feel free to use SIMD. Feel free to make it do parallel processing across tasks/threads, etc.

I'd be happy to then take that and show you can achieve the same thing with non-linear allocations and have it perform just as well or faster. -- Nothing requires the data to be linear and when you start getting to working with big data, making it non-linear can often be crucial to allowing additional performance and better overall handling of the data.

A sentiment raised multiple times in the past was that, going forward, this will continue becoming more of a problem and the constraint will start getting hit by domains that aren't limited to large amounts of data like ML.

The sentiment can be boiled down to frustrations working with big data because .NET doesn't make it trivial today.

I stand by my own sentiment that this does not require truly linear allocations and reiterate the challenge I gave. The premise that you require singular allocations that are over 2 billion items in length doesn't hold up. Rather you simply need a way to efficiently pass around the total sum of allocations as if it were linear and to minimally account for this chunking in your actual algorithm.

MineCake147E commented 1 year ago

They don't simply have a petabyte (or more) of memory available. Rather, they have a total of that much memory across all racks.

But they still have more than 16GB of RAM per node as you explained later as:

For example, the Frontier supercomputer is 74 racks. Each rack is 64 blades (4736 total blades). Each blade is 2 nodes (9472 total nodes). Each node is 1 CPU w/ 4TB of RAM and 4 GPUs w/ 128GB of RAM each (9472 total CPUs w/ 37PB of RAM, 37888 GPUs with 473TB of RAM).

Supercomputers aren't always working as all nodes combined together. For example, Fugaku supercomputer charges a fee per Node-Hour.

You'd reasonably chunk the data, such as into 2MB buffers (but you could make it larger or smaller based on needs).

That's true for data processing pipelines that work in a pass-and-forget basis. However, some simulation workloads and FFT largely benefit from a linear array. If you chunk the data for such workloads, the algorithm goes much crazier than you think. It's because they read and write many data scattered across memory. That's why the gather/scatter instructions exist, and they typically have an argument for base address. It means the data shouldn't be scattered in multiple arrays, otherwise it'll be a nightmare for the programmer.

TheGuy920 commented 1 year ago

What if you just put long in front of the array definition so its like a double type definition (but not). As someone who knows very little about anything, this seems very intuitive to me. Then the whole backwards compatibility is not an issue. It would be the same as a LongArray<T> class but written as long T[] (because classes are stinky). Then BOOM. you just put long in front of any existing C# collections and you gets its long counterpart. Like I said I know very little about anything and it feels weird trying to provide feedback on a complex topic like this but I'm going to do it anyhow.

public long int[] my64bitarray;

.Length would be a long (not the same type as a normal array, so different properties and fields)

CurtHagenlocher commented 10 months ago

this does not require truly linear allocations

Echoing @MineCake147E, a linear chunk of memory is significantly easier to work with in many scenarios -- including interop with existing libraries. The specific case I'm interested in is Apache Arrow, where implementing C# support for some of the Arrow types is complicated significantly by lack of such support. A related case is the ability to reference parts of a memory-mapped file -- note that MemoryMappedViewAccessor already supports 64-bit sizes.

I do agree, though, that being able to allocate very large arrays on the managed heap and be able to treat them as arrays is a relatively unimportant part of the problem. The minimum bar is being able to allocate memory from the unmanaged heap and reference it from C# without gymnastics. This requirement could be met by defining the set of types LargeMemory<T>, LargeReadOnlyMemory<T>, LargeSpan<T>, LargeReadOnlySpan<T>, LargeMemoryManager<T> and ILargeMemoryOwner<T> to be the equivalent of their existing versions except using 64-bit offsets and lengths instead of 32-bit values. Doing so would not preclude implementing support for larger arrays at a future date.

(As I understand it, nothing stops me from defining these types today for private use -- but having them defined as part of the standard runtime is clearly better code sharing and reuse.)

Neme12 commented 7 months ago

If the change were made to T[] directly (rather than LargeArray<T> or similar), assemblies would need to mark themselves with something indicating "I support this concept!", and the loader would probably want to block / warn when such assemblies are loaded.

I don't think there's a reason for this. At most a warning. There's a reason that .NET Core started allowing referencing .NET Framework assemblies even though it was disallowed for some time because of compatibility concerns. You just can't update external dependencies, and there's a good chance they might work, so why not let the developer try it out. Also, NET Core takes breaking changes every release, and frameworks like ASP.NET Core or EF Core even much bigger breaking changes, and despite that, they've never disallowed referencing assemblies that were written for the previous version because of compatibility concerns. Not even a warning.

Neme12 commented 7 months ago

As of today, my laptop has 64GB of RAM. My personal desktop has 128GB.

quickly solve simple problems on ~large~ normal data

Please don't generalize so much. Not everyone can afford machines on the high end. It's still true today that if your app takes up more than 1GB of RAM, a huge chunk of users won't be able to use it and will file bugs.

And even if the total size of someone's RAM is high, that doesn't mean any individual app can afford nearly as much of it. On my machine, each individual app always takes less than 1GB but (for some reason), more than half of my 16GB RAM is always used up even with only 3 apps running.

Neme12 commented 7 months ago

I believe that instead, it would be more beneficial (both short and long term) to look at ways .NET can make it easier to work with big data in a "better" way. For example, what could be done to make it easy for users to chunk/stream their data while making it appear to the user as if it were contiguous (for both reading and writing)?

I agree that this is worthwhile. But this has certain downsides as well. Think about how many different collection types there are, even inside .NET itself, for varying kinds of scenarios. If you had a data structure that would allow working with it as if it's contiguous but isn't, that would certainly be nice, but then you'd be limited to this one data structure (probably equivalent to an array), unless you wanted to duplicate all of the existing rich collection types for big storage. You wouldn't even have a separate String type, or List, as distinct from an array. So it would still be a subpar experience in this way, even if that one data structure was perfect at its intended scenario.

Neme12 commented 7 months ago

Now Photoshop would become a "scientific edition" application, which means that all of its plugins would also have to become so. The end effect is that in one action, it invalidates the entire rich ecosystem of plugins. You'll have to go to all of your plugin authors and ask them to produce new builds for you if you want to continue consuming them. (And this doesn't even touch on what a time sink it would be for an app as large as Photoshop - with likely hundreds of third-party dependencies - to make sure that all of its dependencies are also "scientific edition" enlightened.)

I don't think this is such a big deal as it is made out to be. Not that long ago, every new version of MSVC had ABI breaks, meaning you had to recompile everything that's part of your app, and the world didn't fall apart. Now they haven't done it in a while, but they're considering doing it again as a new major release to fix issues that can't be fixed otherwise. Even Visual Studio 2022 made every extension have to be recompiled (and source changed) for x64, and it seems to be doing just fine, even though they "invalidated the entire rich ecosystem of plugins" as you say.

I think .NET is (still) way too afraid of breaking changes. Major releases exist for a reason. IMHO it's a shame that .NET went through such a huge breaking change rebuilding itself as .NET Core and going through many incompatibilities and with so many application frameworks that are simply gone in the new .NET, yet it still didn't address any foundational issues like this :/

Neme12 commented 7 months ago

This also limits allowable interfaces to pretty much IEnumerable.

I don't think I could work with that in practice. Only IEnumerable<T> and no other interfaces? It goes against what I've been doing in a many places in many .cs files for years.

Why would getting rid of array covariance limit interfaces to only IEnumerable<T>? Couldn't it still implement readonly interfaces like IReadOnlyCollection<T> and IReadOnlyList<T> and still have covariance with them since the interfaces themselves are readonly & covariant? After all, ReadOnlySpan<T> is covariant without any runtime checks, since it's readonly? And couldn't it even implement non-readonly interfaces, because they're not covariant and you couldn't change the array to an array of a different type by using them?

MineCake147E commented 7 months ago

It's still true today that if your app takes up more than 1GB of RAM, a huge chunk of users won't be able to use it and will file bugs.

And even if the total size of someone's RAM is high, that doesn't mean any individual app can afford nearly as much of it. On my machine, each individual app always takes less than 1GB but (for some reason), more than half of my 16GB RAM is always used up even with only 3 apps running.

Users always need the right equipments to run a certain software anyway. Application-specific machines, such as supercomputers, servers, workstations, enthusiast HEDT PCs, and gaming PCs, are all tailored for the set of softwares the users want to use.

tannergooding commented 7 months ago

Users always need the right equipments to run a certain software anyway. Application-specific machines, such as supercomputers, servers, workstations, enthusiast HEDT PCs, and gaming PCs, are all tailored for the set of softwares the users want to use.

There is a large difference between specialized software and general-purpose software and many of the callouts you gave are still categorized under "general purpose".

Even for games, the "typical" gaming PC (based on Steam Hardware Survey, but also account for consoles and other reporting sites) is a 6 core, 16GB PC running something like an RTX 3060. Machines running 4 cores, 8GB of RAM, and older cards like a GTX 1060 aren't that far behind in usage either, nor are many mobile chipsets.

But, even if you do have a more powerful machine, a typical user is still sharing that machine with other apps, services, and still doesn't want to go over around 80% consumption of the max (due to risk of causing thrashing and contention issues).

you had to recompile everything that's part of your app, and the world didn't fall apart

This world did semi-regularly have issues, often large issues, which app developers had to deal with. Almost every single platform (C, C++, Java, Rust, C#, Python, etc) has been moving towards making their ABIs more stable.

they're considering doing it again as a new major release to fix issues that can't be fixed otherwise

Right, sometimes breaks are necessary; especially in the AOT world. But they are being made intentional and targeted to reduce risk and pain.

I think .NET is (still) way too afraid of breaking changes. Major releases exist for a reason. IMHO it's a shame that .NET went through such a huge breaking change rebuilding itself as .NET Core and going through many incompatibilities and with so many application frameworks that are simply gone in the new .NET, yet it still didn't address any foundational issues like this

There's two main considerations here.

First, you need to show that this is an actual foundational issue. The general argument I've given above is that while having 64-bit arrays would be convenient, it is not actually a foundational issue and in many ways not an issue at all due to the alternatives that exist and the real world considerations that an application or library working with such large data would fundamentally have to make to ensure that it remains usable.

The second is that .NET Core 1.0 did try to make some big breaks and we found out by .NET Core 2.1 that many of them were a mistake and caused too much pain, so much so that we brought most APIs back and made the ecosystem almost whole/consistent again by .NET Core 3.1. The few exceptions were scenarios that were used so little and which were so fundamentally problematic, that bringing them forward as is was a non-starter and where we've since provided or are working on providing alternatives instead.

So it would still be a subpar experience in this way,

Yes, but working with large data is a different experience altogether regardless. You can't just take a data structure that was designed to work with thousands of elements and expect it to behave well when working with billions or more elements. The inverse is also true.

Any application that required working with many gigabytes or more of data and decided to do that by simply writing their code the same way as they would for kilobytes of data would end up digging it's own pitfalls. We don't live in an idealized world, costs don't actually scale linearly, not every operation is equivalent. We have many tiers of costs and we have limits on operations and how much can be pushed through a given pipeline. An application that might saturate those pipelines has to rethink how its working so it doesn't become bottlenecked. It likewise needs to aim to balance across all pipelines to ensure they all achieve suitable levels of use (neither maxing them out or underutilizing them).

.NET arrays already allow you to work with gigabytes of data, up to almost 2.14 billion elements (regardless of the size per element). Once you start approaching this limit, you're already at the point where you probably should've started restructuring and rethinking how your application work, even if your machine has terabytes of RAM available to it.

MineCake147E commented 7 months ago

Any application that required working with many gigabytes or more of data and decided to do that by simply writing their code the same way as they would for kilobytes of data would end up digging it's own pitfalls.

It's certainly true that applications dealing with substantial amounts of data need careful consideration in their design and implementation. While it might seem tempting to simply write code as if handling kilobytes of data when dealing with gigabytes or more, it's crucial to recognize the potential pitfalls such an approach can bring.

However, rather than solely attributing responsibility to the .NET Runtime for addressing users' mistakes, it's worth considering a more balanced perspective. While the runtime can offer certain safeguards and guidance, it's ultimately the responsibility of the developers to ensure their applications are designed and optimized appropriately for handling large datasets.

Mistakes and challenges in dealing with large-scale data can indeed be valuable learning experiences for programmers. By encountering and addressing issues firsthand, developers gain insights into what went wrong and how to improve their approaches in the future. Improving programmers' comprehension of effective data management strategies is possible irrespective of the array allocation length limit; however, arbitrary constraints might impede this learning journey.

Furthermore, while partitioning memory allocation may seem advantageous, they come with their own set of drawbacks. These include costly random access, complex operations, debugging and testing challenges, additional overhead in C# code, and cumbersome software cross-boundary vector data loading and storing. Is the tradeoff always worth it? If a simpler solution existed, allowing for the allocation of contiguous managed memory regions with more than 2 billion elements, many of these complexities would dissipate. ~~Handling large linear arrays could also potentially benefit garbage collection efficiency, simplifying management compared to dealing with numerous smaller arrays.~~

First, you need to show that this is an actual foundational issue.

~~Imagine if Array.MaxLength were restricted to 32767. Historically, computing faced similar constraints due to limitations in hardware, such as 16-bit architectures. But as technology evolved, so did the need for larger data structures, leading to the adoption of 32-bit and eventually 64-bit systems.~~

The crux of the matter isn't merely the fixed value of 2,147,483,591 for Array.MaxLength. Instead, it's the widespread reliance on int for Length properties and indices throughout the .NET ecosystem, rather than embracing nuint, akin to C++'s size_t. In hindsight, the preference for nuint seems more logical, given its implementation-defined size, aligned with pointers.

While implementing radical changes at this stage could entail significant disruptions, there's merit in introducing new types such as NativeArray<T>, NativeSpan<T>, ReadOnlyNativeSpan<T>, NativeMemory<T>, and ReadOnlyNativeMemory<T>, all featuring nuint Length { get; }. Moreover, NativeArray<T>, NativeSpan<T>, and ReadOnlyNativeSpan<T> could also have indexers accepting nuint indices. As a positive side effect, this approach could enhance the management experience of large unmanaged memory regions.

tannergooding commented 7 months ago

While the runtime can offer certain safeguards and guidance, it's ultimately the responsibility of the developers to ensure their applications are designed and optimized appropriately for handling large datasets.

Yes. However, if .NET already doesn't support the thing developers shouldn't be doing (in this case for historical reasons) then there is little to no benefit in .NET doing the massively complex work to support it because we'd be adding support for something that no real world scenario should be using in the first place.

The discussion basically stops there as there is no justification for adding support for something which no one should ever be using and for which if users "really" decide they need it, there are plenty of viable workarounds.

If developers need to learn, they can just as easy find this issue and read some of the feedback. Developers coming in and saying "we want this anyways, even if its the wrong thing" just adds confusion to the topic and hinders that learning process.

Handling large linear arrays could also potentially benefit garbage collection efficiency, simplifying management compared to dealing with numerous smaller arrays.

That's really not how that works. The GC and memory allocators in general are optimized for the common cases. They tend to have more overhead, more restrictions, and generally behave worse when introduced to uncommon patterns or extremely large allocations.

Today, the GC considers anything over 85KB as "large" and places it on a special heap (LOH) which is almost never compacted or moved. When such large allocations are over reference types, they have a lot more data that needs to be processed to determine liveness and the overhead of that many objects is going to be substantially more than almost anything else.

Even if the support did exist, it wouldn't get optimized because it wouldn't be a common case. It would never become common because users would start trying it and see that the perf was awful and they'd end up having to research and find out they shouldn't be doing this stuff in the first place. They then wouldn't use the feature as they'd see they should be doing other things instead.

Imagine if Array.MaxLength were restricted to 32767.

This is a straw person argument. Computers being restricted to 16-bits would indeed be limiting, but we aren't restricted to 16-bits. We're restricted to a reasonably large boundary (2.14 billion elements, which could take up more than 32-bits of memory space) well above and beyond what hardware supports for its built in systems and the amounts they are designed to be efficient around.

What you're stating is like arguing that because we went from 8-bit to 16-bit, 16-bit to 32-bit, and 32-bit to 64-bit address spaces in the past 40 years that we will need to go from 64-bit to 128-bit in the next 100. This is the type of thing that a person might naively state without considering that this is an exponential growth and that 2^64 is such an astronomically large number that a single computer cannot actually hold that much RAM without us fundamentally breaking how the hardware is manufactured and even beyond that, essentially breaking our current understanding of physics.

To actually work with that much memory, you need many machines. Once you have many machines, you have non-coherent memory. Once you have non-coherent memory, you have to fundamentally change how you work with the data for both correctness and performance. Distributed computing then typically has a customized Operating System, specialized infrastructure and networking layouts, and unique data management systems to ensure that data can be correctly and efficiently moved about without hitting bottlenecks. This is true for super computers, IoT compute clusters, server setups for datacenters, etc.

The same general considerations tend to happen when you get beyond 2 billion elements. Once you start getting that big, you can't actually reliably make the allocations unless you limit the hardware it runs against. Even if you restrict the hardware you run against, you're going to ultimately hurt the overall performance and considerations of the application because you aren't the only process running. The cost of accessing any piece of memory isn't linear, it starts getting bottlenecked by the increasingly smaller caches and prefetch mechanisms (so much so that in many cases you want to bypass these caches by using non-temporal accesses). You also start getting into compute times where you fundamentally have to start parallelizing the computation.

All of these changes end up meaning that the way the data needs to be worked with has changed as does how you need to manage the data. You have a need to chunk the data into many separate views that can be independently accessed by separate cores and then independently freed, paged out, etc as the data gets processed. You have to balance the management of that data with the processing time of that data to ensure that the CPU stays reasonably saturated without oversaturating it (typically about 80% utilization is the ideal).

You may start considering the need for GPU compute, where a single GPU may peak around 24GB of RAM. Where the GPU itself may be limited in how much they can transfer at one time (until relatively recently this was capped at 256MB). Where you actually have to copy data from one memory system to another and where that memory is non-coherent. Where the allocations and the amount a shader can access at any one time may itself be limited, etc.

None of this is a problem unique to .NET. It's just the world of working with "big" data and where "big" is a lot smaller than 2 billion elements.

Instead, it's the widespread reliance on int for Length properties and indices throughout the .NET ecosystem, rather than embracing nuint, akin to C++'s size_t.

This is also isn't really a good argument and there are many reasons why using int is "better". Even if there were support for some NativeArray type, it's entirely likely that the indexer would be nint (not nuint) as there are benefits to using a signed type and many drawbacks to using an unsigned type. There are some conceptual reasons for why nuint is "better", but in practice it tends to lose out and there are many places you may fundamentally require signed anyways.

Types like T[] itself already support nuint based indexers, even with the length being limited to 31-bits. Such support could be added to Span<T> if we saw enough benefit to doing so. However, the JIT is already optimized around the fact that its a 31-bit always positive length and will typically do the relevant hoisting and other opts to ensure the indexing remains just as efficient regardless. This also allows the general codegen to be smaller and more efficient, as there is an encoding and often perf optimization to using 32-bit registers (over 64-bit registers) on many platforms.

MineCake147E commented 7 months ago

This is a straw person argument.

Oops. I didn't know that it is until you noticed it. Sorry for that.

GPSnoopy commented 7 months ago

Just to cross check a few facts, plus some comments.

What you're stating is like arguing that because we went from 8-bit to 16-bit, 16-bit to 32-bit, and 32-bit to 64-bit address spaces in the past 40 years that we will need to go from 64-bit to 128-bit in the next 100. This is the type of thing that a person might naively state without considering that this is an exponential growth and that 2^64 is such an astronomically large number that a single computer cannot actually hold that much RAM without us fundamentally breaking how the hardware is manufactured and even beyond that, essentially breaking our current understanding of physics.

AMD Zen 4 increased its virtual address space from 48-bit to 57-bit. Wikipedia indicates Intel did the same with Ice Lake in 2019 (https://en.wikipedia.org/wiki/Intel_5-level_paging).

Intel 386 (the first x86 32-bit CPU) was introduced in 1985. AMD Opteron (the first 64-bit CPU) was introduced in 2003. This is 18 years later, not 40. While there is no immediate need for 128-bit physical addressing. Actual support for virtual addressing using more than 64-bit is not as far as you think.

In the meantime, the typical amount of RAM per CPU (i.e. UMA) on a server is 768GB-1.5TB. AWS has NUMA instances that go up to 24TB.

The cost of accessing any piece of memory isn't linear, it starts getting bottlenecked by the increasingly smaller caches and prefetch mechanisms (so much so that in many cases you want to bypass these caches by using non-temporal accesses). You also start getting into compute times where you fundamentally have to start parallelizing the computation.

You imply too much on how one might use large arrays. The random-access and linear access cost of any array are the same for any size. Cache locality for random access on large arrays depends heavily on the algorithm and its data locality, rather than the fact that a large array has been allocated in one go. Cache locality goes out of the window if you do a linear access on an array that does not fit in cache (even if you split the array into smaller ones). Not sure why you want to deny the ability for people to use large arrays.

You may start considering the need for GPU compute, where a single GPU may peak around 24GB of RAM. Where the GPU itself may be limited in how much they can transfer at one time (until relatively recently this was capped at 256MB). Where you actually have to copy data from one memory system to another and where that memory is non-coherent. Where the allocations and the amount a shader can access at any one time may itself be limited, etc.

Last night, NVIDIA announced the Blackwell GPU which has 192GB of RAM. The previous NVIDIA GPU, the H100, has 80GB of RAM. CUDA has no issue with large data transfer (larger than 256MB) and 64-bit addressing.

None of this is a problem unique to .NET. It's just the world of working with "big" data and where "big" is a lot smaller than 2 billion elements.

You and I do not have the same definition of big data. If I can fit it on a laptop, it's not big.

This is also isn't really a good argument and there are many reasons why using int is "better".

I think the whole technical discussion is moot. @tannergooding is just regurgitating the same arguments as people did in 2003 when AMD64 came out; arguing why it's useless, has too many performance compromises (e.g. pointers being twice the size), and somehow going on a crusade to unilaterally deny its access to programmers/users. In the meantime the world as moved on. Seriously.

The real question is where the actual decision makers at Microsoft see C# and .NET going. My observation (and this is purely my biased opinion) is that Microsoft has nowadays little appetite for .NET (in markets such as system programming, REST servers, large scale services, HPC, GPUs, AI, etc), instead chasing subscription-based revenues and AI services; leaving .NET mostly for "small" applications or services, mobile apps, and some Unity games.

Neme12 commented 7 months ago

Users always need the right equipments to run a certain software anyway. Application-specific machines, such as supercomputers, servers, workstations, enthusiast HEDT PCs, and gaming PCs, are all tailored for the set of softwares the users want to use.

If you agree that this is a specialized scenario for specialized hardware, sure. I was pushing back on gigabytes of memory being called normal data rather than big data.

tannergooding commented 7 months ago

AMD Zen 4 increased its virtual address space from 48-bit to 57-bit.

Yes, and that is still nowhere near the entire 64-bit address space. It is an exponential growth.

Different scales in terms of nanoseconds:

2^16: 65.54 microseconds
2^32: 4.295 seconds
2^48: 3.258 days (78.19 hours)
2^57: 1668 days (238.3 weeks)
2^64: 584.6 years (213504 days)

The reason why I use nanoseconds is because computers currently operate in terms of GHz, 1hz is therefore 1ns.

The fastest CPUs can boost to around 6GHz today and with the assistance of liquid nitrogen we've managed to nearly hit 9GHz. The fastest CPUs can do parallel dispatch of up to around 5 additions per cycle, assuming they are non-dependent.

So assuming we simply wanted to increment 2^64 times, it would take a single core on a modern processor 12.97 years to do so, operating under peak efficiency and never slowing down. If we scale that to the peak number of cores in a single CPU (192) and factor in that they're hyperthreaded and without factoring in the lower clock speeds, then a single modern CPU can iterate 2^64 times in around 12.33 days.

This of course isn't factoring in that memory access is tens to hundreds of times slower than direct instruction execution, that the peak transfer rate of things like DDR5 is 64 GB/s and in practice is much lower due to contention, random access, etc. It also isn't factoring in that we're hitting the limits of how small we can make transistors due to the effects of quantum tunnelling or that there are other fundamental barriers related to heat dissipation, the speed at which electricity can travel through various materials, etc. That while we will continue to see some increases, we require a massive scientific breakthrough and likely a change in how computers are designed to get them substantially faster that what we have today.

The random-access and linear access cost of any array are the same for any size. Cache locality for random access on large arrays depends heavily on the algorithm and its data locality, rather than the fact that a large array has been allocated in one go. Cache locality goes out of the window if you do a linear access on an array that does not fit in cache (even if you split the array into smaller ones). Not sure why you want to deny the ability for people to use large arrays.

The consideration is that many of the necessary data management tasks can no longer be trivially done when it is actually a single allocation. These becomes significantly more trivial if you break it up into appropriate sized chunks and that is just how the data is actually interacted with in systems that are designed to work with large data. They recognize that it is technically a contiguous thing, but that it needs to be appropriately buffered, chunked, streamed, and managed to account for real world hardware limitations.

Last night, NVIDIA announced the Blackwell GPU which has 192GB of RAM. The previous NVIDIA GPU, the H100, has 80GB of RAM.

Yes and this is an extremely high end GPU that is not meant for a typical consumer. The typical GPU is far below 24GB with 24GB being the upper end of what the highest end consumer PCs and even the upper end of what many cloud systems offer.

CUDA has no issue with large data transfer (larger than 256MB) and 64-bit addressing.

There is a difference between the API behind the scenes chunking the data for you and it actually working with greater than 256MB chunks.

The limit of 256MB transfers comes about from the PCIe specification which required an optional feature introduced in 2007 known as reBAR (resizable base address register) to do individual accesses of more than 256MB in size at a time. GPU manufacturers (such as AMD, Intel, and Nvidia) only started offering support for this feature around 3-4 years ago.

You and I do not have the same definition of big data. If I can fit it on a laptop, it's not big.

You're free to have your own definitions for big vs small, but that doesn't negate how general systems think about the data.

To the GC, anything over 85KB is large and goes on a special heap.

To standard implementations of memcpy (provided by MSVC, GCC, Clang, etc) 256KB is typically considered large and is the point at which non-temporal accesses start getting used. Other cutoffs where the algorithms change tend to be around 8K and around 32 bytes. All done to account for typical allocation sizes, overhead of branches to dispatch to the right algorithm, and real world hardware costs for these various sizes.

The same general considerations also apply to the CPU optimization manuals, the implementations of malloc provided by the C Runtime, the memory management APIs provided by the Operating System, etc.

In terms of address space, these are relatively tiny. Even in terms of typical file size (considering modern images or videos), these are relatively tiny amounts. But they are still significant in terms of the scale of the CPU, the hardware itself, and the limits its designed around for efficiency.

I think the whole technical discussion is moot. @tannergooding is just regurgitating the same arguments as people did in 2003 when AMD64 came out; arguing why it's useless, has too many performance compromises (e.g. pointers being twice the size), and somehow going on a crusade to unilaterally deny its access to programmers/users. In the meantime the world as moved on. Seriously.

They are very different arguments. When 64-bit came out, we were already hitting the limits of 32-bit address space, especially across an entire machine. There was a clear need for the growth even if individual applications weren't near these limits and even if typical applications weren't near those limits.

There has been no argument against needing to work with data that is more than 2GB in size nor against there being cases where having a single linear allocation would be convenient.

I have stated that a NativeSpan type might be beneficial. It's something I've pushed for on several occasions. It have stated that there are concrete concerns with a NativeArray type, particularly if that is allowed to contain managed references.I have stated that working with big data is problematic today and it is an area where we could improve things substantially.

The arguments have therefore been around how the considerations for accessing big data change compared to accessing lesser amounts of data, how we are no where near the limits of the 64-bit address space, how we are approaching fundamental limits in how computers operate that are unlikely to change without a fundamental breakthrough in our understanding of physics, and how the effort would be better spent investing in making it easier to work with big data, irrespective of how its been allocated and particularly allowing for many separate allocations so that the code can work across a broader range of machines and considerations (both small and large distributed systems require similar considerations here for big data).

The real question is where the actual decision makers at Microsoft see C# and .NET going. My observation (and this is purely my biased opinion) is that Microsoft has nowadays little appetite for .NET (in markets such as system programming, REST servers, large scale services, HPC, GPUs, AI, etc), instead chasing subscription-based revenues and AI services; leaving .NET mostly for "small" applications or services, mobile apps, and some Unity games.

.NET works just fine with large scale services today and is used by many systems that operate on global scales servicing millions of customers per day, touching petabytes and more worth of data: https://dotnet.microsoft.com/en-us/platform/customers

There are of course even more big customers and scenarios that aren't listed on the showcase page. There are plenty of community projects that are targeting these scenarios, there are efforts by the .NET Libraries team to continue improving our support for things like Tensors, working with large data in general, etc.

.NET has a clear interest in supporting working with big data, but that in part has to come with consideration for backwards compatibility, with consideration for real world usage (not just convenient usage), and with consideration for what will lead users towards the greatest success.

Neme12 commented 7 months ago

While I like the appeal of using nint for size just for the sake of correctness and it bothers me a little that .NET is using int everywhere, I have to concede that changing that and breaking compatibility for relatively little benefit probably isn't worth it. Although it would have been nice if at least Span used nint correctly from the beginning so that you could at least refer to unmanaged memory, which definitely can be larger than int-sized, in a convenient (and consistent) way.

In the meantime, the typical amount of RAM per CPU (i.e. UMA) on a server is 768GB-1.5TB. AWS has NUMA instances that go up to 24TB.

That's true, but:

All that memory is there to be able to run tons and tons of apps at once, not for just one app to use up most of it.
Your app can still use all that memory, you just can't allocate a contiguous block of (managed) memory that is that big, which is a much smaller limitation than if you couldn't use that memory at all.

Neme12 commented 7 months ago

This also allows the general codegen to be smaller and more efficient, as there is an encoding and often perf optimization to using 32-bit registers (over 64-bit registers) on many platforms.

I thought using a native size would be more efficient, because the JIT has to widen an int to native size for array access? Or doesn't it?

tannergooding commented 7 months ago

I thought using a native size would be more efficient, because the JIT has to widen an int to native size for array access? Or doesn't it?

TL;DR: Not really. It's implicit as part of 32-bit operations.

For RISC based architectures (Arm32/Arm64, Risc-V, LoongArch and others), they typically use a fixed-width encoding and so the actual decoding cost is the same. The execution costs, however, can still differ although this namely applies to operations like multiplication and division, not to operations like addition, subtraction, or shifting. However, because its "fixed-width" generating constants can take multiple instructions and so you may need 3-4 instructions to generate some 64-bit constants.

For CISC based architectures (x86 and x64 namely), they typically use a variable-width encoding. For many reasons, including back compat but also factoring in common access sizes, the smallest encoding works with 32-bit registers and you need an additional prefix byte to do 8, 16, or 64-bit register access. This minorly increases the cost to the decoder and can negatively impact other code by pushing bytes outside the normal decode/icache windows.

For both sets of architectures, it is typical that doing a 32-bit operation will implicitly zero the upper 32-bits. So doing an inc eax for example ensures the upper 32-bits are zero and no explicit zero extension is needed. The simplest example is:

public static int M(Span<int> span)
{
    int sum = 0;

    for (int i = 0; i < span.Length; i++)
    {
        sum += span[i];
    }

    return sum;
}

Which generates the following:

; Method Program:M(System.Span`1[int]):int (FullOpts)
G_M000_IG01:                ;; offset=0x0000
    ; No prologue

G_M000_IG02:                ;; offset=0x0000
    mov      rax, bword ptr [rcx]         ; load the byref field into rax
    mov      ecx, dword ptr [rcx+0x08]    ; load the length into ecx, implicitly zero-extending
    xor      edx, edx                     ; zero the sum local
    xor      r8d, r8d                     ; zero the index
    test     ecx, ecx                     ; check if the length is 0
    jle      SHORT G_M000_IG04            ; skip loop if it is
    align    [0 bytes for IG03]

G_M000_IG03:                ;; offset=0x000F
    mov      r10d, r8d                    ; copy index into r10d, technically unnecessary
    add      edx, dword ptr [rax+4*r10]   ; load element from the index and add to the sum
    inc      r8d                          ; increment the index
    cmp      r8d, ecx                     ; check if we've hit the bounds
    jl       SHORT G_M000_IG03            ; if not, continue the loop

G_M000_IG04:                ;; offset=0x001E
    mov      eax, edx                     ; move the sum into the return register

G_M000_IG05:                ;; offset=0x0020
       ret                                ; return
; Total bytes of code: 33

If you instead use the 64-bit based indexing, you end up increasing the encoding cost by around 1-byte per instruction (which uses the wider register). It probably won't matter in practice, but its still unnecessary "bloat" to the 99% use case.

huoyaoyuan commented 7 months ago

I think we shouldn't waste much time on how large the benefit is - it is definitely beneficial for many cases.

Instead, the topic should focus more about how world-breaking it would be, for existing 32bit-based code. Do more magic for codegen may also affect the performance.

If you instead use the 64-bit based indexing, you end up increasing the encoding cost by around 1-byte per instruction (which uses the wider register). It probably won't matter in practice, but its still unnecessary "bloat" to the 99% use case.

This looks like xarch specific. Popular RISC architectures use the same size for instruction encoding. This is something to consider but should not matter to much.

MineCake147E commented 7 months ago

Yes. However, if .NET already doesn't support the thing developers shouldn't be doing (in this case for historical reasons) then there is little to no benefit in .NET doing the massively complex work to support it because we'd be adding support for something that no real world scenario should be using in the first place.

The discussion basically stops there as there is no justification for adding support for something which no one should ever be using and for which if users "really" decide they need it, there are plenty of viable workarounds.

OK, I found you're right about this. I agree. I no longer argue about this thing.

This is also isn't really a good argument and there are many reasons why using int is "better". Even if there were support for some NativeArray type, it's entirely likely that the indexer would be nint (not nuint) as there are benefits to using a signed type and many drawbacks to using an unsigned type. There are some conceptual reasons for why nuint is "better", but in practice it tends to lose out and there are many places you may fundamentally require signed anyways.

Would you clarity about what's the drawbacks using unsigned types?

I initially thought that nint is better, but I ended up thinking otherwise. Here's the list of a tiny portion of benefits using unsigned types:

No need for checking if the Length is negative
No need for checking if the index is negative
No need for reserving unused sign bit, hence twice the maximum representable size
Zero extension tends to be faster than sign extension
- Zero extension could be done with renaming
Interoperation with native libraries can be easier

Here's the list of a tiny portion of reasons why unsigned indexing wouldn't harm performance anyway:

Loop unrolling can be done in the similar way that signed indexing offers
Reverse loop can also be done in the similar way that signed indexing offers as well
- Simply replacing i >= 0 with i < Length, because -1 is now MaxValue
Tail-relative index can easily be calculated with Length + ~index anyway

I don't see anything I absolutely require signed index types.

2A5F commented 7 months ago

For using signed, there is a certain reason for the stupid Enumerator design of c#

public ref struct Enumerator
{
    /// <summary>The span being enumerated.</summary>
    private readonly Span<T> _span;
    /// <summary>The next index to yield.</summary>
    private int _index;

    /// <summary>Initialize the enumerator.</summary>
    /// <param name="span">The span to enumerate.</param>
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal Enumerator(Span<T> span)
    {
        _span = span;
        _index = -1;
    }

    /// <summary>Advances the enumerator to the next element of the span.</summary>
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool MoveNext()
    {
        int index = _index + 1;
        if (index < _span.Length)
        {
            _index = index;
            return true;
        }

        return false;
    }

    /// <summary>Gets the element at the current position of the enumerator.</summary>
    public ref T Current
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => ref _span[_index];
    }
}

Java-style iterators is better with unsigned

public ref struct Enumerator
{
    private readonly Span<T> _span;
    private uint _index;

    internal Enumerator(Span<T> span)
    {
        _span = span;
        _index = 0;
    }

    public bool HasNext() => _index < _span.Length;
    public ref T Next() => ref _span[_index++];
}

MineCake147E commented 7 months ago

For using signed, there is a certain reason for the stupid Enumerator design of c#

You can do the same thing with ~0 instead of -1 because both wouldn't be a valid index anyway.

public ref struct Enumerator
{
    private readonly ref T _head;
    private readonly nuint _length;
    /// <summary>The next index to yield.</summary>
    private nuint _index;

    /// <summary>Initialize the enumerator.</summary>
    /// <param name="span">The span to enumerate.</param>
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal Enumerator(Span<T> span)
    {
        _head = ref MemoryMarshal.GetReference(span);
        _length = (uint)span.Length
        _index = ~(nuint)0;
    }

    /// <summary>Advances the enumerator to the next element of the span.</summary>
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool MoveNext()
    {
        var index = ++_index;
        return index < _length;
    }

    /// <summary>Gets the element at the current position of the enumerator.</summary>
    public T Current
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _index < _length ? Unsafe.Add(ref head, _index) : default!;
    }
}

tannergooding commented 7 months ago

I think we shouldn't waste much time on how large the benefit is - it is definitely beneficial for many cases.

That's part of the discussion point. Whether there is actually benefit or whether it is simply convenience.

Many times convenience is a benefit, but other times it may itself be a pit of failure and be inappropriate due to the pain it causes or due to leading users down the wrong path, especially users who may not know better.

Instead, the topic should focus more about how world-breaking it would be, for existing 32bit-based code. Do more magic for codegen may also affect the performance.

.NET cannot make a breaking change here. As has been iterated many times, almost every single for loop written would now have a chance for silent failure. Any such change to make System.Array actually support 64-bits would have to be a hard break where roll forward was explicitly blocked and where there would need to be significant analyzer and language work to help surface the bugs users might hit (even though the vast majority of code would never actually encounter large arrays). There would likely be security and other concerns from such a break as well.

If it was deemed viable, a new array type could be introduced and which new code could opt into using. However, there have been many reasons pointed out as to why this itself isn't viable and how it interplays poorly with a GC. Such a type would reasonably need to be restricted to unmanaged data only (data that isn't itself a reference type and doesn't contain reference types).

The most viable path here would be to introduce some NativeSpan type which is essentially Span<T> but with an nint based length.

This looks like xarch specific. Popular RISC architectures use the same size for instruction encoding. This is something to consider but should not matter to much.

The considerations for RISC architectures were also called out, including the extended instruction sequences often required to work with anything over a 16-bit constant.

Every architecture has tradeoffs here and almost every one ends up being more expensive for larger register sizes. It only varies where that expense is seen (encoding size vs operation count vs number of instructions vs ...).

tannergooding commented 7 months ago

Would you clarity about what's the drawbacks using unsigned types?

Here's the list of a tiny portion of benefits using unsigned types:

Sure.

One of the primary things to note is that using a signed value doesn't mean negative values are allowed, so many of the "benefits" you've listed for unsigned types equally apply to signed. Many of the checks are likewise equivalent due to the two's complement representation.

No need for checking if the Length is negative

The length for types like Array and Span are known to be "never negative" already, its an implicit assumption made by the runtime and which can only be violated by the user unsafely mutating the underlying state (which itself is undefined and dangerous behavior).

No need for checking if the index is negative

You only have a singular check for signed as well. Due to two's complement representation (x < 0) || (x >= KnownPositive) can just be emitted as (uint)x >= KnownPositive. It's an implicit optimization that is done by many compilers since all negatives will have the most significant bit set and will therefore compare as greater than KnownPositive

No need for reserving unused sign bit, hence twice the maximum representable size

This is really a non-issue. Due to having an Operating System, a runtime, general program state, etc; you can never actually have something that is nuint.MaxValue in length.

On a 32-bit system, it's typical for the OS to explicitly reserve at minimum 1GB of memory, so the maximum user reserved space is 3GB, which of course is shared with other systems. Without explicit opt-in, many systems historically limited this user space to 2GB as well, notably.

On 64-bit systems, most modern hardware actually has special support for always masking off the most significant 7-11 bits of the address so that it can be used to encode special information. However, even if this were to ever expand to the full space, it's very likely that the last bit would be reserved for the system regardless. The amount of space required to encode the page tables when you actually have that much memory, the general needs for the system and memory managers to maintain state, the need to represent the actual programs, etc all prevent you from ever actually using all bits, so reserving the last bit is fine.

Zero extension tends to be faster than sign extension Zero extension could be done with renaming

Depends on the CPU. Many modern CPUs have zero-cost sign-extension as well. However, sign-extension is only needed for values that "could" be negative. The general index usage, bounds checks, and well known state for core types means that values are typically known to be never negative and thus can use zero-extension regardless. There are a couple of niche cases otherwise, but those are easy to workaround. -- More notably, if using nint there is no need to extend regardless, its already the right size.

Interoperation with native libraries can be easier

This depends. Many native libraries do take size_t. However, many other common usages like ptrdiff_t and ssize_t are explicitly signed themselves. You really get a mix of both depending on the scenario and many newer libraries have opted to use signed to avoid some of the issues you encounter, especially when doing subtractions and comparisons with unsigned types.

It's ultimately all about tradeoffs and for typical usages the types can be treated equivalently due to knowing the value is never negative. So, you then just need to consider where the differences do come in, the risk around them, and how much is needed to workaround any issues when they do come up.

davidxuang commented 7 months ago

TBH, I don't think it's possible that unsigned integers are ever considered — they are not even CLS-compliant. That's a huge drawback for interop with or implement some languages.

There are several possible ways to introduce huge arrays: new types, simple breaking change, port LongLength to arrays, or make it optional (like the way Python tries to drop GIL in PEP 703) by introduce size_t-like alias. Though I suppose byte buffers may be the only case that long indexer is often needed.

Previous Next