dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.56k stars 4.55k forks source link

[API Proposal]: Expose flattened index get/set on Array #101244

Open JeremyKuhne opened 2 months ago

JeremyKuhne commented 2 months ago

Background and motivation

When serializing/deserializing array data it is costly and error prone to iterate through multidimensional arrays. Ultimately all Array apis flatten the index when getting/setting. Iterating through the data without having to compose the indices array would be much easier.

(Cost and complexity is around creating and maintaining indices. BinaryFormatter code to do this is fairly complicated and they just get flattened by Array.SetValue with additional cost.)

API Proposal

namespace System.Collections.Generic;

public class Array
{
+    public void SetValueByFlattenedIndex(object? value, long flattenedIndex)
+    {
+        // Check for out of range
+        InternalSetValue(value, flattenedIndex);
+    }

+    public object? GetValueByFlattenedIndex(long flattenedIndex);
}

API Usage

public void SetArrayData(Array array, object?[] rawData)
{
    for (int i = 0; i < rawData.Length; i++)
    {
        array.SetValueByFlattenedIndex(rawData[0], i);
    }
}

Alternative Designs

One can grab spans of the raw data if one knows the T by pinning or doing something like this (I think?):

// Should be ok if array.GetType().UnderlyingSystemType.IsAssignableFrom(typeof(T))?
Span<T> data = MemoryMarshal.CreateSpan(ref Unsafe.As<byte, T>(
    ref MemoryMarshal.GetArrayDataReference(array)),
    checked((int)array.LongLength));

I presume that you can use object? as the T if you have reference types?

If you have a value type that has object references in it and all you have is Type, I'm not sure there is a way to safely set a given raw offset other than calling Array.SetValue. (Or possibly ArrayCopy with a smaller array with the same number of dimensions, but I haven't validated.)

Risks

No response

dotnet-policy-service[bot] commented 2 months ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

adamsitnik commented 2 months ago

The lack of this API was a PITA for me when I was implementing the Payload Reader for BF. FWIW this is what I ended up doing: https://github.com/adamsitnik/SafePayloadReader/blob/787b24d4e535ddee181b8e7da2546fe258735ef7/System.Runtime.Serialization.BinaryFormat/Records/RectangularOrCustomOffsetArrayRecord.cs#L50-L123

tannergooding commented 2 months ago

This is notably a "dangerous" operation. While you can't exactly AV (provided you still check the index is within the flattened length of the array), you are still bypassing the bounds checks indicated by the array's shape and this can easily cause corruption or other issues if the developer isn't careful. -- Accessing linear index 16 could be, row 4, column 4; row 2, column 8; row 8, column 2, etc; and the actual visual location changed is different for each of these. A common bug in image manipulation software is caused by this kind of issue, where you accidentally increment the column past the end of a row and start drawing into the next row instead, when it actually should have terminated.

I think it would be beneficial to have, but I do very much think it should be very explicit in its name that the operation is Unsafe and there should likely be some version that is generic in order to avoid any boxing cost. -- As an alternative, perhaps these should be APIs in MemoryMarshal or extension methods typed over the most common dimension sizes (T[,], T[,,], etc).

JeremyKuhne commented 2 months ago

This is notably a "dangerous" operation.

@tannergooding The dimension layout is deterministic. I presume you mean dangerous in that you don't translate the indices to linear index correctly? This is precisely the issue I want to avoid for de/serialization by never messing with the indices array.

JeremyKuhne commented 2 months ago

As an alternative, perhaps these should be APIs in MemoryMarshal or extension methods typed over the most common dimension sizes (T[,], T[,,], etc).

Note I need to support any rank (2 - 32) and any (not previously known) Type.

tannergooding commented 2 months ago

The issue being that a uint[,] array = uint[1080, 1920] has a Length of 2_073_600. Thus any linear index [0, 2_073_599] is valid, but not all linear indices computed as (y * 1920) + x are valid. For example, array[1, 1920] is itself invalid and would throw, but used in the standard computation it accesses linear index 3840, which is actually array[2, 0].

So in general, you're losing safety, making this is an "unsafe" operation. So users should accordingly have it made very clear that this is the case.

This type of miscomputation is a super common bug, especially in more complex loops, coordinate calculation logic, etc. A simple example is when you're taking something like a user command DrawLine(start, end) which operates in "world space" and then you're translating that to "screen space". The actual line can extend outside the screen space, and if you've not doing clipping correctly, you'll end up drawing into the left side of the buffer instead. This is then missed because its still "in bounds" of the overall image space and it may only appear on the screen for a fraction of a second in some cases.

The same general issue extends outside image processing/rasterization as well, and applies to any type of multi-dimensional access. In general the best thing would be for the JIT to recognize the general pattern of for (int y = 0; ...) { for (int x = 0; ...) { array[y, x] = ...; } } and elide bounds checks itself. An API as proposed here is the next best thing and an unsafe workaround since it bypasses those bounds checks.

JeremyKuhne commented 2 months ago

So in general, you're losing safety, making this is an "unsafe" operation. So users should accordingly have it made very clear that this is the case.

I'm not opposed to using "Unsafe" in the API name. I suspect that it would be unlikely that people would use this one over the one that takes indices if they have them. That said, we should probably add an overload that takes ReadOnlySpan<int> for the mentioned SetValue.

JeremyKuhne commented 2 months ago

Here is how I'm currently trying to decompose a flattened index:

int ranks = array.Rank;

Span<int> rankProducts = stackalloc int[ranks];
rankProducts[ranks - 1] = 1;
for (int i = ranks - 2; i >= 0; i--)
{
    rankProducts[i] = array.GetLength(i + 1) * rankProducts[i + 1];
}

Span<int> indices = stackalloc int[ranks];
for (int i = 0; i < ranks; i++)
{
    indices[i] = flattenedIndex / rankProducts[i];
    flattenedIndex -= indices[i] * rankProducts[i];
}

array.SetValue(value, indices.ToArray());
jkotas commented 2 months ago

Array.Copy, Array.Clear, Buffer.BlockCopy and number of other APIs accept flattened index for multidimensional arrays. None of these APIs have unsafe in the name.

BTW: The flattened index has odd behavior for arrays with lower bounds in Array.Copy and Array.Clear. The implementation subtracts the first lower bound from it, but no other lower bounds.

danmoseley commented 2 months ago

Are jagged arrays a relevant case? Or Array class does not represent those in the same way (not at desk to try)

jkotas commented 2 months ago

Flattened index does not make sense for jagged arrays.