[API Proposal]: Add `Utf8JsonReader.TryGetBytesFromBase64(Span<byte> span, out int bytesWritten)` overload

bill-poole commented 7 months ago

Background and motivation

Reading a Base64-encoded string value from a Utf8JsonReader currently always allocates. However, in high performance scenarios, we want to be able to possibly stackalloc or the output buffer or rent it from the array pool. A Utf8JsonReader.TryGetBytesFromBase64(Span<byte> span) overload would allow the caller to create the buffer in any way.

API Proposal

namespace System.Text.Json
{
    public ref struct Utf8JsonReader
    {
        public bool TryGetBytesFromBase64(Span<byte> span, out int bytesWritten);
    }
}

API Usage

var encodedLength = reader.HasValueSequence ? (int)reader.ValueSequence.Length : reader.ValueSpan.Length;
var buffer = ArrayPool<byte>.Shared.Rent(encodedLength);
if (!reader.TryGetBytesFromBase64(buffer, out var bytesWritten))
{
    throw new JsonException();
}

// Do something with buffer
var decoded = buffer.AsSpan()[..bytesWritten];

// Return the buffer to the array pool
ArrayPool<byte>.Shared.Return(buffer);

Alternative Designs

It's currently possible to do this if the value is not escaped by writing code that uses System.Buffers.Text.Base64, but the logic to do this is already inside the System.Text.Json library and just needs to be exposed.

Risks

No response

dotnet-policy-service[bot] commented 7 months ago

Tagging subscribers to this area: @dotnet/area-system-buffers See info in area-owners.md if you want to be subscribed.

eiriktsarpalis commented 7 months ago

Have you tried using one of the Utf8JsonReader.CopyString methods? They don't decode to Base64 directly but you can use an intermediate pooled that can then be decoded as appropriate.

bill-poole commented 7 months ago

No, I didn't think of doing that. But, wouldn't that be at the cost of an additional copy from the Utf8JsonReader to that buffer? i.e., instead of Base64 decoding directly from the underlying UTF-8 buffer to the given Span buffer, we'd be copying (and unescaping if necessary) to a buffer, then Base64 decoding from that buffer into another buffer.

If the value needs to be escaped, then perhaps the .NET library will copy twice anyway, but that would only be in the 0.0001% case that a Base64 value actually needs to be unescaped. Base64 values only use ASCII characters, so should never actually be escaped when serialized. But I recognize that the JSON could in theory represent some or all of the Base64 characters with Unicode escaped characters. So we're really talking about an extreme edge case here.

The vastly more common scenario is that we have Base64 text inside the UTF-8 JSON text that does not need to be escaped, which we want to decode to a user-provided Span<byte> buffer in a single step. The purpose of the proposed API is to decode Base64-encoded property values efficiently. It is inefficient to heap allocate the output buffer, but it is arguably even more inefficient to copy the Base64-encoded string to a buffer, and then decode that to yet another buffer.

Apologies if I've misunderstood something.

eiriktsarpalis commented 7 months ago

What we do internally for cases like that is only call CopyString when the string is found to be escaped or uses ReadOnlySequence:

https://github.com/dotnet/runtime/blob/7def0b725852d55741262f43ed36ed501606f89f/src/libraries/System.Text.Json/src/System/Text/Json/Serialization/Converters/Value/TimeSpanConverter.cs#L40-L51

This ensures that you don't make an intermediate copy for the 99% case, ensuring correctness for the remaining 1% and never allocating unpooled buffers.

bill-poole commented 7 months ago

Okay, good point. So then it is possible to replicate what the System.Text.Json library does with same/similar performance, but there's still a quite bit of tricky code we need to duplicate that already exists inside the System.Text.Json library. An extension method that might do the trick is below. This proposal is then about obviating the need for every user library that needs this to implement the extension method below.

internal static bool TryGetBytesFromBase64(this ref Utf8JsonReader reader, Span<byte> span, out int bytesWritten)
{
    byte[]? encodedArray = null;
    scoped ReadOnlySpan<byte> encodedSpan;
    if (reader.ValueIsEscaped || reader.HasValueSequence)
    {
        var encodedLength = reader.HasValueSequence ? (int)reader.ValueSequence.Length : reader.ValueSpan.Length;
        encodedSpan = encodedLength <= StackallocByteThreshold
            ? stackalloc byte[StackallocByteThreshold]
            : (encodedArray = ArrayPool<byte>.Shared.Rent(encodedLength));
        encodedSpan = encodedArray.AsSpan()[..reader.CopyString(encodedArray)];
    }
    else
    {
        encodedSpan = reader.ValueSpan;
    }

    var status = Base64.DecodeFromUtf8(encodedSpan, span, out var bytesConsumed, out bytesWritten);
    if (encodedArray is not null)
    {
        ArrayPool<byte>.Shared.Return(encodedArray);
    }

    Debug.Assert(bytesConsumed == encodedSpan.Length);
    return status == OperationStatus.Done;
}

This extension method is quite tricky, so it would be useful if it the functionality were to be exposed directly from the Utf8JsonReader struct.

eiriktsarpalis commented 7 months ago

We might consider it in the future, but it's unlikely we would prioritize it given that there is a workaround.

dotnet / runtime