dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Add ability to stream large strings in Utf8JsonWriter/Utf8JsonReader #67337

Open anhadi2 opened 2 years ago

anhadi2 commented 2 years ago

EDIT See https://github.com/dotnet/runtime/issues/67337#issuecomment-1812408212 for an API proposal.

Background and motivation

I have a requirement to write large binary content in JSON. In order to do this, I need to encode it to Base64 before writing. The resulting JSON looks like this:

{
  "data": "large_base64_encoded_string"
}

I have a PipeReader from which I read bytes in a loop, appending them to a list of bytes. I then convert the list to a byte array, encode it as a Base64 string, and use WriteStringValue to write it.

public void WriteBinaryContent(PipeReader reader, Utf8JsonWriter writer)
{
    writer.WriteStartObject();
    writer.WritePropertyName("data");
    byte[] byteArray = ReadBinaryData(reader);
    string base64data = Convert.ToBase64String(byteArray);
    writer.WriteStringValue(base64data);
    writer.WriteEndObject();
}

public byte[] ReadBinaryData(PipeReader reader)
{
    List<byte> bytes = new List<byte>();
    while (reader.TryRead(out ReadResult result))
    {
        ReadOnlySequence<byte> buffer = result.Buffer;
        bytes.AddRange(buffer.ToArray());

        // Tell the PipeReader how much of the buffer has been consumed.
        reader.AdvanceTo(buffer.End);

        // Stop reading if there's no more data coming.
        if (result.IsCompleted)
        {
            break;
        }
    }

    // Mark the PipeReader as complete.
    reader.Complete();
    byte[] byteArray = bytes.ToArray();
    return byteArray;
}

The problem with this approach is excessive memory consumption. We need to keep the whole binary content in memory, convert it to Base64, and then write it.

Memory consumption is critical when using Utf8JsonWriter in the override of JsonConverter.Write() method in a web application.
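For illustration, a minimal sketch of the converter shape in question (the names are illustrative, not our actual code); the whole payload has to be materialized before a single write call:

using System;
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

// Illustrative only: today a converter has to buffer the entire payload
// before it can be written as one Base64 string value.
public class BufferingBinaryConverter : JsonConverter<Stream>
{
    public override Stream Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
        => throw new NotSupportedException();

    public override void Write(Utf8JsonWriter writer, Stream value, JsonSerializerOptions options)
    {
        using var buffer = new MemoryStream();
        value.CopyTo(buffer);                            // entire payload lands in memory
        writer.WriteBase64StringValue(buffer.ToArray()); // plus another copy for the array
    }
}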

Instead I am proposing a way to stream large binary content.

API Proposal

namespace System.Text.Json
{
    public class Utf8JsonWriter : IAsyncDisposable, IDisposable
    {
        // Returns a stream to which binary data can be written.
        // The writer Base64-encodes it and writes it to the output stream.
        public Stream CreateBinaryWriteStream();
    }
}

API Usage

public void WriteBinaryContent(PipeReader reader, Utf8JsonWriter writer)
{
    writer.WriteStartObject();
    writer.WritePropertyName("data");
    using (Stream binaryStream = writer.CreateBinaryWriteStream())
    {
        StreamBinaryData(reader, binaryStream);
        binaryStream.Flush();
    }
    writer.WriteEndObject();
}

public void StreamBinaryData(PipeReader reader, Stream stream)
{
    while (reader.TryRead(out ReadResult result))
    {
        ReadOnlySequence<byte> buffer = result.Buffer;
        byte[] byteArray = buffer.ToArray();
        stream.Write(byteArray, 0, byteArray.Length);
        stream.Flush();
        // Tell the PipeReader how much of the buffer has been consumed.
        reader.AdvanceTo(buffer.End);

        // Stop reading if there's no more data coming.
        if (result.IsCompleted)
        {
            break;
        }
    }

    // Mark the PipeReader as complete.
    reader.Complete();
}

Alternative Designs

No response

Risks

No response

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.

teo-tsirpanis commented 2 years ago

I think it should be an IBufferWriter<byte> instead of a Stream. Even better, a struct implementing IBufferWriter, with a Complete method that tells the Utf8JsonWriter that writing the big string finished.

FiniteReality commented 2 years ago

with a Complete method that tells the Utf8JsonWriter that writing the big string finished.

Personally, I prefer the idea of using IDisposable here and then having the Dispose method act as the Complete method you're talking about. This is similar to how other types work like log scopes.

teo-tsirpanis commented 2 years ago

That's also an option, since this is the type's only purpose.
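A rough sketch of the shape being discussed in this exchange, in the same declaration-only style as the proposal above (all names are hypothetical; nothing like this exists on Utf8JsonWriter today):

namespace System.Text.Json
{
    public partial class Utf8JsonWriter
    {
        // Hypothetical: begins a large string value and returns a scope to write it through.
        public LargeStringScope CreateStringValueScope();
    }

    // Hypothetical: an IBufferWriter<byte>-based scope. Dispose() plays the role of the
    // Complete() method, emitting the closing quote and handing control back to the writer.
    public struct LargeStringScope : System.Buffers.IBufferWriter<byte>, IDisposable
    {
        public Memory<byte> GetMemory(int sizeHint = 0);
        public Span<byte> GetSpan(int sizeHint = 0);
        public void Advance(int count);
        public void Dispose();
    }
}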

ericstj commented 2 years ago

I have a requirement to write large binary content in json.

What's the reason for this constraint? Wouldn't it be more efficient to serve it up directly instead of embedding in a JSON payload?

anhadi2 commented 2 years ago

I have a requirement to write large binary content in json.

What's the reason for this constraint? Wouldn't it be more efficient to serve it up directly instead of embedding in a JSON payload?

We want to serve multiple binary payloads in the JSON, along with some additional fields. The actual response looks like this:

{
  "value": [
    {
      "field1": "some_small_string",
      "data": "large_base64_encoded_string"
    },
    {
      "field1": "some_small_string",
      "data": "large_base64_encoded_string"
    }
  ]
}
krwq commented 2 years ago

@anhadi2 wouldn't it make more sense to append that data after the JSON is complete? I.e.:

{
  "value": [
    {
      "field1": "some_small_string",
      "data": "$binary_payload"
      //or "binary_payload": true or some other marker to tell "I'll be writing rest later"
    },
    {
      "field2": "some_small_string",
      "data": "$binary_payload"
    }
  ]
}
<base64 of payload for field 1, this could even be 4 bytes of length + data directly>
<base64 of payload for field 2>
anhadi2 commented 2 years ago

@anhadi2 wouldn't it make more sense to append that data after the JSON is complete? I.e.:

{
  "value": [
    {
      "field1": "some_small_string",
      "data": "$binary_payload"
      //or "binary_payload": true or some other marker to tell "I'll be writing rest later"
    },
    {
      "field2": "some_small_string",
      "data": "$binary_payload"
    }
  ]
}
<base64 of payload for field 1, this could even be 4 bytes of length + data directly>
<base64 of payload for field 2>

This will not work for us. We want the response to be valid JSON.

teo-tsirpanis commented 2 years ago

Something you could do is put a URL to the data in your JSON instead of the data themselves. The data will be transmitted in binary and much more efficiently than Base64.

Unless you want to persist this JSON file.

anhadi2 commented 2 years ago

Something you could do is put a URL to the data in your JSON instead of the data themselves. The data will be transmitted in binary and much more efficiently than Base64.

Unless you want to persist this JSON file.

Thanks for the suggestion. Yes, there might be other ways to do this. However, the response format is already decided and we cannot change it at this stage. Hence we are looking for an efficient way to transfer the Base64-encoded binary payload as part of the JSON response body.

krwq commented 2 years ago

@anhadi2 would https://github.com/dotnet/runtime/issues/68223 help for your use case?

ghost commented 2 years ago

This issue has been marked needs-author-action and may be missing some important information.

ghost commented 2 years ago

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

ghost commented 2 years ago

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.

oocx commented 1 year ago

Such an API would be useful for us as well. We are sending data to a third-party API that we cannot change, and that API expects files to be sent Base64-encoded as part of a JSON object. The documents can be up to 100 MB in size; we'd like to avoid having to load the complete document into memory.

mriehm commented 1 year ago

We are in the same position of sending large base64 data in JSON to a 3rd party and being unable to change the contract.

@krwq You added the needs-author-action tag which seems to have led to this issue being closed. Since a couple other people have responded now requesting this feature, can it be reopened?

To answer your question about https://github.com/dotnet/runtime/issues/68223, it would not solve the issue that all the data needs to be present in memory at once.

In the meantime, I'll say to anyone else searching for a solution: if you have access to the underlying output Stream, this ugly workaround should be viable:

static void WriteStreamValue(Stream outputStream, Utf8JsonWriter writer, Stream inputStream)
{
    writer.Flush();
    outputStream.Write(Encoding.UTF8.GetBytes("\""));

    using (CryptoStream base64Stream = new CryptoStream(outputStream, new ToBase64Transform(), CryptoStreamMode.Write, true))
    {
        inputStream.CopyTo(base64Stream);
    }

    writer.WriteRawValue("\"", skipInputValidation: true);
}
teo-tsirpanis commented 1 year ago

Reopening.

ghost commented 1 year ago

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

davidfowl commented 1 year ago

What does the consumption side of this look like?

anhadi2 commented 1 year ago

@davidfowl We are going to use this in a JsonConverter, where we get an instance of Utf8JsonWriter. We want to have a custom JSON writer for our response object. Reference: https://docs.microsoft.com/en-us/dotnet/standard/serialization/system-text-json-converters-how-to?pivots=dotnet-5-0#steps-to-follow-the-basic-pattern

davidfowl commented 1 year ago

You don't care about getting a stream back there, right?

anhadi2 commented 1 year ago

Here is what I think we would want to do at the consumption side:

  1. Tell Utf8JsonWriter that we want to write large binary data (and get some object back).
  2. Start writing binary chunks.
  3. Tell Utf8JsonWriter that we have finished writing the binary data. Also, we should have a way to keep flushing data periodically to keep the overall memory low.

In step 1, we don't necessarily need a Stream back; any object we can write binary data to would work. For step 2, since we would be sending binary chunks, Utf8JsonWriter would need to handle chunked Base64 encoding, roughly as in the sketch below.
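Roughly, assuming a segment-writing API along the lines of what is proposed further down in this thread (WriteBase64StringSegment is used here only for illustration), the three steps could look like this:

// Sketch only: WriteBase64StringSegment is hypothetical at this point in the thread;
// it matches the segment-style shape proposed in later comments.
public void StreamBinaryData(PipeReader reader, Utf8JsonWriter writer)
{
    // Step 1: write the property name; the first segment call below starts the string value.
    writer.WritePropertyName("data");

    while (reader.TryRead(out ReadResult result))
    {
        foreach (ReadOnlyMemory<byte> segment in result.Buffer)
        {
            // Step 2: write a chunk; the writer would Base64-encode it, carrying any
            // partial 3-byte group over to the next call.
            writer.WriteBase64StringSegment(segment.Span, isFinalSegment: false);
        }

        reader.AdvanceTo(result.Buffer.End);

        // Flush periodically so buffered output stays small (assumes partial values may be flushed).
        writer.Flush();

        if (result.IsCompleted)
        {
            break;
        }
    }

    // Step 3: tell the writer the binary value is finished (assuming an empty final
    // segment simply closes the string).
    writer.WriteBase64StringSegment(ReadOnlySpan<byte>.Empty, isFinalSegment: true);
    reader.Complete();
}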

eiriktsarpalis commented 1 year ago

If we do add such an API, we should make sure that the built-in StringConverter uses it once it detects that the string size exceeds some threshold. In other words, we should ensure that the converter is resumable and supports streaming serialization. Given that strings are common, we should be careful that this doesn't regress performance in the 99.99% case where strings are small.

eiriktsarpalis commented 7 months ago

API Proposal

public partial class Utf8JsonWriter
{
    void WriteStringValue(ReadOnlySpan<char> value);
    void WriteStringValue(ReadOnlySpan<byte> value);
    void WriteBase64String(ReadOnlySpan<byte> value);

+    void WriteStringValueSegment(ReadOnlySpan<char> value, bool isFinalSegment);
+    void WriteStringValueSegment(ReadOnlySpan<byte> value, bool isFinalSegment); 
+    void WriteBase64StringSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
}

API Usage

We can then update the built-in StringConverter to take advantage of the new API like so:

public class StringConverter : JsonResumableConverter<string>
{
    internal override bool OnTryWrite(
        Utf8JsonWriter writer,
        string value,
        JsonSerializerOptions options,
        ref WriteStack state)
    {
        int maxChunkSize = options.MaxChunkSize; // Some constant or value deriving from JsonSerializerOptions.DefaultBufferSize
        if (value is null || value.Length <= maxChunkSize)
        {
            Write(value, options); // Usual Write routine
            return true;
        }

        // Fall back to chunked writes
        int charsWritten = state.CharsWritten;
        ReadOnlySpan<char> remaining = value.AsSpan(charsWritten);
        while (remaining.Length > maxChunkSize)
        {
            ReadOnlySpan<char> chunk = remaining.Slice(0, maxChunkSize);
            writer.WriteStringValueSegment(chunk, isFinalSegment: false);
            remaining = remaining.Slice(maxChunkSize);
            charsWritten += maxChunkSize;

            if (ShouldFlush(writer, ref state))
            {
                // Commit partial write info to the state object
                state.CharsWritten = charsWritten;
                return false;
            }
        }

        writer.WriteStringValueSegment(remaining, isFinalSegment: true);
        return true;
    }
}

It seems possible we could invent a dual concept for reading large strings as well, but that would require augmenting Utf8JsonReader with partially loaded string tokens. I'm not sure how feasible that would be.

Remarks

Checking that the segments form valid UTF-8/Base64 strings should be done by Utf8JsonWriter, which would require tracking state for partial character encodings/surrogate pairs and for Base64 output padding.
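For example, Base64 output can only be produced from whole 3-byte groups, so a segment-aware writer has to carry 0–2 pending bytes between calls (and, for UTF-16 input, a possible unpaired high surrogate). A minimal illustration of the Base64 bookkeeping, not the actual Utf8JsonWriter implementation, built on the existing System.Buffers.Text.Base64 helpers:

using System;
using System.Buffers.Text;

// Demonstrates the per-segment state only; a real implementation would write into the
// writer's output buffer instead of allocating arrays.
class ChunkedBase64Encoder
{
    private readonly byte[] _leftover = new byte[2];
    private int _leftoverCount;

    public byte[] EncodeSegment(ReadOnlySpan<byte> segment, bool isFinalSegment)
    {
        // Prepend the bytes left over from the previous segment.
        byte[] input = new byte[_leftoverCount + segment.Length];
        _leftover.AsSpan(0, _leftoverCount).CopyTo(input);
        segment.CopyTo(input.AsSpan(_leftoverCount));

        byte[] output = new byte[Base64.GetMaxEncodedToUtf8Length(input.Length)];
        Base64.EncodeToUtf8(input, output, out int consumed, out int written, isFinalBlock: isFinalSegment);

        // With isFinalBlock: false, only whole 3-byte groups are consumed; the remainder
        // (0-2 bytes) must be carried into the next call. The final segment adds padding.
        _leftoverCount = input.Length - consumed;
        input.AsSpan(consumed, _leftoverCount).CopyTo(_leftover);

        return output.AsSpan(0, written).ToArray();
    }
}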

eiriktsarpalis commented 7 months ago

It seems possible we could invent a dual concept for reading large strings as well, but that would require augmenting Utf8JsonReader with partially loaded string tokens. I'm not sure how feasible that would be.

To elaborate a bit more on the above, here's what a chunking API might look like on Utf8JsonReader:

public enum JsonTokenType
{
    String,
+    StringSegment,
}

public ref struct Utf8JsonReader
{
    public int CopyString(Span<byte> destination);
+    public int CopyStringSegment(Span<byte> destination, out bool isFinalSegment);
}

Which could be consumed by the converter as follows:

public class StringConverter : JsonResumableConverter<string>
{
    internal override bool OnTryRead(
        ref Utf8JsonReader reader,
        Type typeToConvert,
        JsonSerializerOptions options,
        ref ReadStack state,
        [MaybeNullWhen(false)] out string? value)
    {
        if (reader.TokenType is JsonTokenType.Null or JsonTokenType.String)
        {
            value = reader.GetString();
            return true;
        }

        if (reader.TokenType is not JsonTokenType.StringSegment)
        {
            throw new InvalidOperationException();
        }

        List<byte[]> chunks = state.Chunks ??= [];
        while (true)
        {
            Debug.Assert(reader.TokenType is JsonTokenType.StringSegment);
            byte[] chunk = new byte[reader.ValueSpan.Length]; // TODO buffer pooling
            reader.CopyStringSegment(chunk, out bool isFinalSegment);
            chunks.Add(chunk);

            if (isFinalSegment)
            {
                // Imaginary API decoding a string from UTF-8 fragments.
                value = Encoding.UTF8.GetStringFromChunks(chunks);
                return true;
            }

            if (!reader.Read())
            {
                value = null;
                return false;
            }
        }
    }
}

The problem is, I can't think of a way to include the new token type without breaking existing Utf8JsonReader consumers.

davidfowl commented 7 months ago

@eiriktsarpalis This wouldn't be usable with custom converters in its current form. It only works for types the JSON serializer knows about.

benaadams commented 7 months ago

We have a similar issue here with a chain of custom converters and large amounts of data (0x-prefixed hex rather than Base64). One additional problem is that even doing an await JsonSerializer.SerializeAsync while inside the converters builds up the entire JSON tree for that property in memory rather than flushing it to the underlying Stream.

eiriktsarpalis commented 7 months ago

The proposed APIs are necessary but not sufficient to support large string streaming in custom converters. The second missing ingredient is making resumable converters public, tracked by https://github.com/dotnet/runtime/issues/63795.

terrajobst commented 6 months ago
namespace System.Text.Json;

public partial class Utf8JsonWriter
{
    // Existing
    // public void WriteStringValue(ReadOnlySpan<char> value);
    // public void WriteStringValue(ReadOnlySpan<byte> value);
    // public void WriteBase64String(ReadOnlySpan<byte> value);

    public void WriteStringValueSegment(ReadOnlySpan<char> value, bool isFinalSegment);
    public void WriteStringValueSegment(ReadOnlySpan<byte> value, bool isFinalSegment); 
    public void WriteBase64StringSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
}
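For illustration, streaming a long string value out of a TextReader with this surface might look roughly like the sketch below (the chunk size, the Peek-based end-of-input check, and flushing mid-value are assumptions, not part of the approved design):

// Rough usage sketch, not from the approved proposal.
static void WriteLongStringValue(Utf8JsonWriter writer, TextReader input)
{
    char[] buffer = new char[8 * 1024];
    int read;
    bool wroteAnySegment = false;

    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        bool isFinal = input.Peek() == -1;  // no more characters to come
        writer.WriteStringValueSegment(buffer.AsSpan(0, read), isFinalSegment: isFinal);
        writer.Flush();                     // assumes partially written values may be flushed
        wroteAnySegment = true;
    }

    if (!wroteAnySegment)
    {
        writer.WriteStringValue(string.Empty); // degenerate case: empty input
    }
}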
JustDre commented 5 months ago

I came across this issue when trying to create a JsonConverter<Stream>. It seems to me the simplest API would just be

public void WriteBase64String(Stream value);

No need to pass a flag as the Stream is self-describing. It makes for efficient CopyTo, chaining, etc., and is memory-efficient as long as all the participating Stream types are chunking.
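If the approved segment methods ship, a Stream-accepting convenience along these lines could presumably be layered on top in user code rather than baked into Utf8JsonWriter; a hypothetical extension-method sketch (not a proposed API):

using System;
using System.IO;
using System.Text.Json;

public static class Utf8JsonWriterExtensions
{
    // Hypothetical helper, not a runtime API: Base64-encodes a Stream as a single JSON
    // string value by way of the approved WriteBase64StringSegment method.
    public static void WriteBase64StringValue(this Utf8JsonWriter writer, Stream value)
    {
        byte[] buffer = new byte[16 * 1024];
        int read;
        while ((read = value.Read(buffer, 0, buffer.Length)) > 0)
        {
            writer.WriteBase64StringSegment(buffer.AsSpan(0, read), isFinalSegment: false);
        }

        // Assumes an empty final segment simply terminates the string.
        writer.WriteBase64StringSegment(ReadOnlySpan<byte>.Empty, isFinalSegment: true);
    }
}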

habbes commented 5 months ago

@eiriktsarpalis @terrajobst We're facing similar issues with OData libraries for some customers who have large text or byte[] fields in their payloads. I like the WriteStringValueSegment API. We would want to be able to write large string/byte[] values in chunks and flush periodically to avoid resizing the buffer. We would like to flush/write to the stream asynchronously to avoid synchronous I/O.

Tragetaschen commented 2 months ago

Back in the day, https://github.com/dotnet/runtime/issues/68223#issuecomment-1216955662 removed the

public void WriteRawValue(ReadOnlySequence<char> json, bool skipInputValidation = false);

overload due to the surrogate-pair handling it would require. This new API would need that handling anyway, and it would also be the building block for that method's implementation. It might be a good time to reintroduce it.

It would also tie in nicely with https://github.com/dotnet/runtime/issues/97570, for example.

SteveSandersonMS commented 1 week ago

AI-related use case:

When receiving a response from an LLM in JSON format (e.g., with response format = json_object for OpenAI), it might represent something you want to process in a streaming way, for example a chatbot answering a question whose answer you want to show up in real time in the UI. It might be returning a JSON object like:

{ "confidence": 1, "citations": [...], "answer_text": "A really long string goes here, so long that you want to see it arrive incrementally in the UI" }

Currently that's not really viable to do with System.Text.Json.

While we don't have an API design in mind, here are some possibilities:

Evidence required

TBH we're still short on evidence that people really need to do this when working with an LLM, because:

  1. As far as we know, the main reason to want to do streaming at all when working with an LLM is to present the output in realtime in a chat-style UI. But in most cases a chat response will be done as plain text, not JSON. Even if the developer is thinking of using JSON because they want structured info (e.g., an array of citations, or a flag to indicate something about the response type), there are usually better alternatives. Examples:
    • If you want flags indicating the nature of the response, either have that determined by a JSON-returning planning phase that executes before you ask the LLM for its final response, or tell the LLM to embed the flag in its response in some format you decide (e.g., <is_low_certainty>).
    • If you want structured citations, tell the LLM to return them inline in the text in a known format you can find with a regex, e.g., <cite id="123">text</cite>.
  2. In non-chat scenarios, non-UI code will simply wait until the response completes before it takes action on it, so it doesn't need to parse in a streaming way.
  3. Even if you could do this via Utf8JsonReader, it would result in extremely painful code since you couldn't use a regular deserialization API and would instead have to write some kind of state machine that interprets a single specific JSON format.

Here's one bit of evidence that people will want to parse large strings in streaming JSON: https://www.boundaryml.com/blog/nextjs-rag-streaming-example. Again, it's not the only way to do this, but it suggests some people will want to.

If anyone reading this has clear examples of LLM-related cases where they find it desirable to process a JSON response in a streaming way, please let us know!