Open anhadi2 opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.
Author: anhadi2
Assignees: -
Labels: `api-suggestion`, `area-System.IO`, `untriaged`
Milestone: -
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.
Author: anhadi2
Assignees: -
Labels: `api-suggestion`, `area-System.Text.Json`, `untriaged`
Milestone: -
I think it should be an `IBufferWriter<byte>` instead of a `Stream`. Even better, a struct implementing `IBufferWriter`, with a `Complete` method that tells the `Utf8JsonWriter` that writing the big string finished.
Personally, I prefer the idea of using `IDisposable` here and then having the `Dispose` method act as the `Complete` method you're talking about. This is similar to how other types work, like log scopes.
That's also an option, since this is the type's only purpose.
I have a requirement to write large binary content in json.
What's the reason for this constraint? Wouldn't it be more efficient to serve it up directly instead of embedding in a JSON payload?
We want to serve multiple binary payloads in the JSON along with some additional fields. The actual JSON will contain some more fields, and the response looks like:
{
  "value": [
    {
      "field1": "some_small_string",
      "data": "large_base64_encoded_string"
    },
    {
      "field1": "some_small_string",
      "data": "large_base64_encoded_string"
    }
  ]
}
@anhadi2 wouldn't it make more sense to append that data after the JSON is complete? I.e.:
{
  "value": [
    {
      "field1": "some_small_string",
      "data": "$binary_payload"
      // or "binary_payload": true or some other marker to tell "I'll be writing rest later"
    },
    {
      "field2": "some_small_string",
      "data": "$binary_payload"
    }
  ]
}
<base64 of payload for field 1, this could even be 4 bytes of length + data directly>
<base64 of payload for field 2>
This will not work for us. We want the response to be valid JSON.
Something you could do is put a URL to the data in your JSON instead of the data themselves. The data would then be transmitted in binary and much more efficiently than Base64. Unless you want to persist this JSON file, that is.
Thanks for the suggestion. Yes, there might be other ways to do this. However, the response format is already decided and we cannot change it at this stage. Hence, we are looking for an efficient way to transfer the Base64-encoded binary payload as part of the JSON response body.
@anhadi2 would https://github.com/dotnet/runtime/issues/68223 help for your use case?
This issue has been marked `needs-author-action` and may be missing some important information.
This issue has been automatically marked `no-recent-activity` because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove `no-recent-activity`.
This issue will now be closed since it had been marked `no-recent-activity` but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.
Such an API would be useful for us as well. We are sending data to a 3rd-party API that we cannot change, and that API expects files to be sent Base64-encoded as part of a JSON object. The documents can be up to 100 MB in size; we'd like to avoid having to load the complete document into memory.
We are in the same position of sending large base64 data in JSON to a 3rd party and being unable to change the contract.
@krwq You added the needs-author-action tag which seems to have led to this issue being closed. Since a couple other people have responded now requesting this feature, can it be reopened?
To answer your question about https://github.com/dotnet/runtime/issues/68223, it would not solve the issue that all the data needs to be present in memory at once.
In the meantime, I'll say to anyone else searching for a solution: if you have access to the underlying output Stream, then this ugly workaround should be viable:
// Requires System.IO, System.Security.Cryptography, System.Text, and System.Text.Json.
static void WriteStreamValue(Stream outputStream, Utf8JsonWriter writer, Stream inputStream)
{
    // Flush any buffered JSON so the raw bytes below land in the right position.
    writer.Flush();

    // Open the JSON string literal directly on the underlying stream.
    outputStream.Write(Encoding.UTF8.GetBytes("\""));

    // Stream the payload through a Base64 transform straight into the output stream.
    using (CryptoStream base64Stream = new CryptoStream(outputStream, new ToBase64Transform(), CryptoStreamMode.Write, leaveOpen: true))
    {
        inputStream.CopyTo(base64Stream);
    }

    // Close the string literal via the writer so its internal state stays consistent.
    writer.WriteRawValue("\"", skipInputValidation: true);
}
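For reference, here is a hypothetical call site for the workaround above; it only works when you control both the `Utf8JsonWriter` and the underlying `Stream`, and the file names are placeholders:

// Hypothetical call site for WriteStreamValue; file names are placeholders.
using var output = File.Create("payload.json");
using var writer = new Utf8JsonWriter(output);
using var input = File.OpenRead("blob.bin");

writer.WriteStartObject();
writer.WriteString("field1", "some_small_string");
writer.WritePropertyName("data");
WriteStreamValue(output, writer, input); // Base64-encodes the blob straight onto the output stream
writer.WriteEndObject();
writer.Flush();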
Reopening.
This issue has been automatically marked `no-recent-activity` because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove `no-recent-activity`.
What does the consumption side of this look like?
@davidfowl We are going to use this in a `JsonConverter`, where we get an instance of `Utf8JsonWriter`. We want to have a custom JSON writer for our response object. Reference: https://docs.microsoft.com/en-us/dotnet/standard/serialization/system-text-json-converters-how-to?pivots=dotnet-5-0#steps-to-follow-the-basic-pattern
You don't care about getting a stream back there, right?
Here is what I think we would want to do on the consumption side:
In step 1, we don't necessarily need a `Stream` back; any object we can write binary data to should work. For step 2, since we would be sending binary chunks, `Utf8JsonWriter` would need to handle chunked Base64 encoding.
If we do add such an API, we should make sure that the built-in `StringConverter` uses it once it detects that the string size exceeds some threshold. In other words, we should ensure that the converter is resumable and supports streaming serialization. Given that strings are common, we should be careful that this doesn't regress performance in the 99.99% case where strings are small.
public partial class Utf8JsonWriter
{
    void WriteStringValue(ReadOnlySpan<char> value);
    void WriteStringValue(ReadOnlySpan<byte> value);
    void WriteBase64String(ReadOnlySpan<byte> value);

+   void WriteStringValueSegment(ReadOnlySpan<char> value, bool isFinalSegment);
+   void WriteStringValueSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
+   void WriteBase64StringSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
}
We can then update the built-in `StringConverter` to take advantage of the new API like so:
public class StringConverter : JsonResumableConverter<string>
{
    internal override bool OnTryWrite(
        Utf8JsonWriter writer,
        string value,
        JsonSerializerOptions options,
        ref WriteStack state)
    {
        int maxChunkSize = options.MaxChunkSize; // Some constant or value deriving from JsonSerializerOptions.DefaultBufferSize

        if (value is null || value.Length <= maxChunkSize)
        {
            Write(writer, value, options); // Usual Write routine
            return true;
        }

        // Fall back to chunked writes
        int charsWritten = state.CharsWritten;
        ReadOnlySpan<char> remaining = value.AsSpan(charsWritten);

        while (remaining.Length > maxChunkSize)
        {
            ReadOnlySpan<char> chunk = remaining.Slice(0, maxChunkSize);
            writer.WriteStringValueSegment(chunk, isFinalSegment: false);

            remaining = remaining.Slice(maxChunkSize);
            charsWritten += maxChunkSize;

            if (ShouldFlush(writer, ref state))
            {
                // Commit partial write info to the state object
                state.CharsWritten = charsWritten;
                return false;
            }
        }

        writer.WriteStringValueSegment(remaining, isFinalSegment: true);
        return true;
    }
}
It seems possible we could invent a dual concept for reading large strings as well, but that would require augmenting `Utf8JsonReader` with partially loaded string tokens. I'm not sure how feasible that would be.
Checking that the segments form valid UTF-8/Base64 strings should be done by `Utf8JsonWriter`, which would require tracking state for partial character encodings/surrogate pairs and Base64 output padding.
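To illustrate why the writer needs to carry state across segments: Base64-encoding chunks independently produces padded fragments that do not concatenate into a valid Base64 string (and similarly, a surrogate pair or multi-byte UTF-8 character can straddle a segment boundary). A small illustration using only the standard `Convert` APIs:

byte[] data = { 0x00, 0x00, 0x00, 0x01 };

// Encoding the whole payload at once:
Console.WriteLine(Convert.ToBase64String(data));        // AAAAAQ==

// Naively encoding two 2-byte chunks and concatenating the results:
string naive = Convert.ToBase64String(data, 0, 2) + Convert.ToBase64String(data, 2, 2);
Console.WriteLine(naive);                                // AAA=AAE= -- not a valid Base64 string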
To elaborate a bit more on the above, here's what a chunking API might look like on `Utf8JsonReader`:
public enum JsonTokenType
{
    String,
+   StringSegment,
}

public ref struct Utf8JsonReader
{
    public int CopyString(Span<byte> destination);
+   public int CopyStringSegment(Span<byte> destination, out bool isFinalSegment);
}
Which could be consumed by the converter as follows:
public class StringConverter : JsonResumableConverter<string>
{
    internal override bool OnTryRead(
        ref Utf8JsonReader reader,
        Type typeToConvert,
        JsonSerializerOptions options,
        ref ReadStack state,
        [MaybeNullWhen(false)] out string? value)
    {
        if (reader.TokenType is JsonTokenType.Null or JsonTokenType.String)
        {
            value = reader.GetString();
            return true;
        }

        if (reader.TokenType is not JsonTokenType.StringSegment)
        {
            throw new InvalidOperationException();
        }

        List<byte[]> chunks = state.Chunks ??= [];

        while (true)
        {
            Debug.Assert(reader.TokenType is JsonTokenType.StringSegment);

            byte[] chunk = new byte[reader.ValueSpan.Length]; // TODO buffer pooling
            reader.CopyStringSegment(chunk, out bool isFinalSegment);
            chunks.Add(chunk);

            if (isFinalSegment)
            {
                // Imaginary API decoding a string from UTF-8 fragments.
                value = Encoding.UTF8.GetStringFromChunks(chunks);
                return true;
            }

            if (!reader.Read())
            {
                value = null;
                return false;
            }
        }
    }
}
The problem is, I can't think of a way to include the new token type without breaking existing `Utf8JsonReader` consumers.
@eiriktsarpalis This wouldn't be usable with custom converters in its current form. It only works for types the JSON serializer knows about.
We have a similar issue here with a chain of custom converters and large amounts of data (0xHEX rather than Base64). One of the additional issues is that even doing an `await JsonSerializer.SerializeAsync` while inside the converters builds up the entire JSON tree for that property in memory rather than flushing to the underlying `Stream`.
The proposed APIs are necessary but not sufficient to support large string streaming in custom converters. The second missing ingredient is making resumable converters public, tracked by https://github.com/dotnet/runtime/issues/63795.
- `Utf8JsonReader` will fetch more buffers until the next token is contained in a single contiguous buffer, which wouldn't be efficient for large strings.
- We could add a new token type (`StringSegment`) but this would break consumers that aren't handling this token type, but we could make this opt-in on the reader.
- Exposing new reader APIs instead (`TryGetStringSegment` / `GetStringSegment()`) is preferable because it allows multiple parties to opt-in individually (e.g. some converter doesn't but continues to work, a new token type would make everyone throw).

namespace System.Text.Json;
public partial class Utf8JsonWriter
{
    // Existing
    // public void WriteStringValue(ReadOnlySpan<char> value);
    // public void WriteStringValue(ReadOnlySpan<byte> value);
    // public void WriteBase64String(ReadOnlySpan<byte> value);

    public void WriteStringValueSegment(ReadOnlySpan<char> value, bool isFinalSegment);
    public void WriteStringValueSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
    public void WriteBase64StringSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
}
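To make the intended usage concrete, here is a rough sketch (not from the thread) of how the approved `WriteBase64StringSegment` could stream a binary payload without buffering it. The helper name, the chunk size, and the use of an empty final segment are illustrative assumptions about the proposed behavior:

// Sketch only: assumes the approved WriteBase64StringSegment API behaves as proposed.
// Requires System.Buffers, System.IO, and System.Text.Json.
static void WriteLargeBase64(Utf8JsonWriter writer, Stream payload)
{
    byte[] buffer = ArrayPool<byte>.Shared.Rent(16 * 1024);
    try
    {
        int read;
        while ((read = payload.Read(buffer, 0, buffer.Length)) > 0)
        {
            // The writer is expected to carry partial Base64 state across segments.
            writer.WriteBase64StringSegment(buffer.AsSpan(0, read), isFinalSegment: false);
            writer.Flush(); // keep the writer's internal buffer bounded
        }

        // An empty final segment closes the JSON string and emits any remaining Base64 padding.
        writer.WriteBase64StringSegment(ReadOnlySpan<byte>.Empty, isFinalSegment: true);
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}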
I came across this issue when trying to create a `JsonConverter<Stream>`. It seems to me the simplest API would just be
public void WriteBase64String(Stream value);
No need to pass a flag as the Stream is self-describing. It makes for efficient CopyTo, chaining, etc., and is memory-efficient as long as all the participating Stream types are chunking.
@eiriktsarpalis @terrajobst We're facing similar issues with OData libraries for some customers who have large text or byte[] fields in their payloads. I like the `WriteStringValueSegment` API. We would want to be able to write large string/byte[] values in chunks and flush periodically to avoid resizing the buffer. We would also like to flush/write to the stream asynchronously to avoid synchronous I/O.
Back in the day, https://github.com/dotnet/runtime/issues/68223#issuecomment-1216955662 removed the `public void WriteRawValue(ReadOnlySequence<char> json, bool skipInputValidation = false);` overload due to the surrogate pair handling necessary for that. The new API here would require that handling anyway, and it would also be the building block for that method's implementation. It might be a good time to reintroduce it.
It would also tie in nicely with https://github.com/dotnet/runtime/issues/97570 for example
AI-related use case:
When receiving a response from an LLM in JSON format (e.g., with response format = `json_object` for OpenAI), it might represent something you want to process in a streaming way, e.g., if it's a chatbot answering a question and you want it to show up in real time in the UI. For example it might be returning a JSON object like:
{ "confidence": 1, "citations": [...], "answer_text": "A really long string goes here, so long that you want to see it arrive incrementally in the UI" }
Currently that's not really viable to do with System.Text.Json.
While we don't have an API design in mind, here are some possibilities:
- Expose some kind of reader over the response `Stream` and have some methods like `ReadStringChunkAsync` that gives you the next block from a string you're currently reading.
- Deserialize the target object incrementally from the `Stream`. However it's unclear how the developer could know which of the output object properties have been populated by the time the stream starts arriving, as the response object properties could appear in any order.

TBH we're still short on evidence that people really need to do this when working with an LLM, because:

- If you want a mix of structured data and streaming text, you can prompt the model to include special markers in the text (e.g., `<is_low_certainty>`).
- Even things like citations can be represented in a text response using a special syntax you tell the model to use, e.g., `<cite id="123">text</cite>`.
- Even if you could stream string values from `Utf8JsonReader`, it would result in extremely painful code since you couldn't use a regular deserialization API and would instead have to write some kind of state machine that interprets a single specific JSON format.

Here's one bit of evidence that people will want to parse large strings in streaming JSON: https://www.boundaryml.com/blog/nextjs-rag-streaming-example. Again, it's not the only way to do this, but suggests some people will want to.
If anyone reading this has clear examples of LLM-related cases where they find it desirable to process a JSON response in a streaming way, please let us know!
@eiriktsarpalis I'm interested in contributing to this; our codebase still relies on less-than-ideal workarounds. I see that there was another PR that attempted to address this and was closed. I'd like to get some context on why that PR was closed so I'm well aligned with expectations. I also wanted to confirm that the scope of the approved APIs only covers `Utf8JsonWriter` and not `JsonSerializer`, `JsonConverter`s, or `Utf8JsonReader`, is that correct? And can the different proposed methods for `Utf8JsonWriter` be implemented in separate PRs (for ease of review and workload management)?
@habbes can you point out the PR you're referring to? I couldn't immediately find it skimming through the issue history.
@eiriktsarpalis this one: https://github.com/dotnet/runtime/pull/101356
Seems like it was auto-closed due to lack of activity after receiving some comments and being marked as draft.
EDIT See https://github.com/dotnet/runtime/issues/67337#issuecomment-1812408212 for an API proposal.
Background and motivation
I have a requirement to write large binary content in JSON. In order to do this, I need to encode it to Base64 before writing. The resultant JSON looks like this:

I have a `PipeReader` from which I read bytes in a loop and keep appending to a list of bytes. I then convert the list into a byte array, convert that to a Base64 string, and use `WriteStringValue` to write it. The problem with this approach is excessive memory consumption: we need to keep the whole binary content in memory, convert it to Base64, and then write it. Memory consumption is critical when using `Utf8JsonWriter` in the override of the `JsonConverter.Write()` method in a web application. Instead, I am proposing a way to stream large binary content.
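For illustration, here is a rough sketch of the current approach described above; the method name is made up, but the buffering is exactly what drives the memory consumption:

// Illustrative sketch of the current approach: the entire payload (and its Base64
// expansion) is held in memory before any of it is written as JSON.
// Requires System.Buffers, System.Collections.Generic, System.IO.Pipelines,
// System.Text.Json, and System.Threading.Tasks.
static async Task WriteDataAsync(Utf8JsonWriter writer, PipeReader pipeReader)
{
    var bytes = new List<byte>();

    while (true)
    {
        ReadResult result = await pipeReader.ReadAsync();
        foreach (ReadOnlyMemory<byte> segment in result.Buffer)
        {
            bytes.AddRange(segment.ToArray());
        }
        pipeReader.AdvanceTo(result.Buffer.End);

        if (result.IsCompleted)
        {
            break;
        }
    }

    // Whole payload plus its Base64 expansion are in memory at this point.
    writer.WriteStringValue(Convert.ToBase64String(bytes.ToArray()));
}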
API Proposal
API Usage
Alternative Designs
No response
Risks
No response