mwadams closed this issue 1 month ago.
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis. See info in area-owners.md if you want to be subscribed.
Author: | mwadams
---|---
Assignees: | -
Labels: | `api-suggestion`, `area-System.Text.Json`
Milestone: | -
One usage observation: developers may wish to implement extension methods that hide the underlying parsers from their ultimate clients.
```csharp
public static bool TryGetCustomValue(this JsonElement element, out CustomValue value) {}
```
This would give these custom value converters the same signature for end users as the built-in APIs.
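For instance, a minimal sketch built over the proposed TryGetValue API (CustomValue and its span-based TryParse are hypothetical stand-ins here):

```csharp
using System;
using System.Text.Json;

public static class CustomValueJsonExtensions
{
    public static bool TryGetCustomValue(this JsonElement element, out CustomValue value)
    {
        // Projects straight from the decoded character span to the custom type,
        // with no intermediate string allocation.
        return element.TryGetValue(CustomValueParser, default(object?), out value);
    }

    private static bool CustomValueParser(ReadOnlySpan<char> input, object? state, out CustomValue result)
    {
        // Hypothetical span-based parser on the custom type.
        return CustomValue.TryParse(input, out result);
    }
}
```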
For a quantifiable benefit: if implemented, these APIs would enable me to eliminate allocations in three hot-paths in my JsonSchema validation.
In one of our benchmarks, we validate an array of 10,000 "Person" objects.
The Person has a FirstName, LastName, and a DateOfBirth.
The DateOfBirth is a JSON string in date format, which we project to a NodaTime local date type.
The name elements are constrained to a maxLength.
Validation causes 40,000 string allocations at a cost of ~3MB (which are otherwise unused because we write the values directly into the output stream - a very common use case).
Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated
---|---|---|---|---|---|---|---
FindWholeArrayLoopSchemaGenValidateDeserialize | 20.050 ms | 0.0598 ms | 0.0530 ms | 312.5000 | - | - | 2,792,162 B
With this API, that allocation reduces to zero.
As an example of the flexibility of this API, here's a projection that would allow us to do Rune counting of the decoded JSON string, avoiding a string allocation (another requirement for json-schema validation).
```csharp
public static class JsonElementExtensions
{
    public static int CountRunes(this JsonElement element)
    {
        if (element.TryGetValue(RuneCounter, default(object?), out int result))
        {
            return result;
        }

        throw new InvalidOperationException();
    }

    private static bool RuneCounter(ReadOnlySpan<char> input, object? state, out int result)
    {
        int runeCount = 0;
        SpanRuneEnumerator enumerator = input.EnumerateRunes();
        while (enumerator.MoveNext())
        {
            runeCount++;
        }

        result = runeCount;
        return true;
    }
}
```
In our actual implementation we would create a whole 'string validator' type that would also do e.g. pattern matching using the new .NET 7 Regex.IsMatch(ReadOnlySpan&lt;char&gt;). This is a highly simplified example that illustrates the benefits of building over this API.
```csharp
public static class JsonElementExtensions
{
    public static ValidationContext ValidateString(this JsonElement element, in ValidationContext validationContext)
    {
        if (element.TryGetValue(StringValidator, validationContext, out ValidationContext result))
        {
            return result;
        }

        throw new InvalidOperationException();
    }

    private static readonly Regex SomePattern = new Regex("H(.*)o", RegexOptions.Compiled, TimeSpan.FromSeconds(3));

    private static bool StringValidator(ReadOnlySpan<char> input, in ValidationContext context, out ValidationContext result)
    {
        // Emitted if minLength or maxLength validation is required
        int runeCount = 0;
        SpanRuneEnumerator enumerator = input.EnumerateRunes();
        while (enumerator.MoveNext())
        {
            runeCount++;
        }

        result = context.WithResult(runeCount < 10);

        // Emitted if pattern match validation is required (.NET 7 span-based Regex)
        result = result.WithResult(SomePattern.IsMatch(input));
        return true;
    }
}
```
I think I'd like to emphasize the benefit of this API for "in-the-box" types.
By allowing maintainers to implement TryGetXXX() methods without having to update System.Text.Json.JsonElement, while still maintaining close-to-optimal performance, support for new types (like the Date- or Time-onlies) could be provided out-of-band with framework releases, and without polluting the (otherwise ever-growing) native JsonElement API; a sketch of such an extension follows below.
If there were then a strong reason to roll those into the JsonElement API itself, that could be done in the next major version, with the extension method deprecated and reimplemented over the new API for backwards compatibility.
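As a minimal sketch (not part of the proposal itself), assuming the proposed TryGetValue API and a JSON value in the ISO 8601 "yyyy-MM-dd" date format:

```csharp
using System;
using System.Globalization;
using System.Text.Json;

public static class JsonElementDateExtensions
{
    public static bool TryGetDateOnly(this JsonElement element, out DateOnly value)
    {
        return element.TryGetValue(DateOnlyParser, default(object?), out value);
    }

    private static bool DateOnlyParser(ReadOnlySpan<char> input, object? state, out DateOnly result)
    {
        // Parses directly from the decoded character span; no intermediate string allocation.
        return DateOnly.TryParseExact(input, "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.None, out result);
    }
}
```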
I thought it might be useful to see the practical benefit of this.
We have an internal implementation of this API, along these lines; it offers a good tradeoff between allocation (approximately zero in steady state) and performance.
```csharp
/// <summary>
/// Process raw JSON text.
/// </summary>
/// <typeparam name="TState">The type of the state for the processor.</typeparam>
/// <typeparam name="TResult">The type of the result of processing.</typeparam>
/// <param name="element">The JSON element to process.</param>
/// <param name="state">The state passed to the processor.</param>
/// <param name="callback">The processing callback.</param>
/// <param name="result">The result of processing.</param>
/// <returns><c>true</c> if the processing succeeded, otherwise <c>false</c>.</returns>
public static bool ProcessRawText<TState, TResult>(
    this JsonElement element,
    in TState state,
    in Utf8Parser<TState, TResult> callback,
    [NotNullWhen(true)] out TResult? result)
{
    if (UseReflection)
    {
        return callback(GetRawValue(element).Span, state, out result);
    }
    else
    {
        PooledWriter? writerPair = null;
        try
        {
            writerPair = WriterPool.Get();
            (Utf8JsonWriter w, ArrayPoolBufferWriter<byte> writer) = writerPair.Get();
            element.WriteTo(w);
            w.Flush();

            // Trim the leading and trailing quotes from the written JSON string value.
            return callback(writer.WrittenSpan[1..^1], state, out result);
        }
        finally
        {
            if (writerPair is not null)
            {
                WriterPool.Return(writerPair);
            }
        }
    }
}
```
I have added the "UseReflection" path for benchmarking. If we choose that, it emits some code to get the underlying memory from the JsonDocument, as if this was part of the internals of the TryGetValue()
method proposed here.
```csharp
private static GetRawValueDelegate BuildGetRawValue()
{
    Type returnType = typeof(ReadOnlyMemory<byte>);
    Type[] parameterTypes = new[] { typeof(JsonElement) };
    Type jsonElementType = typeof(JsonElement);
    Type jsonDocumentType = typeof(JsonDocument);

    FieldInfo parentField = jsonElementType.GetField("_parent", BindingFlags.Instance | BindingFlags.NonPublic) ?? throw new InvalidOperationException("Unable to find JsonElement._parent field");
    FieldInfo idxField = jsonElementType.GetField("_idx", BindingFlags.Instance | BindingFlags.NonPublic) ?? throw new InvalidOperationException("Unable to find JsonElement._idx field");
    MethodInfo getRawValueMethod = jsonDocumentType.GetMethod("GetRawValue", BindingFlags.Instance | BindingFlags.NonPublic) ?? throw new InvalidOperationException("Unable to find JsonDocument.GetRawValue method");

    var getRawText = new DynamicMethod("GetRawValue", returnType, parameterTypes, typeof(LowAllocJsonUtils).Module, true);
    ILGenerator il = getRawText.GetILGenerator(256);
    LocalBuilder idx = il.DeclareLocal(typeof(int));

    il.Emit(OpCodes.Ldarg_0);            // Load the JsonElement argument
    il.Emit(OpCodes.Ldfld, idxField);    // Read its _idx field
    il.Emit(OpCodes.Stloc, idx);         // Store the index in a local variable
    il.Emit(OpCodes.Ldarg_0);            // Load the JsonElement argument again
    il.Emit(OpCodes.Ldfld, parentField); // Put the parent JsonDocument on the stack
    il.Emit(OpCodes.Ldloc, idx);         // Put the index back on the stack from the local variable
    il.Emit(OpCodes.Ldc_I4_0);           // Put "false" onto the stack
    il.Emit(OpCodes.Callvirt, getRawValueMethod);
    il.Emit(OpCodes.Ret);

    return getRawText.CreateDelegate<GetRawValueDelegate>();
}
```
As you can see, it simply gets the underlying raw value from the parent JsonDocument.
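For reference, the delegate shape that the emitted method is bound to, inferred from the returnType and parameterTypes set up in BuildGetRawValue above, is along these lines:

```csharp
// Matches the DynamicMethod above: takes the JsonElement and returns the raw UTF8 bytes
// of its value as ReadOnlyMemory<byte>.
private delegate ReadOnlyMemory<byte> GetRawValueDelegate(JsonElement element);
```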
The benchmarks show a 20-30% benefit.
(As a reminder, this is an underlying means of accessing the raw text; we optionally decode and/or transcode to ReadOnlySpan&lt;char&gt; further up the implementation. These benchmarks include that decoding and transcoding - they are performing json-schema document validation.)
Those benchmarks were on net7.0. It's well worth comparing with net8.0 (this is the last preview, not RC1). We get huge perf improvements from the new runtime - I'm guessing a lot of that is vectorization improvements - and I also notice that the tiny bit of allocation we need to do in the "everything is fine" case has got smaller, too!
That's pretty impressive for "no work" - and indeed it is as much benefit as this proposed change would produce... except, as you can see, we get it again even on net8.0!
I should say that the Corvus.JsonSchema library is ours. JsonEverything is @gregsdennis' excellent cornucopia of JSON goodness, also built over System.Text.Json.
With Corvus we are trying to make JsonSchema validation a "just do it and you don't notice" option for API implementers, with the added bonus of an object model over the document "for free", rather than having to hand-build a separate set of POCOs for serialisation.
A validation pass for a "normal" sized document is a couple of microseconds and zero allocations.
For a 1.5MB document it is a few tens of milliseconds and a few tens of bytes of allocation.
We are a bit obsessive about shaving as much as we can off that!
Blog post about net8 performance wins and potential impact of this change.
https://endjin.com/blog/2023/12/how-dotnet-8-boosted-json-schema-performance-by-20-percent-for-free
As we now have JsonMarshal, I believe that this issue can be closed.
I'm just preparing the next preview build of Corvus.JsonSchema, which has support for .NET 9 (and uses the new JsonMarshal type to optimize access to the underlying UTF8 JSON bytes).
Thought you might be interested in the benchmarks. The "Fast" benchmarks exercise a codepath using JsonMarshal versus the old path with the workaround discussed in this issue.
Running on .NET 8 with the .NET 9 System.Text.Json library and JsonMarshal, it is 13% faster.
Running on .NET 9 RC2 nightly (from today) it is a whopping 32% faster.
[As usual, those very weird 12 B and 17 B allocations are a BenchmarkDotNet quirk - it doesn't actually allocate at all.]
This is on Windows 11 with a 13th Gen Intel(R) Core(TM) i7-13800H @ 2.90 GHz
| Method | Toolchain | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
|------------------------------- |--------------------- |----------:|----------:|----------:|------:|----------:|------------:|
| ValidateLargeArrayCorvusV4 | .NET 8.0 | 13.035 ms | 0.0272 ms | 0.0241 ms | 1.00 | 12 B | 1.00 |
| ValidateLargeArrayCorvusV4Fast | .NET 8.0 | 11.311 ms | 0.0536 ms | 0.0448 ms | 0.87 | 12 B | 1.00 |
| ValidateLargeArrayCorvusV4 | .NET 9.0 RC2 Nightly | 10.436 ms | 0.0226 ms | 0.0188 ms | 0.80 | 17 B | 1.42 |
| ValidateLargeArrayCorvusV4Fast | .NET 9.0 RC2 Nightly | 8.813 ms | 0.0526 ms | 0.0492 ms | 0.68 | 12 B | 1.00 |
For those interested, that's validating an array of 10,000 instances of this entity:
```json
{
  "name": {
    "familyName": "Oldroyd",
    "givenName": "Michael",
    "otherNames": [],
    "email": "michael.oldryoyd@contoso.com"
  },
  "dateOfBirth": "1944-07-14",
  "netWorth": 1234567890.1234567891,
  "height": 1.8
}
```
Against this schema (including full format validation for e.g. dates, email, and numerics):
```json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "JSON Schema for a Person entity coming back from a 3rd party API (e.g. a storage format in a database)",
"type": "object",
"$ref": "#/$defs/Person",
"$defs": {
"PersonArray": {
"type": "array",
"items": {
"$ref": "#/$defs/Person"
}
},
"Person": {
"type": "object",
"required": [ "name" ],
"properties": {
"name": { "$ref": "#/$defs/PersonName" },
"dateOfBirth": {
"type": [ "string", "null" ],
"format": "date"
},
"email": {
"type": "string",
"format": "email"
},
"netWorth": {
"type": "number",
"format": "decimal"
},
"height": {
"$ref": "#/%24defs/HeightRangeDouble"
}
}
},
"HeightRangeDouble": {
"type": "number",
"minimum": 0,
"maximum": 3.0
},
"PersonName": {
"type": "object",
"description": "A name of a person.",
"required": [ "familyName" ],
"properties": {
"givenName": {
"$ref": "#/$defs/PersonNameElement",
"description": "The person's given name."
},
"familyName": {
"$ref": "#/$defs/PersonNameElement",
"description": "The person's family name."
},
"otherNames": {
"$ref": "#/$defs/OtherNames",
"description": "Other (middle) names for the person"
}
}
},
"OtherNames": {
"oneOf": [
{ "$ref": "#/$defs/PersonNameElement" },
{ "$ref": "#/$defs/PersonNameElementArray" }
]
},
"PersonNameElementArray": {
"type": "array",
"items": {
"$ref": "#/$defs/PersonNameElement"
}
},
"PersonNameElement": {
"type": "string",
"minLength": 1,
"maxLength": 256
}
}
}
```
Having said that, JsonProperty remains a conundrum.
Because its Name is accessed either via the DbRow preceding that of the Value, or via its own string backing, we have an anomaly: it may or may not require transcoding, which is, I presume, why we don't have a comparable raw accessor for the name.
Internally, most uses of the property name assume that it is safe to treat it as a JsonElement-backed name (e.g. in DeepEquals()), so it would still be useful to have a simple JsonMarshal method like:
```csharp
/// <summary>
/// Gets a <see cref="ReadOnlySpan{T}"/> view over the raw JSON data of the given <see cref="JsonProperty"/>.
/// </summary>
/// <param name="property">The JSON property from which to extract the span.</param>
/// <returns>The span containing the raw JSON data of <paramref name="property"/>.</returns>
/// <exception cref="ObjectDisposedException">The underlying <see cref="JsonDocument"/> has been disposed.</exception>
/// <remarks>
/// <para>
/// While the method itself does check for disposal of the underlying <see cref="JsonDocument"/>,
/// it is possible that it could be disposed after the method returns, which would result in
/// the span pointing to a buffer that has been returned to the shared pool. Callers should take
/// extra care to make sure that such a scenario isn't possible, to avoid potential data corruption.
/// </para>
/// </remarks>
public static ReadOnlySpan<byte> GetRawUtf8PropertyName(JsonProperty property)
{
    return property.NameSpan;
}
```
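A hypothetical usage sketch, assuming the method above existed on JsonMarshal: it compares raw UTF8 property names against a UTF8 literal without transcoding (and assumes the names contain no escape sequences).

```csharp
using System;
using System.Text.Json;

static bool HasFamilyName(JsonElement element)
{
    foreach (JsonProperty property in element.EnumerateObject())
    {
        // Raw UTF8 view of the property name; no transcoding to string or ReadOnlySpan<char>.
        ReadOnlySpan<byte> nameUtf8 = JsonMarshal.GetRawUtf8PropertyName(property);
        if (nameUtf8.SequenceEqual("familyName"u8))
        {
            return true;
        }
    }

    return false;
}
```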
I have opened #107782 to contain this suggestion, and will close this issue.
Background and motivation
There have been a number of discussions about possible APIs for efficient access to the underlying UTF8 bytes for JSON string values represented by a JsonElement. So far, there has been no consensus on a suitable approach.
In general, the criticism of previous proposals has been either that they leak information about the underlying backing for the JSON string value, or that they depend on a type that does not yet exist (e.g. Utf8String).
This API proposal focuses instead on a specific use case: projecting from a JSON string value to a user-defined type, without an intermediate string allocation.
The basic shape of the API should be familiar.
JsonElement already has a number of such methods - projecting to DateTime, DateTimeOffset and Guid, for example, with the GetXXX() and TryGetXXX() APIs.
These proposed APIs offer developers the ability to project to user-defined types, without requiring an intermediate conversion to string. They will handle decoding and transcoding (e.g. to a decoded ReadOnlySpan&lt;byte&gt; or ReadOnlySpan&lt;char&gt;), managing the memory on behalf of the developer. Importantly, they do not leak any information about the underlying backing for the JSON string value.
Common use cases might include projecting JSON date/time strings to types like NodaTime's LocalDate, or validation scenarios such as rune counting and pattern matching without an intermediate string allocation.
They use a similar delegate-and-state pattern to String.Create() to encourage efficient usage, and to avoid accidental allocation of closure-capturing objects.

API Proposal
A rough implementation of the additional internal methods on JsonDocument can be found here:
https://github.com/mwadams/runtime/blob/55372e6b8e93a123513cfd6e6415cfd80c0630a4/src/libraries/System.Text.Json/src/System/Text/Json/Document/JsonDocument.cs#L660
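As a rough sketch of the shape implied by the usage examples earlier in this thread (the delegate name here is illustrative rather than a final proposal):

```csharp
using System;

// Illustrative only: a parser delegate over the decoded character span, plus caller state,
// producing a result. The examples in this thread call it with a method group, a state
// value (often default(object?)), and an out result.
public delegate bool JsonStringParser<TState, TResult>(ReadOnlySpan<char> input, TState state, out TResult result);

// On JsonElement, the corresponding method shape would be roughly:
//     public bool TryGetValue<TState, TResult>(JsonStringParser<TState, TResult> parser, TState state, out TResult result);
```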
API Usage
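A minimal usage sketch, using the CountRunes extension method shown earlier in this thread (itself built over the proposed TryGetValue):

```csharp
using System.Text.Json;

using JsonDocument document = JsonDocument.Parse("{ \"greeting\": \"Hello, world\" }");
JsonElement greeting = document.RootElement.GetProperty("greeting");

// Counts the runes of the decoded JSON string value without allocating an intermediate string.
int runeCount = greeting.CountRunes();
```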
Alternative Designs
- Providing an overload that does not require a TState value (e.g. defaulting to a null value for an object?); this would diverge from the pattern of string.Create(), although this is perhaps a minor consideration.
- Allowing the TState to be an in parameter, which would minimise copying in the case that your state was a readonly struct. Personally, I would favour this variation.

Risks
This should have no impact on the existing API surface area.
It should be possible to implement largely using the existing methods in JsonDocument and JsonReaderHelper.
There are comments in JsonDocument suggesting that there is a plan to move to common UTF8 encoding/decoding code; this would increase the surface area of methods in JsonDocument that would need to switch to using those common APIs if/when that change occurs.