Closed SteveSandersonMS closed 2 years ago
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.
Author: SteveSandersonMS
Assignees: -
Labels: `area-System.Text.Json`
Milestone: -
Thanks @SteveSandersonMS, your analysis is accurate. At the time the cache was implemented, we debated whether a background timer should run evictions, but ultimately found that approach unsavory. In hindsight, I had not anticipated that user configuration in JsonSerializerOptions would realistically capture a large amount of memory, but I guess I should have known better :-)
That being said, it is almost certain that the application in question is creating single-use JsonSerializerOptions instances, a known anti-pattern that would have caused performance problems even in .NET 6. The author of the original issue could almost certainly address the problem by creating singleton options instances. We're planning on adding an analyzer in the future which should help users avoid that particular anti-pattern.
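To illustrate the anti-pattern being described, here is a minimal, hypothetical sketch (not code from the thread): constructing a fresh JsonSerializerOptions per call versus reusing one cached instance. The `JsonHelper` name and methods are illustrative only.

```csharp
using System.Text.Json;

public static class JsonHelper
{
    // Anti-pattern: a new options instance per call means the serializer's
    // per-options metadata has to be re-resolved, and (in .NET 7) a new
    // entry may be added to the global caching-context cache.
    public static string SerializeSlow<T>(T value) =>
        JsonSerializer.Serialize(value, new JsonSerializerOptions { WriteIndented = true });

    // Recommended: a single, effectively-singleton instance reused everywhere,
    // so metadata is resolved once and cached against this one instance.
    private static readonly JsonSerializerOptions s_options =
        new JsonSerializerOptions { WriteIndented = true };

    public static string SerializeFast<T>(T value) =>
        JsonSerializer.Serialize(value, s_options);
}
```

Both methods produce identical output; only the allocation and caching behavior differ.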
Also, if this caching behavior was introduced in 7.0, then it could be argued that it's a regression. (Subjective, I know.)
I think we could definitely improve the cache based on this feedback but I don't necessarily think this requires a .NET 7 fix. I think we should wait on more user feedback in case it arrives.
Is the actual problem the shortcut described in https://github.com/dotnet/runtime/blob/main/src/libraries/System.Text.Json/src/System/Text/Json/Serialization/JsonSerializerOptions.Caching.cs#L165-L167? Should this code rather use ConditionalWeakTable to avoid unnecessarily keeping the keys around?
Original prototypes did use ConditionalWeakTable, effectively performing a linear search over all JsonSerializerOptions instances in the process:
But IIRC enumeration performance for the particular type was bad enough that it almost invalidated any performance benefits gained from caching.
That being said, it is almost certain that the application in question is creating single-use JsonSerializerOptions instances, a known anti-pattern that would have caused performance problems even in .NET 6 (https://github.com/dotnet/runtime/issues/40072). The author of the original issue could almost certainly address the problem by creating singleton options instances.
I think the original author was using an unmodified (except for the ability to manually trigger GC) Blazor template to demonstrate the problem.
That being said, it is almost certain that the application in question is creating single-use JsonSerializerOptions instances, a known anti-pattern that would have caused performance problems even in .NET 6. The author of the original issue could almost certainly address the problem by creating singleton options instances. We're planning on adding an analyzer in the future which should help users avoid that particular anti-pattern.
Hey @eiriktsarpalis Just in case you couldn't read the complete original issue (I'm the author) and just for clarification, the application in question which you are referring to is the Blazor framework itself, using the default Blazor Server template. So, if there are problems (like the anti-patterns you mentioned), they are in .NET 7 RC1 itself and its default templates, not user code.
Regards!
Regarding what @SteveSandersonMS mentioned about a timer, you can configure the retention period of circuits for Blazor, for example:
builder.Services.AddServerSideBlazor(options =>
{
    options.DisconnectedCircuitMaxRetained = 5;
    options.DisconnectedCircuitRetentionPeriod = TimeSpan.FromSeconds(10);
});
So maybe this should be tied somehow to the serializer cache clearing call; I guess it wouldn't make much sense to configure different amounts of time for each, since one depends on the other to be useful. I mean, the Blazor app could forcibly disconnect all the users and release their circuits, but if the instances are being held by the serializer cache, memory won't be released, which is the intention.
Regards!
Hi @chrdlx, that's interesting. If there are indeed keys in the cache that require eviction, that would suggest to me that Blazor is using temporary options instances, which should be avoided under most circumstances. I'm not too familiar with Blazor internals, however; is there anything in its configuration that could prompt single-use instances to be created? Perhaps some options instance only used for initialization purposes that then gets garbage collected? cc @SteveSandersonMS
Hi @eiriktsarpalis ! I guess Steve is the right one for this kind of stuff. I use Blazor to develop Apps but sadly I don't have that deep knowledge about its inner workings to give you a proper answer on how to solve this. Regards!
But IIRC enumeration performance for the ConditionalWeakTable was bad enough that it almost invalidated any performance benefits gained from caching.
For context, here are some benchmarks comparing the current caching strategy with one performing linear traversal on a ConditionalWeakTable:
public static CachingContext GetOrCreate(JsonSerializerOptions options)
{
    Debug.Assert(options.IsReadOnly, "Cannot create caching contexts for mutable JsonSerializerOptions instances");
    Debug.Assert(options._typeInfoResolver != null);

    OptionsEqualityComparer comparer = s_equalityComparer;

    foreach (KeyValuePair<JsonSerializerOptions, object?> entry in TrackedOptionsInstances.All)
    {
        JsonSerializerOptions otherOptions = entry.Key;
        if (otherOptions._cachingContext != null && comparer.Equals(options, otherOptions))
        {
            return otherOptions._cachingContext;
        }
    }

    return new CachingContext(options);
}
| Method | Job | Strategy | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3%) | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NewCustomizedOptions | Job-IEKUUC | main | 1.128 us | 0.0618 us | 0.0712 us | 1.117 us | 1.025 us | 1.315 us | 1.00 | Base | 0.00 | 0.0449 | 0.0045 | 0.0045 | 486 B | 1.00 |
| NewCustomizedOptions | Job-ODPMMK | ConditionalWeakTable | 39.336 us | 0.7793 us | 0.8974 us | 39.420 us | 37.909 us | 40.930 us | 35.01 | Slower | 2.35 | - | - | - | 544 B | 1.12 |
| NewCustomizedOptions | Job-TIQXHW | No Caching | 19.421 us | 0.7321 us | 0.8431 us | 19.189 us | 18.3138 us | 21.025 us | 18.06 | Slower | 1.22 | 0.7392 | 0.0821 | - | 7966 B | 16.39 |
that would suggest to me that Blazor is using temporary options instances, which should be avoided under most circumstances
@eiriktsarpalis Blazor has a JsonSerializerOptions instance per circuit (user connection), and has done for over 3 years, at least since this commit. It does this because the serialized representation of certain types (DotNetObjectReference, in this case) is a per-circuit ID, and the act of serializing it for a JS interop call is what causes the instance to become tracked within the circuit. So within the per-circuit options, there is a JsonConverterFactory that holds a reference to the circuit.
I don't think we were aware that, at a certain point, it became expensive to create JsonSerializerOptions instances (or was this always the case?).
If you have a recommendation for a different way that a JsonConverter can access contextual objects specific to a particular serialization process, please let us know.
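The per-circuit pattern described above can be sketched roughly as follows. This is a hedged illustration, not Blazor's actual code: `CircuitContext`, `ObjectReference`, and the converter names are hypothetical stand-ins for the circuit state and DotNetObjectReference machinery.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical per-connection state; stands in for a Blazor circuit.
public sealed class CircuitContext
{
    private int _nextId;
    public int Track(object target) => ++_nextId; // assign a per-context ID
}

// Hypothetical stand-in for DotNetObjectReference.
public sealed class ObjectReference
{
    public ObjectReference(object target) => Target = target;
    public object Target { get; }
}

public sealed class ObjectReferenceConverterFactory : JsonConverterFactory
{
    // This reference is the crux of the retention issue: the factory lives in
    // a per-circuit JsonSerializerOptions and roots the circuit state.
    private readonly CircuitContext _context;

    public ObjectReferenceConverterFactory(CircuitContext context) => _context = context;

    public override bool CanConvert(Type typeToConvert) =>
        typeToConvert == typeof(ObjectReference);

    public override JsonConverter CreateConverter(Type typeToConvert, JsonSerializerOptions options) =>
        new Converter(_context);

    private sealed class Converter : JsonConverter<ObjectReference>
    {
        private readonly CircuitContext _context;
        public Converter(CircuitContext context) => _context = context;

        public override ObjectReference Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options) =>
            throw new NotSupportedException();

        // Serializing is what causes the instance to become tracked in the
        // context; the JSON payload is just the per-context ID.
        public override void Write(Utf8JsonWriter writer, ObjectReference value, JsonSerializerOptions options) =>
            writer.WriteNumberValue(_context.Track(value.Target));
    }
}
```

Because the factory must be registered on a JsonSerializerOptions, each circuit ends up needing its own options instance, which is what collides with the 7.0 options-keyed cache.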
I don't think we were aware that, at a certain point, it became expensive to create JsonSerializerOptions instances (or was this always the case?).
Stephen Toub's 7.0 performance behemoth had some stuff about this:
https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/#json
It wouldn't have been intuitive to me that an "Options" class would be massively expensive.
Obviously that article contains the somewhat optimistic promise "... with appropriate removal when no longer used in order to avoid unbounded leaks" which overlooks the old saw about cache eviction being the hardest problem in computer science.
I don't think we were aware that, at a certain point, it became expensive to create JsonSerializerOptions instances (or was this always the case?).
It wouldn't have been intuitive to me that an "Options" class would be massively expensive.
It's been like this since the introduction of the type -- it's essentially the same issue as HttpClient, but more insidious since the name "options" doesn't suggest that the type defines its own reflection metadata caching context.
It's been like this since the introduction of the type
What do you recommend as a solution? It does seem like the 7.0 caching technique change has led to an issue for non-server scenarios. Is there (or should there be) some way to signal that a particular JsonSerializerOptions instance can be cleared up? It's not disposable. Should it become disposable?
To be honest, I don't think we'd see this as being urgent enough to fix for 7.0 within Blazor since we only have evidence that it affects one person, and they already have a workaround. What we could do in Blazor in 8.0 though is add some logic to our circuit disposal code that makes it null out the reference from the JsonConverterFactory to the circuit, so the circuits at least can be collected even if there are some JsonSerializerOptions/JsonConverterFactory instances left stuck in the cache until more calls happen.
Since this could affect other (non-Blazor) desktop applications that use STJ, it might be worth having a more general solution such as making JsonSerializerOptions disposable, or clearing the cache on a timer, or going back to ConditionalWeakTable. If you think any of those are a possibility please let us know so we could make use of that in 8.0.
I'm curious, what prompted requiring separate JsonSerializerOptions instances per connection? Presumably it's related to some form of dependency injection on custom converters? Would it be possible to have your converter instances remain fixed per-type application-wide and inject connection-specific state differently?
I know that System.Text.Json doesn't provide a good mechanism for scoping DI on a per-operation basis, here's a user story that attempts to cover this and related requirements: #63795.
I'm curious, what prompted requiring separate JsonSerializerOptions instances per connection?
This is what I mentioned above:
It does this because the serialized representation of certain types (DotNetObjectReference, in this case) is a per-circuit ID, and the act of serializing it for a JS interop call is what causes the instance to become tracked within the circuit. So within the per-circuit options, there is a JsonConverterFactory that holds a reference to the circuit.
Apologies if it's still unclear, but if you can point out which part is unclear, I'll try to explain more!
Would it be possible to have your converter instances remain fixed per-type application-wide and inject connection-specific state differently?
That would be great, but I don't know of a way 😄 It's what I was asking with "If you have a recommendation for a different way that a JsonConverter can access contextual objects specific to a particular serialization process, please let us know." #63795 sounds interesting, but async conversion isn't an aspect of it for us; we just have to be able to perform a JsonSerializer.Serialize call within some context that allows the converters to access objects within that context (which we achieve today by having a separate JsonConverterFactory for each context).
If you have a recommendation for a different way that a JsonConverter can access contextual objects specific to a particular seralization process, please let us know.
For the case of serialization, state could be encapsulated by the serialized value. I'm not aware of a good way to achieve the same effect in deserialization, unfortunately.
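The suggestion to encapsulate state in the serialized value could look something like the following hedged sketch (all names hypothetical, not from the thread): the payload carries its own context, so a stateless converter can be registered on one application-wide options instance.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical per-connection state.
public sealed class SessionState
{
    private int _nextId;
    public int Track() => ++_nextId;
}

// The value being serialized carries the context it needs.
public sealed class TrackedValue
{
    public TrackedValue(SessionState session) => Session = session;
    public SessionState Session { get; }
}

public sealed class TrackedValueConverter : JsonConverter<TrackedValue>
{
    // The converter holds no state, so one instance can be registered on a
    // single application-wide JsonSerializerOptions; context comes from the value.
    public override TrackedValue Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options) =>
        throw new NotSupportedException("On deserialization there is no value to carry the context.");

    public override void Write(Utf8JsonWriter writer, TrackedValue value, JsonSerializerOptions options) =>
        writer.WriteNumberValue(value.Session.Track());
}
```

As the comment notes, this only helps serialization: during deserialization there is no incoming value to carry the context, which is exactly the gap being acknowledged.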
the somewhat optimistic promise "... with appropriate removal when no longer used in order to avoid unbounded leaks"
It does avoid unbounded leaks... unbounded being the key word.
Hi. This behaviour in Blazor Server is really something one would not expect. You can easily ramp up memory usage to several gigs, and god knows when it gets evicted :) For one, it's hard to reason about: is there a memory leak in the app? And what will happen with smaller apps, where you could have many running on one server, each easily consuming some gigs of RAM...
I guess this behaviour cannot be avoided in 7.0 release?
I guess this behaviour cannot be avoided in 7.0 release?
We might consider backporting https://github.com/dotnet/runtime/pull/76607 in servicing, assuming there is substantial impact in blazor apps.
First, it's totally possible I'm misinterpreting what STJ does here. We got this report saying that a Blazor Server app was holding hundreds of MB in memory for longer than expected, and using the VS memory inspection tools, it looks like at least one (possibly the only) thing pinning this data is rooted in JsonSerializerOptions's `s_cache`. The cache uses `WeakReference` for values, but the problem is with the keys. It looks like the only thing that clears dangling entries is subsequent calls into the JSON serializer. This is likely sufficient in server scenarios where the server will never stop processing requests for any long duration, and even if it does, people probably don't care about freeing any memory until the server experiences memory pressure. However, for desktop scenarios, an app may stop doing things for a long time, and during that period, people really want it to release memory that's no longer required.
Suggestion
Instead of triggering `TryEvictDanglingEntries` only when there's a call to `GetCachingContext`, consider also having some guarantee like "it will always clear dangling entries within 10s even if there are no such incoming calls" or similar. This isn't blocking anything in Blazor directly, but would be a good solution for people using STJ directly in desktop apps, or Blazor Server as an implementation detail of a desktop app (like this person).