[Issue] FusionCache allocates on reads and allocates a lot on writes

neon-sunset commented 9 months ago

Describe the issue See numbers below. There are other issues like cache line pollution on reads through interlocked applied to a singular location which will ruin the scaling on many-core systems but I wanted to start with bringing attention to the implementation flaws in the first place.

In the outlined scenario of in-memory cache, FusionCache comes last and allocates significantly more than all other options. While FastCache is very focused at doing a singular task (in-memory cache with per-item expiration and eviction) that allows it to be as fast as is technically possible, MemoryCache and LazyCache have much more features and still allocate much less.

Unrelated but somewhat amusing is that OpenJDK improves by 10% at most between 17 and 21, and .NET sees 20-40% speed-up after just a single LTS version bump.

Benchmark code

```csharp using System; using System.Diagnostics.CodeAnalysis; using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Jobs; using FastCache; using LazyCache; using Microsoft.Extensions.Caching.Memory; namespace ZiggyCreatures.Caching.Fusion.Benchmarks { [RankColumn] [MemoryDiagnoser] [ShortRunJob(RuntimeMoniker.Net60)] [ShortRunJob(RuntimeMoniker.Net80)] [Orderer(BenchmarkDotNet.Order.SummaryOrderPolicy.FastestToSlowest)] public class HappyPathBenchmark { const string Key = "test key"; const string Value = "test value"; readonly FusionCache Cache = new(new FusionCacheOptions()); readonly MemoryCache MemoryCache = new(new MemoryCacheOptions()); readonly CachingService LazyCache = new(); [GlobalSetup] public void Setup() { Cache.Set(Key, Value); MemoryCache.Set(Key, Value); LazyCache.Add(Key, Value); Cached.Save(Key, Value, TimeSpan.FromDays(1)); } public class Reads : HappyPathBenchmark { [Benchmark(Baseline = true)] public string GetFusionCache() { var result = Cache.TryGet(Key); return result.HasValue ? result.Value : Unreachable(); } [Benchmark] public string GetMemoryCache() { return MemoryCache.TryGetValue(Key, out var value) ? value : Unreachable(); } [Benchmark] public string GetLazyCache() { return LazyCache.TryGetValue(Key, out var value) ? value : Unreachable(); } [Benchmark] public string GetFastCache() { return Cached.TryGet(Key, out var value) ? value : Unreachable(); } } public class Writes : HappyPathBenchmark { [Benchmark(Baseline = true)] public void SetFusionCache() => Cache.Set(Key, Value); [Benchmark] public void SetMemoryCache() => MemoryCache.Set(Key, Value); [Benchmark] public void SetLazyCache() => LazyCache.Add(Key, Value); [Benchmark] public void SetFastCache() => Cached.Save(Key, Value, TimeSpan.FromDays(1)); } [DoesNotReturn] static string Unreachable() => throw new Exception("Unreachable code"); } } ```

Reads:

BenchmarkDotNet v0.13.10, Windows 11 (10.0.22631.2715/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100
  [Host]            : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  ShortRun-.NET 6.0 : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
  ShortRun-.NET 8.0 : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

IterationCount=3  LaunchCount=1  WarmupCount=3

Method	Job	Runtime	Mean	Error	Ratio	Rank	Gen0	Allocated	Alloc Ratio
GetFastCache	ShortRun-.NET 6.0	.NET 6.0	12.963 ns	0.3573 ns	0.10	1	-	-	0.00
GetMemoryCache	ShortRun-.NET 6.0	.NET 6.0	88.737 ns	4.0533 ns	0.71	2	-	-	0.00
GetLazyCache	ShortRun-.NET 6.0	.NET 6.0	104.442 ns	2.1849 ns	0.84	3	-	-	0.00
GetFusionCache	ShortRun-.NET 6.0	.NET 6.0	124.161 ns	8.5320 ns	1.00	4	0.0038	64 B	1.00

GetFastCache	ShortRun-.NET 8.0	.NET 8.0	9.770 ns	1.5593 ns	0.11	1	-	-	0.00
GetMemoryCache	ShortRun-.NET 8.0	.NET 8.0	58.887 ns	1.7557 ns	0.69	2	-	-	0.00
GetLazyCache	ShortRun-.NET 8.0	.NET 8.0	73.699 ns	2.8998 ns	0.86	3	-	-	0.00
GetFusionCache	ShortRun-.NET 8.0	.NET 8.0	85.310 ns	44.8033 ns	1.00	4	0.0038	64 B	1.00

Writes

Method	Job	Runtime	Mean	Error	Ratio	Rank	Gen0	Allocated	Alloc Ratio
SetFastCache	ShortRun-.NET 6.0	.NET 6.0	29.79 ns	1.440 ns	0.07	1	0.0024	40 B	0.04
SetMemoryCache	ShortRun-.NET 6.0	.NET 6.0	194.55 ns	19.804 ns	0.48	2	0.0248	416 B	0.40
SetLazyCache	ShortRun-.NET 6.0	.NET 6.0	309.68 ns	50.771 ns	0.76	3	0.0401	672 B	0.64
SetFusionCache	ShortRun-.NET 6.0	.NET 6.0	409.26 ns	37.500 ns	1.00	4	0.0625	1048 B	1.00

SetFastCache	ShortRun-.NET 8.0	.NET 8.0	23.16 ns	0.644 ns	0.07	1	0.0024	40 B	0.04
SetMemoryCache	ShortRun-.NET 8.0	.NET 8.0	186.80 ns	7.310 ns	0.53	2	0.0248	416 B	0.43
SetLazyCache	ShortRun-.NET 8.0	.NET 8.0	258.38 ns	25.020 ns	0.73	3	0.0353	592 B	0.61
SetFusionCache	ShortRun-.NET 8.0	.NET 8.0	352.62 ns	17.272 ns	1.00	4	0.0577	968 B	1.00

jodydonetti commented 9 months ago

Hi @neon-sunset and thanks for your report, I appreciate the time it took you.

As you have pointed out other libs do in fact more than FastCache, and FusionCache in particular I think supports a lot of features and options, so it's something kinda expected, at least up to a certain point.

Having said that it's important to make sure the perf impact is as small as possible: in the last months I've been quite busy finishing the new Auto-Recovery feature which took a lot of time, and it may be that I haven't paid the needed attention to some extra allocations or cpu cycle spent here and there.

I'll try to make sure that the resources needed are as low as possible, and will update you on this later on as soon as I have updates: keep in mind though that to support all the extra features that FusionCache provides (2nd level, fail-safe, auto-recovery, etc), some extra resources are to be accounted for.

Meanwhile thanks again for your report!

neon-sunset commented 9 months ago

I'm mostly worried about avoidable cost caused by some of the internal design decisions that are not necessary to achieve the desired functionality. This library is fairly popular so it is important to ensure that it does not suffer from common cache implementation issues.

In particular, Microsoft libraries like MemoryCache and FASTER do support many features - the former fully integrates with logging/tracing/ETW and supports flexible eviction behavior and callbacks, can be integrated with second-level distributed cache and much more without allocating nearly as much, and that despite its own design being also considered flawed/outdated for certain scenarios and memory footprint being higher than it could have been. See https://github.com/dotnet/extensions/issues/4766

One of the key concepts promoted in .NET in the last few years is "pay for play" policy - users should only pay for what they use whenever possible.

Thank you for your work.

jodydonetti commented 9 months ago

I'm mostly worried about avoidable cost caused by some of the internal design decisions that are not necessary to achieve the desired functionality. This library is fairly popular so it is important to ensure that it does not suffer from common cache implementation issues.

I agree, and I'll try to see if I can improve it.

See dotnet/extensions#4766

eh, I know about this new discussion: I've been part of the old discussion some years ago so I'm writing my thoughts on this new one, will publish them later today.

One of the key concepts promoted in .NET in the last few years is "pay for play" policy - users should only pay for what they use whenever possible.

Totally agree with the concept: on a pragmatic note though, with the time and resources available to me (it's an oss work done in my spare time), personally I tried to find the right balance between adding useful features that cover various scenarios and common production problems with performance and resource allocation. It's not always easy as you can imagine, but I'll strive to make it even better in the future.

Thank you for your work.

Thanks to you for sharing your thoughts, it's good to see participation!

jodydonetti commented 9 months ago

Oh, one thing I'd like to add that may have been missing here, is that currently FusionCache is based on the 2 main caching abstraction in .net, IMemoryCache and IDistributedCache. This is to say that, as of today, the perf limits will be MemoryCache + a delta, meaning it cannot be lower than MemoryCache.

Having said that, while there's a huge value in being able to use any implementation of IDistributedCache, there's less value in being able to support any implementation of IMemoryCache (which, basically, means MemoryCache). Because of this I'm thinking from some time to detach the hard dependency from IMemoryCache to bypass its limits, and to come up with support for more performant and lightweight implementations. This will be doable because from the start I decided to support the 2 core interfaces as "inputs", but to never expose them as "outputs": so I can add support without breaking anything.

As said I've been experimenting from some time with this approach, but currently have nothing to see yet, will update.

ps: is FastCache usable for heterogeneous data or just homogeneous? Meaning, can I use it for different types of data in the same cache (maybe by just using it with <object>) ?

Thanks.

jodydonetti commented 6 months ago

Hi @neon-sunset , I just finished a big effort in reducing cpu usage and memory allocation, and I pushed all the changes in the release/v1_0_0 branch.

First of all I'd like to thanks you again for opening this issue and pointing out some perf shortcoming that FusionCache had: lately I've been very busy with completing the last big features (the new Auto-Recovery, Open Telemetry support, etc), preparing for an On .NET episode and finishing the last touches for the big v1.0 release, culminated in the preview1 last week.

Having said that, what you pointed out is important, so I finally took some time to go allocation hunting and trying to optimize whatever I could in the common happy paths: I'm happy to report that I've been able to get some really nice results, which I'd like to show (again using your own benchmark code).

Set

Before:

Method	Mean	Error	StdDev	P95	Ratio	Rank	Allocated	Alloc Ratio
SetFastCache	17.22 ns	0.283 ns	0.265 ns	17.65 ns	0.13	1	40 B	0.12
SetMemoryCache	54.79 ns	0.481 ns	0.450 ns	55.55 ns	0.42	2	104 B	0.31
SetLazyCache	95.75 ns	1.070 ns	1.001 ns	97.07 ns	0.74	3	280 B	0.83
SetFusionCache	128.94 ns	1.373 ns	1.284 ns	130.83 ns	1.00	4	336 B	1.00

After:

Method	Mean	Error	StdDev	P95	Ratio	Rank	Allocated	Alloc Ratio
SetFastCache	17.55 ns	0.204 ns	0.191 ns	17.84 ns	0.16	1	40 B	0.25
SetMemoryCache	54.51 ns	0.397 ns	0.372 ns	55.06 ns	0.50	2	104 B	0.65
SetLazyCache	94.94 ns	0.892 ns	0.834 ns	96.08 ns	0.87	3	280 B	1.75
SetFusionCache	109.46 ns	1.102 ns	1.030 ns	111.10 ns	1.00	4	160 B	1.00

As you can see I've been able to drop the cpu usage by about 15%, barely above LazyCache while having more features, and we are still talking about nanoseconds.

The allocations instead dropped by 52%, from 336 bytes to 160 bytes: LazyCache now allocates 75% more, again while having less features.

Get

Before:

Method	Mean	Error	StdDev	P95	Ratio	Rank	Allocated	Alloc Ratio
GetFastCache	6.474 ns	0.1256 ns	0.1174 ns	6.683 ns	0.15	1	-	0.00
GetMemoryCache	26.896 ns	0.3287 ns	0.2914 ns	27.140 ns	0.62	2	-	0.00
GetLazyCache	34.369 ns	0.4198 ns	0.3927 ns	34.895 ns	0.79	3	-	0.00
GetFusionCache	43.267 ns	0.2805 ns	0.2190 ns	43.494 ns	1.00	4	48 B	1.00

After:

Method	Mean	Error	StdDev	P95	Ratio	Rank	Allocated	Alloc Ratio
GetFastCache	6.511 ns	0.0934 ns	0.0874 ns	6.633 ns	0.17	1	-	NA
GetMemoryCache	26.503 ns	0.2110 ns	0.1974 ns	26.814 ns	0.68	2	-	NA
GetLazyCache	33.986 ns	0.2404 ns	0.2131 ns	34.317 ns	0.87	3	-	NA
GetFusionCache	38.936 ns	0.2911 ns	0.2581 ns	39.367 ns	1.00	4	-	NA

Here cpu went down about 10-15% , which is nice.

Allocations instead became awesome: they dropped to a beautiful zero, so no more allocations (damn closures).

Conclusions

This is not the end of the optimization story, and now that all the main features (and then some) are there and v1.0 is around the corner, I plan on working even more on optimizing cpu usage and allocations.

Of course, as you pointed out yourself, while the other caches are memory (L1) caches with a set of features that goes from barebone (so to be micro-optimized) to relatively small, FusionCache is an hybrid cache (transparently L1 or L1+L2) with a lot of features available: resiliency, timeouts, optional 2nd level with a backplane, extensive logging, fully observable (Open Telemetry) and more.

All of this to say that, while the "pay for play" policy is something I absolutely agree on it's also fair to say that with all these features, even just the various "is null/enabled" checks here and there for each feature + a minimum of extra memory for the needed metadata in support of them, it will always be a kind of apples to oranges comparison, in a way.

I'm interested in knowing what you think, if you'd like.

Meanwhile, again thanks!

neon-sunset commented 6 months ago

Awesome! The reductions in memory usage are welcome and will help with reducing the churn on load spikes.

As for you asking whether FastCache is usable within FusionCache - at this point I'm regretting quite aggressive design decision to make it a global static cache, it definitely does not fit together well with FusionCache which is instanced and isolated in a more, you could say, reasonable way. Maybe in the future in some version 2.0 this will change but I just don't have time for this at the moment unfortunately. The idea of very thin layer on top of NonBlocking.ConcurrentDictionary worked really well in production with eviction not being "front-loaded" like other cache libraries do - letting another thread deal with that. But it needs work. There's another JitBit.FastCache library which ~~stole~~ copied the approach from FastCache.Cached but it does not do any of the necessary eviction heavy lifting (you have to do some form of parallelization and you can't be running the thing every 10s or so), does not use NonBlocking and boxes cached value in another class, all three of which become big limitations when scaling to >1-10M items. Just letting you know so you don't repeat repeat its mistakes.

Therefore, if you think it's worth investing the effort, FusionCache could see either its own FastCache-like memory cache implementation or opt to wait for the better first-party alternative to MemoryCache to be provided. The third extra difficult option which is what I wanted to do back then for 2.0 of fastcache was writing a Segcache-like implementation envisioned after Twitter's take on Segcache (https://github.com/twitter/pelikan/tree/master/src/storage/seg) but version 1.x turned out to be sufficient in terms of both memory and eviction throughput scaling for the tasks I was solving back then so such 6-12 months-long undertaking lost its attractiveness 😅

As for your note regarding the inherent cost of internal and external abstractions, there are a couple of points I wanted to share which may be of interest to you:

As of .NET 8, the actual cost on microbenchmarks will be hidden because DynamicPGO will devirtualize the calls to most frequently used type of virtual/interface call or a delegate. This makes estimating performance more difficult because should the user's code not have such monomorhpic distribution of used instances, the cost of abstractions will resurface. You may also want to use [SimpleJob(RuntimeMoniker.NativeAot80)] on occasion and see how much its performance differs from JIT which can partially showcase you how the code would perform should DynamicPGO be not as successful
Internal abstractions can be made zero-cost by using struct generics constrained to an interface type over accepting interface type itself - .NET will monomorphize such calls to exact implementation e.g. CoreLib uses this technique heavily for performance-sensitive code. However, this is one of the "last resort" optimizations that are useful when other low hanging fruits have been taken. So if you feel like continuing performance work, you may want to start with the profiler instead, I'm mostly saying this because am too guilty of not doing so and favouring blind microbenchmarks instead 😀

Watching the episode right now :)

jodydonetti commented 6 months ago

Thanks for all the info, really damn interesting.

Will definitely try the AOT benchmark: I was already curious and now even more 😬

About the second point: do you have maybe some material about that? Blog post, a docs page, examples, anything really.

Thanks again, and if you like let me know how about the episode.

jodydonetti commented 6 months ago

Hi all, I just released preview2 🥳

Please give it a try, thanks!

jodydonetti commented 6 months ago

Hi all, I've finally released v1.0 🥳

Feels a little unreal tbh, but yep: v1.0 is out.

ZiggyCreatures / FusionCache