ZiggyCreatures / FusionCache

FusionCache is an easy to use, fast and robust hybrid cache with advanced resiliency features.
MIT License
1.65k stars 87 forks source link

[FEATURE] 🔭 OpenTelemetry support #187

Closed jodydonetti closed 7 months ago

jodydonetti commented 9 months ago

Problem

Observability is a key feature of modern software systems that allows us to clearly see what is going on at any given time.

It is composed of 3 main parts:

FusionCache already has rich logging support via the standard ILogger interface, with various options available to granularly configure it.

Metrics are not currently available in FusionCache natively, even though there's some support via third-party plugins.

Traces instead is currently not available at all.

It would be nice to add native support for the missing parts.

Solution

From some time now OpenTelemetry has become a fundamental staple in modern software development to bring complete observability to our applications.

FusionCache should adopt it.

Thankfully in .NET they adopted a way to do it with which, for library authors, it's possible not to take a dependency on the OpenTelemetry packages and simply use existing concepts in the BCL, like Activity, ActivitySource, Meter etc and have the OpenTelemetry consuming side interact with them, which is nice.

Alternatives

Even though metrics are kinda obtainable via plugins (as it already is, see the link above), it would be better to have native support for it.

Traces instead are not achievable unless there's native support for it inside of FusionCache internals.

Questions

I still have some open questions, which I'll list here: if anyone has a suggestion about it please chip in!

Thanks.

Logging

Having already rich support for logging via the ILogger interface, is there something else that should be done regarding OpenTelemetry integration? Maybe I'm missing something.

Traces: minimum level?

Regarding traces I've yet to undersand how to fine-tune the "verbosity" of it, meaning that if with Logging we have the common concept of a "minimum log level" to avoid bloating our logs in production, I'm still trying to understand how to achieve the same level of control with traces.

One idea may be to have different bool flags to enable/disable traces for different parts of the system like memory cache, distributed cache, backplane, etc with options like bool EnableDistributedCacheTraces or similar.

Another approach may be to re-use the LogLevel type with an option named something like TracesMinLevel, with which a user can choose the level of details of the traces. I don't like that it would be mixing different concepts (logs/traces) so if there's something better I'd like that.

Traces: instance granularity?

Another question is about instance granularity: with logging I can create 2 different instances of FusionCache and pass a logger to one but not the other: how can we achieve something like this with traces?

I was thinking about an instance-level bool switch (or the aforementioned LogLevel one) via instance-specific options, so that it would be possible to achieve that.

Metrics: which ones?

What metrics should be collected? I'm thinking probably the ones related to already existing events like cache hit (general/fresh/stale), cache miss, factory success, factory error, etc. Anything else? Since in metrics, on top of counters, there are also histograms would it make sense to also record things like factory durations, distributed cache operations durations and so on? Maybe it's overkill, maybe not.

Also with metrics, should we always record them all or allow a granular selection via some options?

Metrics: instance granularity?

Just like with traces, the question is if there's a way to enable/disable metrics for a specific cache instance?

Statistics

Finally, on a related note, by adding native support for metrics, is there a meaning in also adding some sort of direct access to locally collected statistics, like the ones recently added in MemoryCache? On one hand it seems a nice addition, but on the other hand it would be like collecting metrics twice, and for what? Also, metrics with OpenTelemetry are long-lived, whereas simple stats stored only in memory would be wiped out at each restart of the application, and by having real metrics support what would be the points? My current position is not to add them, but if you have any idea for why I should, please let me know.

jodydonetti commented 9 months ago

This is a sample of the result we will have soon, in this case via the Jaeger UI:

image

This example is using a lot of different features of FusionCache all at once, so that I can see what would be like to have one of the most complex scenario rendered out:

Looks good?

marafiq commented 9 months ago

I am more interested in Metrics with respect to Cache as they provide more value to make efficient use of cache, helps you adjust your cache strategy.

I'm thinking probably the ones related to existing events like cache hit (global/fresh/stale), cache miss, factory success, factory error, etc.

The above metrics are a good start: Cache hit/miss will be a good start. Do not care much about global vs per instance level metrics as long as I can see the ratios of usage. But global seems a good start.

Traces are very handy when debugging production. My employer uses DynaTraces, and I know from experience that it has saved us a lot of hours and helped fix things quickly. But in terms of cache, I think traces seem to have more value when serving stale data than all the time. Because, when serving stale data, I would like to know, how it ended up here especially when serving stale is an edge case.

jodydonetti commented 9 months ago

Do not care much about global vs per instance level metrics as long as I can see the ratios of usage. But global seems a good start.

Maybe I've expressed myself badly here, what I meant was:

Maybe "general" would've been a better word for it, I'm updating the wording, thanks.

marafiq commented 9 months ago

fresh: hit count for non-stale values stale: hit count for stale values

I would prefer the above because my experience is stale is an edge case that makes the difference as you outlined valuable

jodydonetti commented 9 months ago

I would prefer the above because my experience is stale is an edge case that makes the difference as you outlined valuable

Yes, I'm planning to add all 3 of them, so you can pick the one you are more interested in.

Basically if the cache had 10 cache hits, 3 of which were stale and 7 of which were not, you'll have 3 counters with:

Sounds good?

akoken commented 9 months ago

Logging

Having already rich support for logging via the ILogger interface, is there something else that should be done regarding OpenTelemetry integration? Maybe I'm missing something.

I'm not sure about that TBH. We prefer to use ELK stack for logging. However, some users might desire an all-in-one solution encompassing logging, tracing, and metrics using OpenTelemetry. This could be implemented based on user demand.

Traces: minimum level?

One idea may be to have different bool flags to enable/disable traces for different parts of the system like memory cache, distributed cache, backplane, etc with options like bool EnableDistributedCacheTraces or similar.

Using boolean flags are very common approach.

Another approach may be to re-use the LogLevel type with an option named something like TracesMinLevel, with which a user can choose the level of details of the traces. I don't like that it would be mixing different concepts (logs/traces) so if there's something better I'd like that.

I believe the concept of TraceLevel doesn't currently exist. A disscussion sheds light on why it's absent.

Metrics: which ones?

What metrics should be collected? I'm thinking probably the ones related to already existing events like cache hit (general/fresh/stale), cache miss, factory success, factory error, etc. Anything else? Since in metrics, on top of counters, there are also histograms would it make sense to also record things like factory durations, distributed cache operations durations and so on? Maybe it's overkill, maybe not.

Necessary metrics for adjusting the cache strategy could be recorded by default, while others can be managed by boolean flags.

Also with metrics, should we always record them all or allow a granular selection via some options?

It depends on performance impact. Granular selection could be better option.

Statistics

Finally, on a related note, by adding native support for metrics, is there a meaning in also adding some sort of direct access to locally collected statistics, like the ones recently added in MemoryCache? On one hand it seems a nice addition, but on the other hand it would be like collecting metrics twice, and for what? Also, metrics with OpenTelemetry are long-lived, whereas simple stats stored only in memory would be wiped out at each restart of the application, and by having real metrics support what would be the points? My current position is not to add them, but if you have any idea for why I should, please let me know.

I aggree.

akoken commented 9 months ago

This is a sample of the result we will have soon, in this case via the Jaeger UI:

image

This example is using a lot of different features of FusionCache all at once, so that I can see what would be like to have one of the most complex scenario rendered out:

  • it has fail-safe enabled
  • I added some simulated delays via chaos components, to stress test soft/hard timeouts with background factory completion (the move of the running factory to a background thread is marked with an ActivityEvent, which I think is nice)
  • I enabled background execution of distributed cache and backplane operations, so they are not blocking the main GetOrSet call, since I wanted to see different dependent timelines goign "outside" the bounds of the main call
  • more

Looks good?

It looks amazing @jodydonetti!

jodydonetti commented 9 months ago

I'm not sure about that TBH. We prefer to use ELK stack for logging. However, some users might desire an all-in-one solution encompassing logging, tracing, and metrics using OpenTelemetry. This could be implemented based on user demand.

Based on what I've read, the story about logging + OpenTelemetry in .net is just this: use the existing shared abstractions (ILogger & co) and add a sink based on OpenTelemetry, like this:

builder.Logging.ClearProviders()
    .AddOpenTelemetry(loggerOptions =>
    {
        loggerOptions
            // define the resource
            .SetResourceBuilder(resourceBuilder)
            // add custom processor
            .AddProcessor(new CustomLogProcessor())
            // send logs to the console using exporter
            .AddConsoleExporter();

        loggerOptions.IncludeFormattedMessage = true;
        loggerOptions.IncludeScopes = true;
        loggerOptions.ParseStateValues = true;
    });

I believe the concept of TraceLevel doesn't currently exist. A disscussion sheds light on why it's absent.

Ah, thanks for linking this, I'll have a read about that!

Necessary metrics for adjusting the cache strategy could be recorded by default, while others can be managed by boolean flags.

Good point, will experiment with this approach.

jodydonetti commented 9 months ago

It looks amazing @jodydonetti!

Thanks! I'm really happy about how it is already looking 😬

cijothomas commented 8 months ago

Logging Having already rich support for logging via the ILogger interface, is there something else that should be done regarding OpenTelemetry integration? Maybe I'm missing something.

If already using ILogger, that is good enough! OTel is not introducing new logging API, and instead it integrates ILogger.

One idea may be to have different bool flags to enable/disable traces for different parts of the system like memory cache, distributed cache, backplane, etc with options like bool EnableDistributedCacheTraces or similar.

If this is referring to Logs, then using appropriate Logger Category is good, as users can filter/set-log-level etc. based on LoggerCategory.

If this is referring to traces (as in distributed traces), then you can consider different ActivitySource for each logical component. OTel allows enable/disable per each ActivitySource.

jodydonetti commented 7 months ago

Hi all, I just released v0.25.0-preview1 🎉

This is a pre-release version which already contains basically all the observability work done, with both traces and metrics.

Please try it and let me know what you think.

ps: many thanks to @martinjt for the invaluable support in better understanding the whole OpenTelemetry game, including some nuanced details about best practices, usages and so on. Thanks!

jodydonetti commented 7 months ago

Anybody have been able to try this out?

Any opinion would be greatly appreciated!

akoken commented 7 months ago

Hi @jodydonetti

Sorry I wasn't able to get back to you sooner. I've been quite occupied in the past few weeks. I've recently explored the tracing feature, and it's exactly what I needed. Thanks again for all your efforts!

image

jodydonetti commented 7 months ago

Thanks @akoken , I'm very glad it's working well for you!

Btw, sneak peek: v0.25.0-preview2 coming soon, with some more tweaks and perf improvements to telemetry + the last 2 new features before finally going v1.0 😬

akoken commented 7 months ago

Awesome 🚀 Looking forward to the final version.

JoeShook commented 7 months ago

Just now getting a chance to look.

Lookint at the OnEviction and comparing to my plugin behavior. In my Plugin the I chose to have two counters (instruments). One for "expire" and one for "capacity". You can see the logic in the Plugin's OnExpire handler.

Or you could do follow the OnHit pattern by adding a tag to indicate eviction reason.

Assuming I find more time this week to test this out, my goal is to not be dependent on my ZiggyCreatures.FusionCache.Metrics package.

The labor will mostly lie in naming convention mapping. Like in my environment all apps have a prefix for all meter names. But this should not be a problem. For example, I can call WithMetrics again and add a metric view via the AddView extension. The example below is just prefixing. I also need to map tag names to translate to the naming convention dependencies I have on existing Grafana dashboards. I will have to see if there is a technique for this in the Otel libraries. Trying to avoid doing the translation in a place like telegraf or Otel collectors.

If I don't find a way then I would come back to the technique I mentioned before. Create a ISemanticConventions and default implementation that holds all the const values for meternames and tags. Then the consumer can pass in it's own imlementation of ISemanticConvention to control naming convention.

builder.Services.AddOpenTelemetry()
    .WithMetrics(metricsBuilder =>
    {
        metricsBuilder.SetResourceBuilder(resourceBuilder);
        metricsBuilder.AddInfluxDBMetricsExporter(options =>
        {
            ...
        });

        metricsBuilder.ConfigureResource(r => r.AddAttributes(new List<KeyValuePair<string, object>>()
        {
            new KeyValuePair<string, object>("application", appInfo.Name),
            new KeyValuePair<string, object>("applicationVersion", buildInfo.Version),
        }));

        metricsBuilder.AddView((instrument) =>
        {
            if (instrument.Name.StartsWith("fusioncache."))
            {
                return new MetricStreamConfiguration()
                {
                    Name = $"somePrefix_{appInfo.Name}_{instrument.Name}"
                };
            }

            return null;
        });
    });

So far things are looking good. I am very happy Open Telemetry is arriving in FusionCache.

martinjt commented 7 months ago

Interesting conversation on naming there.

Having the consumer of a library manage the naming of meters, instruments and attributes is more of an anti-pattern that a library shouldn't support.

I can understand your usecase (I.e. you have existing dashboards etc.) And this is more of a usecase for the collector (which you mentioned not wanting to do which is interesting).

The Semantic Conventions are set by otel, for everything else is naming conventions specific to the repo. So if FusionCache were to support that, I wouldn't recommend using the term "semantic" as it implies you'll be overriding the ratified conventions (when they are ratified).

From a naming perspective, the hierarchy should be.

For application code, you should use your own prefix, but that shouldn't influence the names given to signal data from libraries.

JoeShook commented 7 months ago

Thanks @martinjt for the comments. I almost didn't even mention how I used my own ISemanticConventions implemenation. My early usage of this technique was not to replace the ISemanticConventions but to use the technique in my own context. Going forward I am going to follow your advise and stop using this technique. I agree it felt like an anti-pattern as I revisited it.

jodydonetti commented 7 months ago

Thanks, these insights are very useful in better understanding the nuances that sometimes get lost.

jodydonetti commented 7 months ago

Hi all, I just release v0.25.0 🥳