fluxcd / source-controller

The GitOps Toolkit source management component
https://fluxcd.io
Apache License 2.0
239 stars 187 forks source link

Cache metric issues #1644

Open darkowlzz opened 1 day ago

darkowlzz commented 1 day ago

This is to highlight the issue with the cache metrics and discuss how to handle it with respect to the new generic typed cache.

At present, in source-controller we have a custom cache implementation for storing the Helm repository index, refer https://github.com/fluxcd/source-controller/tree/v1.4.1/internal/cache. This cache can store any value in the form of an empty interface interface{}. For instrumenting the cache, it exports a metric by the name gotk_cache_events_total with labels event_type, name and namespace for type of cache event (hit or miss), name of the Flux object for which the cache is used and namespace of the object, respectively.

Following is how it appears when exported:

# HELP gotk_cache_events_total Total number of cache retrieval events for a Gitops Toolkit resource reconciliation.
# TYPE gotk_cache_events_total counter
gotk_cache_events_total{event_type="cache_hit",name="podinfo",namespace="default"} 3
gotk_cache_events_total{event_type="cache_miss",name="podinfo",namespace="default"} 1

If it is known that the only cache in source-controller is the Helm index cache, this seems fine. But when we need to add other caches in source-controller for other purposes, this metric gets confusing.

In https://github.com/fluxcd/pkg/tree/main/cache, a new generic typed cache is implemented which will be used for different cache implementations across Flux. See https://github.com/fluxcd/pkg/pull/817 for the latest implementation of the generic typed cache.

The following sections describe the issues with the existing metrics with respect to the new generic typed cache.

Lack of object kind information

When a new cache instance (using the existing cache implementation), say auth token cache, is added for GitRepository, OCIRepository, etc, due to lack of the object kind in the metric label, we can't determine which object a metric belongs to. Two objects of different kinds may have the same name and namespace.

The new generic typed cache also has a similar metric but it also includes the object kind, which would help determine the associated object accurately.

Fixed metric name

The cache metric has hard-coded name, gotk_cache_events_total. If new cache instance and metrics are added, they'll have the same name. The metric registerer panics with the error duplicate metrics collector registration attempted when attempting to register two metrics with the same name.

This is true for the new generic typed cache too. The metric name is hard-coded.

In case of the old/existing cache implementation, the stored value is an empty interface, which can store any value. This enables the usage of the same cache for storing different items, although that may not be ideal, depending on the use case. A single shared cache for everything would work fine.

But the new generic typed cache can't be shared because the cache instance are created with their type and can only store values of specific type. This property of the new cache requires creating separate caches for storing different types of data. A cache for storing the helm index cache of type IndexFile. Another cache for storing auth tokens of type string. The two with hard-coded metric names can't be registered with the prometheus registerer.


Possible solutions

I believe ideally, the new cache would replace the existing cache in source-controller. But for that we need to address the above issues. Using the new generic typed cache will address the first issue as it includes the object kind in the metric. But for addressing the fixed metric name issue, we may need to allow dynamic/variable names.

The constructor of the cache metrics can accept a prefix which can be added before the metric's fixed name. For example, if the fixed name is cache_events_total, a prefix gotk_helm_index could be passed, resulting in a new metric with name gotk_helm_index_cache_events_total. This would make is easier to instrument individual caches and identify them easily. In case of the helm index, all the metrics will belong to HelmRepository, but for auth token cache, the same instance of cache may be shared amongst multiple reconcilers and the object kind would help filter the metric for specific object kinds.

This would also benefit other users of our packages who may like to enable the caching support. They can pass their own prefix, as gotk_cache wouldn't mean anything to their users.

Concerns

The new generic typed cache implementation will be updated accordingly.

stefanprodan commented 1 day ago

I'm for adding the cache prefix to the package and have dedicated metrics per kind in source-controller. Given that Git and OCI can't share the same key/token for Azure, I suspect this will be the case for other providers too.

matheuscscp commented 11 hours ago

I think the cache metrics should have specific names based on use case, and not specifically kind. Each use case will most likely always have a specific kind associated with, but I don't think the kind is what matters the most, as we may have two different use cases that are both associated with the same kind. I can't think of a specific example from the top of my head, but storing Helm indexes and auth tokens are already an example of two different use cases for caching, and the kind they relate to is only crucial for the former.

So I'm also for adding a metric name prefix to the metric/cache constructor. I'm fine with making the kind part of this prefix, as part of the use case name.