Rate limiting might cause significant memory usage

Environment Description

Clusters Architecture

We have a system with multiple Kubernetes clusters, where one cluster is the main one. The difference between the main cluster and the other clusters is that the main cluster also holds metadata related to the components on the other clusters. The structure looks as follows:

Main Cluster
- Namespace 1
- Deployment 1.1
- Deployment 1.2
- Metadata 1
- Namespace 2
- Metadata 2
- Namespace 3
- Metadata 3
- ...
Cluster 2
- Namespace 2
- Deployment 2.1
- Deployment 2.2
- ...
Cluster 3
- Namespace 3
- Deployment 3.1
- Deployment 3.2
- ...
...

This configuration results in a significantly larger number of namespaces on the main cluster compared to the other clusters:

Main Cluster: ~4400
Cluster 2: ~300
Cluster 3: ~300
...

Exporter Configuration & Scraping

We run the exporter with the --watch-kube-secrets flag. When this flag is enabled, the exporter works as follows:

When it starts, it fetches all namespaces and, for each namespace, it retrieves and parses secrets.
When the /metrics endpoint is called, it executes the same process.

We scrape the metrics every minute, which was causing a high load on the Kubernetes API server. To address this, we set the --max-cache-duration flag to 5 minutes. This setting helps to avoid overloading the Kubernetes API server with too many requests after startup. However, we still encountered a problem when the exporter was started for the first time, resulting in more than 4400 API queries. Therefore, we were very pleased to see the rate limiting feature introduced in the 3.14.0 release.

Problem Description

On a cluster with ~4400 namespaces, we configured the rate limiting as follows:

qps: 20
burst: 10

This configuration led to a significant increase in memory usage. Previously, the exporter was running fine with less than 100 MiB of memory. However, with this configuration, it began consuming more than 2 GiB, which ultimately resulted in the process being killed due to exceeding the configured limits.

exporter-memory-usage

We performed the analysis, and now there are a few things worth to address.

Detected Problems

Information About the Used Rate Limiting Algorithm

The feature is based on the Token Bucket Rate Limiting Algorithm. There are two strategies how the bucket is filled with tokens, so new queries could be processed.

Interval-based

It adds qps tokens to the bucket every second. This means the qps value should be equal to or lower than burst. In our case, it was set to:

qps: 20
burst: 10

However, this effectively resulted in:

qps: 10
burst: 10

With the cache set to 5 minutes, this configuration could handle a maximum of ~3000 requests (10 tokens/second 60 seconds/minute 5 minutes).

Greedy

It adds 1 token to the bucket every 1/qps seconds. In this case, having a larger value for qps compared to burst is not an issue.

With the cache set to 5 minutes, our configuration could handle a maximum of ~6000 requests (20 tokens/second 60 seconds/minute 5 minutes).

The exporter uses an interval-based strategy. This means we configured it in such a way that it was never able to process all namespaces. By the time it had a chance to update the cache for the last namespace, the first one had already expired again.

I believe that documenting information about the algorithm used would be very helpful in preventing such issues.

Note: the new --kube-api-rate-limit-qps and --kube-api-rate-limit-burst flags have not been documented in the README.md file.

Ability to Include Namespaces by Label or Selector

Currently, it is possible to include and exclude namespaces by name. This approach might work for small clusters, but it becomes problematic when namespaces are created by a system. In such cases, inclusion and exclusion by name can be cumbersome. The system can generate the list of names, but any change would require modifying the exporter deployment to update the command-line arguments. Considering that starting the exporter involves fetching all data, this is not a feasible solution.

Such a feature would allow us to label the namespaces, so we could limit the request to watch ~300 namespaces on the main cluster instead of ~4400.

Optimize Algorithm To Collect Data

The current implementation executes logic to fetch and parse all certificates when the /metrics endpoint is called. For each namespace, it checks if the cache is valid:

Yes: Return the cached data.
No: Fetch the data and save it into the cache.

Since the rate limiting feature was introduced, a few issues have arisen with this approach.

Add Mutex to Prevent Concurrent Updates to the Same Namespace Cache

No mutex is used for updating the cache, so when multiple /metrics calls are in progress, they may try to check the same namespace simultaneously. Without rate limiting, this situation occurs rarely. However, with the rate limiting feature, the likelihood of multiple calls attempting to update the same namespaces increases, especially when the qps and burst values are close to the number of namespaces. The logic always processes namespaces in alphabetical order, so the further along in the list, the higher the chance of contention for updating the same namespace cache due to a lack of tokens in the rate limiting bucket.

Prevent Parallel Data Collection

Before the rate limiting feature was introduced, it wasn't a significant issue for the cache to be updated by multiple concurrent calls. However, with rate limiting in place, it has become problematic. When the first call is blocked due to a lack of tokens in the bucket, initiating a second call that would consume the same tokens only slows down the completion of the first call. Parallel updates result in an increasing number of concurrent calls, leading to high memory usage.

Perhaps, if one call is in progress, subsequent calls should wait. Alternatively, a different approach could involve having separate logic responsible for fetching data, with the /metrics endpoint always serving cached data and never attempting to update anything. However, if the /metrics endpoint is executed only once per hour, having a separate logic that constantly checks the status of the all secrets might be a waste of resources.

Convert to a Kubernetes Operator

One option would be to convert the exporter into a Kubernetes Operator. If the Operator was to watch secret objects, there would be no need to fetch all secrets regularly. Upon starting, the Operator should fetch all secrets initially. This logic should still support rate limiting to prevent overloading the Kubernetes API server. Subsequently, the Operator would respond to events such as create, delete, or modify operations on secrets, and refresh the cache only when there are actual changes.

Summary

We solved the problem by adjusting the exporter configuration. We now understand how the exporter works and the processing bandwidth for our setup. I opened this issue to share some thoughts on how it could be improved to make it more resilient to such situations.

If you would like me to split this into multiple issues or if you need any additional details, please let me know 🙂

enix / x509-certificate-exporter