backstage / backstage

Backstage is an open framework for building developer portals
https://backstage.io/
Apache License 2.0

Add `clear` and `iterator` methods to the `CacheService` #24617

Open Andy2003 opened 2 weeks ago

Andy2003 commented 2 weeks ago

This PR adds two methods to the `CacheService`: `clear` and `iterator`.
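For illustration, here is a rough sketch of what the proposed additions could look like on a CacheService-style interface. The method names follow the PR title; the `SketchCacheService` shape and the toy `InMemoryCache` below are illustrative only, not the actual Backstage types or implementation:

```typescript
// Hypothetical sketch of a CacheService-style interface extended with the
// two proposed methods. The real Backstage interface differs in detail.
interface SketchCacheService {
  get(key: string): Promise<unknown | undefined>;
  set(key: string, value: unknown, options?: { ttl?: number }): Promise<void>;
  delete(key: string): Promise<void>;
  // Proposed additions:
  clear(): Promise<void>;
  iterator(): AsyncIterableIterator<[string, unknown]>;
}

// Toy in-memory implementation, just to show the intended semantics.
class InMemoryCache implements SketchCacheService {
  private store = new Map<string, unknown>();
  async get(key: string) { return this.store.get(key); }
  async set(key: string, value: unknown) { this.store.set(key, value); }
  async delete(key: string) { this.store.delete(key); }
  async clear() { this.store.clear(); }
  async *iterator(): AsyncIterableIterator<[string, unknown]> {
    yield* this.store.entries(); // delegate to the Map's entry iterator
  }
}
```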

✔️ Checklist

backstage-goalie[bot] commented 2 weeks ago

Changed Packages

| Package Name | Package Path | Changeset Bump | Current Version |
| --- | --- | --- | --- |
| `@backstage/backend-common` | `packages/backend-common` | minor | `v0.22.0-next.1` |
| `@backstage/backend-plugin-api` | `packages/backend-plugin-api` | minor | `v0.6.18-next.1` |
github-actions[bot] commented 2 weeks ago

Uffizzi Ephemeral Environment - Virtual Cluster

Your cluster `pr-24617` was successfully created. Learn more about Uffizzi virtual clusters. To connect to this cluster, follow these steps:

1. Download and install the Uffizzi CLI from https://docs.uffizzi.com/install
2. Login to Uffizzi, then select the backstage account and project:

   ```
   uffizzi login
   Select an account:
     ‣ backstage
       jdoe

   Select a project or create a new project:
     ‣ backstage-6783521
   ```

3. Update your kubeconfig:

   ```
   uffizzi cluster update-kubeconfig pr-24617 --kubeconfig=[PATH_TO_KUBECONFIG]
   ```

   After updating your kubeconfig, you can manage your cluster with kubectl, kustomize, helm, and other tools that use kubeconfig files:

   ```
   kubectl get namespace --kubeconfig [PATH_TO_KUBECONFIG]
   ```

Access the backstage endpoint at https://backstage-default-pr-24617-c5387.uclusters.app.uffizzi.com

Andy2003 commented 2 weeks ago

> Thank you for the suggestion! 👍
>
> Could you elaborate on why you're requesting this and how you plan to use it? My immediate gut feeling is that both of these new methods go against the intentions of the cache service, and it might be better to use the database instead.

We want to use this to cache some backend responses in a clustered/distributed way (we need this because we get `429 Too Many Requests` from a connected service). Currently we are using an LRU cache, which does not work clustered.

The `clear` method would be used for admin purposes, so we can invalidate this cache if required. The `iterator` method would be used to extract e.g. the timestamp of the oldest entry in the cache.

I have used quite a lot of caches in the past (especially in Java) and such methods are common.
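The rate-limit use case described above can be sketched as read-through caching. In this sketch, `fetchFromService` is a hypothetical stand-in for the third-party call, and `cache` can be any async get/set store (such as a plugin-scoped CacheService); none of these names are from the PR itself:

```typescript
// Hedged sketch: read-through caching of a third-party response to avoid
// 429 rate limits. `fetchFromService` is hypothetical; `cache` is any
// async get/set store.
type AsyncCache = {
  get(key: string): Promise<unknown | undefined>;
  set(key: string, value: unknown, opts?: { ttl?: number }): Promise<void>;
};

async function cachedFetch(
  cache: AsyncCache,
  key: string,
  fetchFromService: () => Promise<unknown>,
  ttlMs = 60_000,
): Promise<unknown> {
  const hit = await cache.get(key);
  if (hit !== undefined) return hit; // serve from cache, no upstream call
  const fresh = await fetchFromService();
  await cache.set(key, fresh, { ttl: ttlMs }); // cache for the TTL window
  return fresh;
}
```

With a distributed backend behind `cache`, every replica shares the same entries, which is exactly what an in-process LRU cache cannot provide.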

freben commented 2 weeks ago

I'm hesitant too. The methods presented here must be supported, somewhat efficiently, by every future implementation and underlying storage.

This is also a plugin-scoped service that often uses a single underlying instance of, for example, memcached. Suppose there are two plugins in an example setup: one that stores 50 gigabytes of fast-changing data, and a second that has a handful of keys. When the second one iterates, does the underlying store permit fast streaming based on key prefix (if at all), or will the client have to filter on the caller side? Can it easily and efficiently issue clears of subsets of keys that apply to only the current plugin?

Maybe this is just hypothetical and not a problem in practice. But just to illustrate my thoughts.
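The iteration concern can be made concrete: if the backing store cannot stream by key prefix, a plugin-scoped iterator has to scan the entire shared keyspace and filter client-side. A sketch of that worst case (names are illustrative, not Backstage APIs):

```typescript
// Illustration of the concern above: without server-side prefix streaming,
// a plugin-scoped iterator pays for every other plugin's keys too.
async function* iterateWithPrefix(
  allEntries: AsyncIterable<[string, unknown]>, // the WHOLE shared store
  prefix: string,
): AsyncIterableIterator<[string, unknown]> {
  for await (const [key, value] of allEntries) {
    // Every entry is transferred and inspected, even if only a few match.
    if (key.startsWith(prefix)) yield [key, value];
  }
}
```

For the 50 GB neighbor-plugin scenario, the small plugin's iteration cost is dominated by the large plugin's data.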

Andy2003 commented 2 weeks ago

The `clear` and `iterator` methods are optional features, and developers can choose whether or not to use them based on their specific use case and data size. For large datasets, it may be more efficient to avoid using these methods altogether, reducing potential overhead.

Implementing these methods shouldn't be a challenge for future cache implementations, as they're standard features in cache libraries. These functionalities are common, offering flexibility and compatibility with existing practices, and ensuring smooth integration and efficient cache management.

freben commented 2 weeks ago

They're not optional in the sense that the methods are there in the interface for all to use, though. If some implementations throw not-implemented exceptions, that undermines their usability, for example.

As far as I know, memcached does not support subset iteration or subset clears, right? You can theoretically issue a `cachedump`, but that gives you the whole universe. So if they have to be implemented extremely inefficiently and your plugin uses this feature, it might seem to work well today, but when you install a second plugin that heavily uses the cache, your plugin (or both) will suffer.

Look, I'm not necessarily meaning this as a hard no at all; I'm just illustrating why I'm wary of it. Databases can typically do what you need at really fast rates with an ORDER BY on the timestamp instead of scanning linearly, which is why we tend to reach for them fairly eagerly. Maybe if we merge this, we should put a big warning in the doc comments that these methods may be inefficient on some storage engines?

freben commented 2 weeks ago

Either way, you'll need to amend the tests; I'm not sure the implementation is correct. You'd want to make two clients representing different plugins, populate them with data, and ensure that their iteration and clears do not "spill over" into the other plugin's keyspace.

Unless you did intend for these to really be admin-style operations with global effects. But then they break the encapsulation that we typically try to maintain between plugins.
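The isolation test described above might be sketched like this, using a hypothetical `makePluginCache` factory that namespaces keys by plugin id over one shared store (illustrative only, not the PR's code):

```typescript
// Hedged sketch: two plugin-scoped clients sharing one store must not see
// or clear each other's keys. `makePluginCache` is hypothetical.
function makePluginCache(shared: Map<string, unknown>, pluginId: string) {
  const ns = (k: string) => `${pluginId}:${k}`; // prefix keys per plugin
  return {
    async set(k: string, v: unknown) { shared.set(ns(k), v); },
    async get(k: string) { return shared.get(ns(k)); },
    async clear() {
      // Only delete keys belonging to this plugin's namespace.
      for (const key of [...shared.keys()]) {
        if (key.startsWith(`${pluginId}:`)) shared.delete(key);
      }
    },
    async *iterator(): AsyncIterableIterator<[string, unknown]> {
      for (const [key, v] of shared) {
        if (key.startsWith(`${pluginId}:`)) {
          yield [key.slice(pluginId.length + 1), v]; // strip the prefix
        }
      }
    },
  };
}
```

A test would then populate both clients, clear one, and assert the other's keys survive.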

Andy2003 commented 1 week ago

I had to add a safeguard that throws an error as soon as the key is hashed. Maybe we should go with three methods instead: `keys`, `values` and `entries`. What do you think?
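The three-method variant floated above would mirror `Map`'s iteration API. A hedged sketch, where `IterableCache` and the toy `MapBackedCache` are illustrative names, not anything from the PR:

```typescript
// Sketch of the keys/values/entries variant, mirroring Map's iteration API.
interface IterableCache {
  keys(): AsyncIterableIterator<string>;
  values(): AsyncIterableIterator<unknown>;
  entries(): AsyncIterableIterator<[string, unknown]>;
}

// Toy Map-backed implementation, just to pin down the semantics.
class MapBackedCache implements IterableCache {
  constructor(private store = new Map<string, unknown>()) {}
  async *keys() { yield* this.store.keys(); }
  async *values() { yield* this.store.values(); }
  async *entries() { yield* this.store.entries(); }
}
```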

Rugvip commented 1 week ago

> We want to use this to cache some backend responses in a clustered/distributed way (we need this because we get `429 Too Many Requests` from a connected service). Currently we are using an LRU cache (which does not work clustered).

Is this in the catalog? It's already built into the `UrlReaderProcessor`, so you'll want to make sure you don't end up with a duplicate implementation.

> The `clear` method should be used for admin purposes, so we can invalidate this cache if required. The `iterator` method is used to extract e.g. the timestamp of the oldest entry in the cache.

What I think we could do is add the methods to the implementation class, the `CacheClient`, but not to the service interface, `CacheService`. That way you can set up your own internal thing where you can wire up some form of admin actions without needing to reimplement the entire cache manager + client, but we avoid adding these methods to the `CacheService` interface.

Andy2003 commented 1 week ago

> Is this in the catalog? It's already built into the `UrlReaderProcessor`, so you'll want to make sure you don't end up with a duplicate implementation.

No, it's a 3rd-party service.

> What I think we could do is add the methods to the implementation class, the `CacheClient`, but not to the service interface, `CacheService`. That way you can set up your own internal thing where you can wire up some form of admin actions without needing to reimplement the entire cache manager + client, but we avoid adding these methods to the `CacheService` interface.

So we either need to export `DefaultCacheClient`, or we create a second interface like `InternalCacheService` that I can cast to:

```ts
const client = cacheManager
  .forPlugin('foo')
  .getClient(options) as InternalCacheService;
```
Andy2003 commented 1 week ago

> As far as I know, memcached does not support subset iteration or subset clears, right?

You are right, memcached is very special. I tested it locally and the tests fail for the new methods. It does not even expose an iterate method via keyv, and `clear` deletes all the data.

Rugvip commented 1 week ago

Had another discussion around this in the maintainer team, and the consensus is that it's best not to add this. We perceive the value added by having these methods available in the implementation to be quite limited. Often the individual underlying caches have their own admin interfaces, and you can always set up your own connection to the cache, which then lets you use all of the available commands.
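The "set up your own connection" route might be sketched like this, abstracting the couple of backend commands an admin task needs. The `ScanClient` shape and both function names are hypothetical, loosely modeled on Redis-style SCAN/DEL; this is not a Backstage API:

```typescript
// Hedged sketch: an out-of-band admin task that clears a key-prefix by
// talking to the cache backend directly, bypassing CacheService entirely.
type ScanClient = {
  scan(pattern: string): AsyncIterable<string>; // stream keys matching pattern
  del(keys: string[]): Promise<number>;         // delete keys, return count
};

async function adminClearPrefix(client: ScanClient, prefix: string): Promise<number> {
  let removed = 0;
  const batch: string[] = [];
  for await (const key of client.scan(`${prefix}*`)) {
    batch.push(key);
    if (batch.length >= 100) {
      removed += await client.del(batch.splice(0)); // delete in batches
    }
  }
  if (batch.length) removed += await client.del(batch.splice(0));
  return removed;
}
```

This keeps the admin capability where the maintainers suggest it belongs: in operator tooling against the backend, not in the shared `CacheService` interface.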