crossplane / crossplane-runtime

A set of libraries for writing Crossplane controllers.
https://crossplane.io
Apache License 2.0
146 stars 97 forks source link

Consider exposing poll interval in ExternalClient #253

Open hasheddan opened 3 years ago

hasheddan commented 3 years ago

What problem are you facing?

Currently, all controllers that use the managed reconciler have a default poll interval of 1 minute, which can be overridden during controller setup. However, this means that all objects reconciled by that controller are forced to use the same interval, which may not make sense (e.g. a production database may need to be checked for drift more frequently then a staging one).

How could Crossplane help solve your problem?

Crossplane users should be able to define the happy path poll interval for all of their resources. This could be accomplished by exposing pollInterval field embedded in every managed resource, then getting the value when returning with RequeueAfter in the managed reconciler. Another option would be to expose the field on a ProviderConfig, which @muvaf has pointed out could make sense due to the fact that rate limiting behavior is typically scoped to a single account on providers, such as AWS.

A third option would be to just define an additional method on the ExternalClient interface, potentially something like GetPollInterval() that would allow provider authors to determine how the poll interval is chosen. It could be a field on the ProviderConfig or managed resource, provided as an argument to controller entrypoint, or some combination of the three with different options taking precedence over others.

To encourage consistency, it would be good for us to suggest a common implementation, likely by adding pollInterval fields to the embedded ProviderConfig and managed resource structs, as well as supplying a default GetPollInterval() that accepted both objects (and potentially a hardcoded default fallback) and then selected one at a chose precedence, while also allowing provider authors to override this behavior.

negz commented 3 years ago

I'd prefer to start simple here, at least until we have more confidence that folks would benefit from different poll intervals for different individual managed resources (or kinds of managed resources). I personally suspect adding a --poll-interval provider flag to each provider that sets WithPollInterval for all of the provider's controllers would be a huge win that requires little more than a touch of plumbing.

If we did want to expose this at the ProviderConfig level (which does seem like the logical next most flexible place to do this after the provider level), or even the managed resource level, I'd prefer to do that by updating the the return value of the ExternalConnector, which is typically what first reads the managed resource and its provider config. For example:

// An ExternalConnection represents a connection to an external system
type ExternalConnection struct {
    Client        ExternalClient
    PollInterval  time.Duration

    // Using a struct future proofs us in case we want to surface
    // other things up to the managed.Reconciler in future.
}

// An ExternalConnecter produces a new ExternalClient given the supplied
// Managed resource.
type ExternalConnecter interface {
    // Connect to the provider specified by the supplied managed resource.
    Connect(ctx context.Context, mg resource.Managed) (ExternalConnection, error)
}
hasheddan commented 3 years ago

@negz putting it in ExternalConnection seems good to me :+1: I would advocate for going ahead and doing that as I think a fair number of community members have already asked for this functionality at the more granular resource level, but I'll surface this to community and ask them to show support here if this functionality is desired.

smcavallo commented 3 years ago

We have a lot of different tools which are making API calls to poll our infrastructure and it is often a challenge to configure the api calls to prevent rate limiting. The ability to change the polling interval at the global level would be extremely helpful. (We would probably default to 5m.) It would be useful to change the interval at the service level. For example RDS instances for us are fairly static once created - there's no need for us to poll them every minute. Either way though, whichever method is provided, this would be great and would be a help. The one thing we would like, though, is to be able to set the polling limit using CRON instead of timer based. Ex. */10 * * * * would give us a ton of flexibility and control over when the api requests are made.

mcavoyk commented 3 years ago

In our use, we haven't come across the need to manage the poll interval at the resource level. Seems like controller-setup ends up being the place where the most knowledge of provider rate limiting can be applied. Provider flag or ProviderConfig both seem like good options for further customization outside controller setup.

benagricola commented 3 years ago

Yep in cases where I've thought this would be useful it's at the ProviderConfig or Provider level, never at the individual resource level itself.

I like the ExternalConnection approach as it would work for both of the use-cases I can think of - keeping within known API rate limits, and speeding up reconciliation of resources referencing other resources, but still configurable by the provider.

ytsarev commented 3 years ago

ProviderConfig level implementation seems like a great start and is something we really need. Obviously, the ability to override it at the managed resource level as the RDS instance example above would be also useful.

janwillies commented 3 years ago

We were seeing some excessive calls being made and provider-level rate liming would be a great start to tackle this 👍

muvaf commented 3 years ago

+1 to ProviderConfig with a default that can be controllable via flag given in Provider(slight preference) or ControllerConfig.

srueg commented 3 years ago

We also have issues with the default poll interval (or rather shortWait and longWait). Since we have more than 2000 composites and managed resources a reconcile run of all of them takes longer than the configured longWait (usually 1 minute). This means that the controller is constantly reconciling and hitting its rate limit. This massively delays reconcilation of new/changed resources (up to 6 minutes in some cases).

Also see https://github.com/crossplane/crossplane/issues/2256

So I'd definitely welcome the possibility to configure the provider's poll interval. Additionally it might make sense to also configure it for the core Crossplane reconcilers (especially the composite reconciler).

srueg commented 3 years ago

As commented here: https://github.com/crossplane-contrib/provider-sql/pull/36#issuecomment-817538531

I don't really see the difference between the poll interval and the already existing --sync (SyncPeriod)?

negz commented 3 years ago

@srueg It does seem like there is a subtle difference here. Specifically:

  1. SyncPeriod seems to work by setting the ResyncPeriod of the controller-runtime's cache. The cache is a bunch of informers that watch any type the client is interested in, with a re-list every ResyncPeriod to make sure the watches didn't miss anything. So one difference is that setting SyncPeriod to something like 60 seconds would cause the controller-runtime client to re-list every type it was concerned with every 60 seconds.
  2. We'd like to be able to set a different poll interval for each controller, which SyncPeriod doesn't easily support (I think we'd need to create a different client/cache for each controller to do so, rather than using the client returned by mgr.GetClient()).

As @hasheddan pointed out to me, controller-runtime recommends the approach we take with PollInterval for our use case in which we want to poll things we cannot watch (i.e. external APIs) per https://github.com/kubernetes-sigs/controller-runtime/blob/a1e2ea2b17c3f64b1a060234863282cb330473c6/pkg/manager/manager.go#L127

That all said, I do think the relationship between PollInterval and SyncPeriod is confusing, especially because we typically expose the latter as a flag and not the former. I certainly did a double take thinking we'd accidentally reinvented a wheel when you asked this question. :)

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.