x-pack/filebeat/input/entityanalytics/provider/okta: Rate limiting hangs

chrisberkhout commented 4 months ago

Around the time of a rate limit reset, the rate limiting code may set a negative target rate and wait forever before the next request.

The rate comes out negative if current time is after the reset time returned in response headers. If a negative rate is set at a time when no request budget has accumulated, it will not recover. How previous events and timing affect the outcome can be seen in this example code.

We can avoid setting a negative rate by changing == 0 to <= 0 here.

There are some other corrections that can be made to the rate limiting logic, listed below. Beyond correctness, there are improvements that could be made for better operability, fault tolerance and user feedback.

A similar set of changes should be considered in the CEL input and related Mito code (in OktaRateLimit and in DraftRateLimit).

### Fix rate limiting logic - https://github.com/elastic/beats/pull/41583
- [x] Avoid setting a negative rate
- [ ] Stop requests until reset rather than doing one more burst when `x-rate-limit-remaining: 0`
- [ ] Use a separate rate limiter for each endpoint (`/api/v1/users` vs `/api/v1/users/<userid>/groups`)

### Improve operability
- [x] Add debug logging around waits #40347
- [ ] Add a timeout to the context passed into `x/time/rate`, so very long waits return an error immediately
- [ ] Add configuration options to override normal rate limiting with a constant rate, for use as a workaround

### Improve fault tolerance and user feedback
- [ ] Don't fail sync on HTTP 429
- [ ] Consider any adjustments appropriate to handle [429 responses for exceeding the concurrent rate limit](https://developer.okta.com/docs/reference/rl-best-practices/#example-rate-limit-header-with-concurrent-rate-limit)
- [ ] Publish events progressively rather than after receiving all data for a full sync, which may take hours

elasticmachine commented 4 months ago

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

chrisberkhout commented 4 months ago

Stop requests until reset rather than doing one more burst when x-rate-limit-remaining: 0

Setting a negative rate in the distant past avoids generating new tokens for the period from now until waitUntil (if the rate was zero a concurrent caller could still consume the burst): <-- True, but it won't clear existing tokens

diff --git a/x-pack/filebeat/input/entityanalytics/provider/okta/internal/okta/okta.go b/x-pack/filebeat/input/entityanalytics/provider/okta/internal/okta/okta.go
index 58495cbcd6..b33c7be7e6 100644
--- a/x-pack/filebeat/input/entityanalytics/provider/okta/internal/okta/okta.go
+++ b/x-pack/filebeat/input/entityanalytics/provider/okta/internal/okta/okta.go
@@ -444,6 +444,7 @@ func oktaRateLimit(h http.Header, window time.Duration, limiter *rate.Limiter) e
        // estimate will be overwritten when we make the next
        // permissible API request.
        next := rate.Limit(lim / window.Seconds())
+       limiter.SetLimitAt(time.Time{}, rate.Limit(-1))
        limiter.SetLimitAt(waitUntil, next)
        limiter.SetBurstAt(waitUntil, burst)
        return nil

elastic / beats

x-pack/filebeat/input/entityanalytics/provider/okta: Rate limiting hangs #40106