freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Add rate-limit or throttling to Percolator Search alerts Webhooks #4245

Open albertisfu opened 1 month ago

albertisfu commented 1 month ago

Currently, we limit each alert to 20 hits and up to 5 child documents per hit. This applies to emails triggered by both the Percolator and the Sweep index.

Webhooks triggered by the Sweep index have the same limits because they are constrained by the same matched search results.

However, webhooks triggered by the Percolator do not have this limitation. Instead, every document that matches an alert (after applying the corresponding filters) is sent as a Search alert Webhook in real-time.

We discussed that it would make sense to apply rate limiting or throttling to these webhooks, after first analyzing which approach better fits the use case.

mlissner commented 1 month ago

Thanks for filing this. I think there are a few goals here:

  1. Prevent people from making searches that trigger on too much stuff.
  2. Prevent abuse.
  3. Identify big users and charge them more.

I think the way to do this is:

  1. Have rate limits for each webhook event type-endpoint pair stored in the DB so they can be easily adjusted via the Admin page.

    Allow multiple rate limits per user, so that they can have, say, 5/m and 10/h at the same time.

  2. Cache the rate limits in Redis or in memory so that they don't have to be looked up in the DB before each event. Maybe a five-minute cache?

  3. Maintain rolling usage numbers in Redis for speed, and so it's similar to the API.

That part seems pretty straightforward on the whole.
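
As a rough illustration of the steps above, here is a minimal sketch of checking several limits at once for one user and one event type-endpoint pair. It is not CourtListener code: the `WebhookRateLimiter` class, the `parse_rate` helper, and the period suffixes are hypothetical, a plain dict stands in for Redis, and it uses fixed-window counters rather than truly rolling ones to keep the example short.

```python
import time
from collections import defaultdict

# Hypothetical period suffixes for rate strings such as "5/m" or "10/h".
PERIOD_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_rate(rate):
    """Turn a string like "5/m" into (max_events, window_seconds)."""
    count, period = rate.split("/")
    return int(count), PERIOD_SECONDS[period]


class WebhookRateLimiter:
    """Fixed-window counters, one per (user, event type, endpoint, rate).

    Assumption: a dict stands in for Redis. With Redis, each counter could
    be an INCR on a key that expires when its window ends. Old buckets are
    never evicted here, which a real store would need to handle.
    """

    def __init__(self, clock=time.time):
        self.clock = clock
        self.counters = defaultdict(int)

    def allow(self, user_id, event_type, endpoint, rates):
        """Return True and count the event only if every limit has room."""
        now = int(self.clock())
        keys = []
        for rate in rates:
            limit, window = parse_rate(rate)
            # The bucket index identifies the current fixed window.
            key = (user_id, event_type, endpoint, rate, now // window)
            if self.counters[key] >= limit:
                return False  # One exhausted limit blocks the event.
            keys.append(key)
        # Count the event against every limit only when all of them pass.
        for key in keys:
            self.counters[key] += 1
        return True
```

Calling `allow(user_id, "search_alert", endpoint_url, ["5/m", "10/h"])` before each delivery would apply both limits together; the event is counted only when neither limit is exhausted.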

The next question is what the default rate(s) should be. We monitor the usage of webhooks and they're envisioned mostly for commercial orgs, so I think the answer is:

  1. Set a value that protects us from problems (DDOS, whatever).

  2. When an org starts using these, negotiate a number of webhook events for them, and set it to that value.
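
One way to picture a protective default combined with negotiated per-org values and a five-minute cache is the sketch below. Everything in it is an assumption for illustration: the `RateLimitConfig` class, the `load_from_db` callable, and the `60/m` default are hypothetical and not values proposed in this thread.

```python
import time

# Hypothetical protective default, used when no negotiated limit exists.
DEFAULT_RATES = ["60/m"]
# Five-minute cache lifetime for looked-up limits.
CACHE_TTL_SECONDS = 300


class RateLimitConfig:
    """Resolve an org's rate limits, caching the DB lookup in memory."""

    def __init__(self, load_from_db, clock=time.monotonic):
        # load_from_db: callable taking an org id and returning a list of
        # rate strings, or None when nothing has been negotiated.
        self.load_from_db = load_from_db
        self.clock = clock
        self._cache = {}  # org_id -> (time cached, rates)

    def rates_for(self, org_id):
        now = self.clock()
        cached = self._cache.get(org_id)
        if cached is not None and now - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]  # Still fresh, so skip the DB.
        # Fall back to the default when the org has no negotiated limits.
        rates = self.load_from_db(org_id) or DEFAULT_RATES
        self._cache[org_id] = (now, rates)
        return rates
```

With this shape, a change made on the Admin page would take effect within five minutes, once the cached entry expires.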

The final question is what to do when rates are exceeded. I think the kind thing to do is:

I think that largely covers the bases?