Unexpected rejections with rate limiters that allow more than a billion requests/second and are accessed from more than one thread

rkday-pro commented 1 year ago

Here is a minimal repro:

use governor::{
    clock::QuantaClock,
    middleware::StateInformationMiddleware,
    state::{InMemoryState, NotKeyed},
    Quota, RateLimiter,
};
use std::sync::Arc;
use std::thread;

fn rlspin(rl: Arc<RateLimiter<NotKeyed, InMemoryState, QuantaClock, StateInformationMiddleware>>) {
    for _ in 0..1_000_000 {
        rl.check().map_err(|e| dbg!(e)).unwrap();
    }
}

fn main() {
    let clock = QuantaClock::default();
    let quota = Quota::per_second(1_000_000_001.try_into().unwrap());
    dbg!(quota);

    let rate_limiter: Arc<
        RateLimiter<NotKeyed, InMemoryState, QuantaClock, StateInformationMiddleware>,
    > = Arc::new(
        RateLimiter::direct_with_clock(quota, &clock)
            .with_middleware::<StateInformationMiddleware>(),
    );

    let rate_limiter2 = rate_limiter.clone();

    thread::spawn(move || {
        rlspin(rate_limiter2);
    });
    rlspin(rate_limiter);
}

(with a Cargo.toml that just depends on governor 0.6.0)

This reliably fails for me:

$ cargo run --release
   Compiling govtest v0.1.0 (/home/robday/src/mntnlake/govtest)
    Finished release [optimized] target(s) in 0.98s
     Running `target/release/govtest`
[src/main.rs:19] quota = Quota {
    max_burst: 1000000001,
    replenish_1_per: 0ns,
}
[src/main.rs:12] e = NotUntil {
    state: StateSnapshot {
        t: Nanos(0ns),
        tau: Nanos(0ns),
        time_of_measurement: Nanos(224.941µs),
        tat: Nanos(224.941µs),
    },
    start: QuantaInstant(
        Nanos(170098.543484062s),
    ),
}
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: NotUntil { state: StateSnapshot { t: Nanos(0ns), tau: Nanos(0ns), time_of_measurement: Nanos(224.941µs), tat: Nanos(224.941µs) }, start: QuantaInstant(Nanos(170098.543484062s)) }', src/main.rs:12:41
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

However, if I change line 18 to let quota = Quota::per_second(1_000_000_000.try_into().unwrap()); it reliably passes:

$ cargo run --release
   Compiling govtest v0.1.0 (/home/robday/src/mntnlake/govtest)
    Finished release [optimized] target(s) in 0.88s
     Running `target/release/govtest`
[src/main.rs:19] quota = Quota {
    max_burst: 1000000000,
    replenish_1_per: 1ns,
}
$

and if I comment out the thread::spawn it also passes:

$ cargo run --release
   Compiling govtest v0.1.0 (/home/robday/src/mntnlake/govtest)
warning: unused import: `std::thread`
 --> src/main.rs:8:5
  |
8 | use std::thread;
  |     ^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: unused variable: `rate_limiter2`
  --> src/main.rs:28:9
   |
28 |     let rate_limiter2 = rate_limiter.clone();
   |         ^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_rate_limiter2`
   |
   = note: `#[warn(unused_variables)]` on by default

warning: `govtest` (bin "govtest") generated 2 warnings (run `cargo fix --bin "govtest"` to apply 2 suggestions)
    Finished release [optimized] target(s) in 0.73s
     Running `target/release/govtest`
[src/main.rs:19] quota = Quota {
    max_burst: 1000000001,
    replenish_1_per: 0ns,
}

I hit this when attempting to disable rate-limiting by configuring a rate-limit of u32::MAX, so I have an easy workaround of configuring a rate-limit of one billion, which is much higher than I need - but I thought you might appreciate the bug report anyway!

antifuchs commented 12 months ago

Oh man, this is ... hilarious, thank you for reporting it. I'll try and think about what it'll take to work around the issue (but if you have more than one event per nanosecond, governor will likely not be fast enough for you 😂)!

antifuchs commented 12 months ago

That said, I recognize that you're looking for a way to disable the rate limiting logic - maybe it's possible to provide an API that simply returns a success for every element (inlined, to cost as little as possible); is that a thing you'd use there? (And if so, what would you use it for?)

rkday-pro commented 12 months ago

Essentially I'm using governor as part of a broader server framework, and I want it to be able to offer rate-limiting, but I want it to be permissive by default so people don't try the framework, see it starts reporting (rate-limiting) errors after xyz,000 messages, and assume it has performance problems.

Setting the default limit to one billion is perfectly suitable for that, now that I've figured out I need to do it (we don't have more than one event per nanosecond!) so I don't think you need to worry about additional APIs for this use-case.

Interpreting any value over one billion as one billion (or conversely, fixing up any 0ns intervals to the minimum of 1ns), maybe with a warning log saying so, would be a good fix here from my perspective.

antifuchs commented 11 months ago

OK, so taking a closer look at this, I can't actually reproduce the issue in tests (I added this in tests/direct.rs):

#[cfg(feature = "std")]
#[test]
fn stresstest_large_quotas() {
    use std::{sync::Arc, thread};

    use governor::middleware::StateInformationMiddleware;

    let quota = Quota::per_second(nonzero!(1_000_000_001u32));
    let rate_limiter =
        Arc::new(RateLimiter::direct(quota).with_middleware::<StateInformationMiddleware>());

    fn rlspin(rl: Arc<DefaultDirectRateLimiter<StateInformationMiddleware>>) {
        for _ in 0..100_000_000 {
            rl.check().map_err(|e| dbg!(e)).unwrap();
        }
    }

    let rate_limiter2 = rate_limiter.clone();
    thread::spawn(move || {
        rlspin(rate_limiter2);
    });
    rlspin(rate_limiter);
}

Reading the state store code, I also can't see where this would fail: a 0ns replenishment interval would mean the "expected arrival time" never moves forward, which should mean anything being checked would pass... and it does, in this test!

antifuchs commented 11 months ago

~Uh, tell a lie: It's failing on nightly but not on stable: https://github.com/antifuchs/governor/actions/runs/6198549159/job/16829195167?pr=207#step:7:258 - what version are you testing with?~

My mistake, they all fail/flake in CI (so... linux, vs. my personal dev machine which is an M1 max macbook pro); sooooo, is this a linux/macOS thing? What OS are you running these tests on?

antifuchs commented 11 months ago

Hah, ok, the quota was completely innocent, it was the GCRA constructor-from-quota that set tau to 0, which then results in a burst size of 0. I think I've got a fix in #207, which passes the test I posted above.

boinkor-net / governor

Unexpected rejections with rate limiters that allow more than a billion requests/second and are accessed from more than one thread #203