cedar-policy / cedar

Implementation of the Cedar Policy Language
https://www.cedarpolicy.com
Apache License 2.0

Improve performance for attribute access error on large records #754

Closed by john-h-kastner-aws 6 months ago

john-h-kastner-aws commented 8 months ago

Describe the improvement you'd like to request

Attempting to access a non-existent attribute on a large record causes unexpectedly high latency for authorization requests.

In the following benchmarks, I construct a context record containing 100,000 attributes and then evaluate two policies: one that accesses an existing attribute and another that accesses an attribute that does not exist, which results in an evaluation error. The error case has much worse performance, taking over a millisecond while the non-error case is measured in nanoseconds.

is_authorized large_context_record/get-attr
                        time:   [739.76 ns 745.21 ns 752.30 ns]
                        change: [-12.561% -11.919% -11.215%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  8 (8.00%) high severe

Benchmarking is_authorized large_context_record/get-attr err: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, enable flat sampling, or reduce sample count to 50.
is_authorized large_context_record/get-attr err
                        time:   [1.8045 ms 1.8104 ms 1.8164 ms]
                        change: [-1.7403% -1.1954% -0.6567%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

This behavior is due to the error case rebuilding the context record as a Vec of its keys on this line:

https://github.com/cedar-policy/cedar/blob/57d3463937aafd907ea22602bf85dd63e8764d14/cedar-policy-core/src/evaluator.rs#L774
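For illustration, a stand-alone sketch of that pattern (std types and names are placeholders here, not the actual evaluator.rs code):

use std::collections::BTreeMap;

// Stand-in for the evaluator's record lookup: on a miss, every key of the
// (possibly very large) record is cloned into a Vec just to populate the error.
fn get_attr<'a>(record: &'a BTreeMap<String, bool>, attr: &str) -> Result<&'a bool, Vec<String>> {
    record
        .get(attr)
        // O(record size) allocation and cloning on every failed lookup
        .ok_or_else(|| record.keys().cloned().collect::<Vec<_>>())
}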

Replacing this line with vec![] and re-running the benchmarks shows much better times, but we would like to keep the nicer error message if possible.

is_authorized large_context_record/get-attr
                        time:   [733.44 ns 733.63 ns 733.92 ns]
                        change: [-2.3506% -1.5131% -0.8134%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
is_authorized large_context_record/get-attr err
                        time:   [849.06 ns 855.66 ns 869.37 ns]
                        change: [-99.953% -99.953% -99.953%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

Benchmark used:

use std::str::FromStr;

use cedar_policy::{Authorizer, Context, Entities, PolicySet, Request, RestrictedExpression};
use criterion::{black_box, Criterion};

const LARGE_SIZE: usize = 100000;
pub fn large_context_record(c: &mut Criterion) {
    let context = Context::from_pairs(
        (1..=LARGE_SIZE)
            .into_iter()
            .map(|i| (format!("a{i}"), RestrictedExpression::new_bool(true)))
            .collect::<Vec<_>>(),
    )
    .unwrap();
    let req = Request::new(None, None, None, context, None).unwrap();
    let entities = Entities::empty();
    let auth = Authorizer::new();

    let mut group = c.benchmark_group("is_authorized large_context_record");

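    // Policy that reads an attribute present in the context (non-error path).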
    let policy = PolicySet::from_str(
        r#"
        permit( principal, action, resource)
        when { context.a1 };
    "#,
    )
    .unwrap();
    group.bench_function("get-attr", |b| {
        b.iter(|| auth.is_authorized(black_box(&req), black_box(&policy), black_box(&entities)))
    });

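    // Policy that reads an attribute missing from the context, triggering the evaluation error.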
    let policy = PolicySet::from_str(
        r#"
        permit( principal, action, resource)
        when { context.other };
    "#,
    )
    .unwrap();
    group.bench_function("get-attr err", |b| {
        b.iter(|| auth.is_authorized(black_box(&req), black_box(&policy), black_box(&entities)))
    });

    group.finish();
}
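
For completeness, the benchmark function would be registered with Criterion in the usual way (assuming a [[bench]] target with harness = false):

// Criterion wiring in the same bench file.
use criterion::{criterion_group, criterion_main};

criterion_group!(benches, large_context_record);
criterion_main!(benches);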

Describe alternatives you've considered

No response

Additional context

No response

Is this something that you'd be interested in working on?

cdisselkoen commented 6 months ago

I suggest we store only the first 5 (or some other N) attribute names in the available_attrs field of the error, maybe with a usize indicating how many total attributes there were. This is possibly a breaking change, so we could make it for 4.0 (along with the other error restructuring in #745).
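
A minimal sketch of that shape, assuming illustrative type and field names rather than the actual error definition:

// Illustrative only: keep a bounded sample of attribute names plus the total count,
// so constructing the error stays O(N) rather than O(record size).
const ATTR_NAME_LIMIT: usize = 5;

pub struct RecordAttrDoesNotExistError {
    attr: String,
    /// At most ATTR_NAME_LIMIT of the record's attribute names.
    available_attrs: Vec<String>,
    /// Total number of attributes the record actually had.
    total_attrs: usize,
}

impl RecordAttrDoesNotExistError {
    fn new<'a>(attr: &str, keys: impl ExactSizeIterator<Item = &'a String>) -> Self {
        let total_attrs = keys.len();
        Self {
            attr: attr.to_string(),
            available_attrs: keys.take(ATTR_NAME_LIMIT).cloned().collect(),
            total_attrs,
        }
    }
}

The error message can then still list a few of the available attributes while reporting how many were omitted.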

john-h-kastner-aws commented 6 months ago

Resolved by #887