DataDog / saluki

An experimental toolkit for building telemetry data planes in Rust.
Apache License 2.0

[APR-205] chore: allow for contexts to be expired from `ContextResolver` #225

Open tobz opened 2 weeks ago

tobz commented 2 weeks ago

Context

Work in progress.

pr-commenter[bot] commented 2 weeks ago

Regression Detector (DogStatsD)

Regression Detector Results

Run ID: 75da82cf-b1bb-42cb-8e8e-f612161a3ad1

Baseline: 7.55.2 Comparison: 7.55.3

Performance changes are noted in the perf column of each table:

No significant changes in experiment optimization goals

Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|------|------------|------|----------|-------------|--------|-------|
| ➖ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | +1.28 | [+1.12, +1.44] | 1 | |
| ➖ | dsd_uds_10mb_3k_contexts | ingress throughput | +0.03 | [+0.00, +0.07] | 1 | |
| ➖ | dsd_uds_500mb_3k_contexts | ingress throughput | +0.00 | [-0.01, +0.01] | 1 | |
| ➖ | dsd_uds_100mb_3k_contexts | ingress throughput | +0.00 | [-0.03, +0.03] | 1 | |
| ➖ | dsd_uds_1mb_3k_contexts | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
| ➖ | dsd_uds_100mb_250k_contexts | ingress throughput | -0.00 | [-0.02, +0.02] | 1 | |
| ➖ | dsd_uds_1mb_50k_contexts | ingress throughput | -0.02 | [-0.05, +0.02] | 1 | |
| ➖ | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | -0.02 | [-0.06, +0.02] | 1 | |
| ➖ | dsd_uds_512kb_3k_contexts | ingress throughput | -0.02 | [-0.06, +0.02] | 1 | |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:

1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that *if our statistical model is accurate*, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
3. Its configuration does not mark it "erratic".
pr-commenter[bot] commented 2 weeks ago

Regression Detector (Saluki)

Regression Detector Results

Run ID: 88e47a21-9e72-443c-8c89-3338872bb552

Baseline: c1acd462d9365f0c1ce55a5a2cc4db053ce91e47 Comparison: 539f1c9440c12d1a3e2daa72a869f91eda042e02

Performance changes are noted in the perf column of each table:

Significant changes in experiment optimization goals

Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|------|------------|------|----------|-------------|--------|-------|
| ❌ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | +19.49 | [+19.22, +19.75] | 1 | |
| ✅ | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | +13.00 | [+9.82, +16.18] | 1 | |
| ❌ | dsd_uds_100mb_250k_contexts | ingress throughput | -5.43 | [-5.92, -4.93] | 1 | |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|------|------------|------|----------|-------------|--------|-------|
| ❌ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | +19.49 | [+19.22, +19.75] | 1 | |
| ✅ | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | +13.00 | [+9.82, +16.18] | 1 | |
| ➖ | dsd_uds_1mb_3k_contexts | ingress throughput | +0.02 | [-0.01, +0.05] | 1 | |
| ➖ | dsd_uds_100mb_3k_contexts | ingress throughput | +0.01 | [-0.02, +0.03] | 1 | |
| ➖ | dsd_uds_50mb_10k_contexts_no_inlining_no_allocs | ingress throughput | +0.00 | [-0.03, +0.04] | 1 | |
| ➖ | dsd_uds_512kb_3k_contexts | ingress throughput | +0.00 | [-0.04, +0.04] | 1 | |
| ➖ | dsd_uds_1mb_50k_contexts | ingress throughput | -0.00 | [-0.00, +0.00] | 1 | |
| ➖ | dsd_uds_50mb_10k_contexts_no_inlining | ingress throughput | -0.00 | [-0.00, +0.00] | 1 | |
| ➖ | dsd_uds_10mb_3k_contexts | ingress throughput | -0.07 | [-0.12, -0.02] | 1 | |
| ➖ | dsd_uds_500mb_3k_contexts | ingress throughput | -1.43 | [-1.50, -1.36] | 1 | |
| ❌ | dsd_uds_100mb_250k_contexts | ingress throughput | -5.43 | [-5.92, -4.93] | 1 | |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:

1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that *if our statistical model is accurate*, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
3. Its configuration does not mark it "erratic".
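The decision rule described above can be sketched as a small function. This is a hypothetical illustration of the criteria only, not the detector's actual code:

```rust
// Decide whether an experiment result should be flagged as a significant
// change, per the three criteria above. The 5.00% tolerance and the CI bounds
// are passed in; `erratic` comes from the experiment's configuration.
fn is_significant(delta_mean_pct: f64, ci: (f64, f64), erratic: bool) -> bool {
    let big_enough = delta_mean_pct.abs() >= 5.0; // criterion 1: effect size tolerance
    let ci_excludes_zero = ci.0 > 0.0 || ci.1 < 0.0; // criterion 2: CI excludes zero
    !erratic && big_enough && ci_excludes_zero // criterion 3: not marked erratic
}

fn main() {
    // dsd_uds_100mb_3k_contexts_distributions_only: +19.49% [+19.22, +19.75] -> flagged
    assert!(is_significant(19.49, (19.22, 19.75), false));
    // dsd_uds_500mb_3k_contexts: -1.43% [-1.50, -1.36] -> below tolerance, not flagged
    assert!(!is_significant(-1.43, (-1.50, -1.36), false));
}
```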
pr-commenter[bot] commented 2 weeks ago

Regression Detector Links

Experiment Result Links

| experiment | link(s) |
|------------|---------|
| dsd_uds_100mb_250k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_100mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_100mb_3k_contexts_distributions_only | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_10mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_50k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_50k_contexts_memlimit | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_500mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_512kb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_50mb_10k_contexts_no_inlining (ADP only) | [Profiling (ADP)] [SMP Dashboard] |
| dsd_uds_50mb_10k_contexts_no_inlining_no_allocs (ADP only) | [Profiling (ADP)] [SMP Dashboard] |

tobz commented 5 days ago
tobz commented 5 days ago

Just to jot down some notes here..

The two biggest problems trace back to the two things we really want to be able to do:

- know, cheaply, when a context is no longer being used outside of the resolver, and
- only expire a context once it has gone unused for some period of time (i.e. a true TTL).

We can solve the first problem with `Arc<T>`-like semantics: track when no outstanding reference to a context exists (besides our own reference in the resolver) and then trigger removal of that context. But while that makes us very precise about expiration, we actually expire too fast, which means we spend gobs of time re-interning because we have to search through the interner again.

If we made the interner O(1)-esque, then this might not be a problem... but doing so would also mean that it would be far less bounded than it currently is.

Likewise, we can trivially solve the second problem by incrementally iterating over the resolved contexts, with sleeps in between. That isn't so much a true TTL as it is an inherent delay between a context becoming unused and being cleaned up. It also means we either need a scheme that allows crawling the list in chunks (which will need locking) or we have to crawl it in full, every time, which gets more and more expensive as the number of resolved contexts goes up... and it still isn't a true TTL.

I was trying to noodle around the idea of how to make the "signal that this context is now unused" bit super cheap, which would allow us to register it somewhere that could then try to do more of a true "has it been unused for more than X seconds?" check... but so far I haven't come up with something sufficiently simple and performant.