feat: Add variants metrics to SDK

thomasheartman commented 1 year ago

This PR adds support for variant metrics to the Rust SDK. It's about 95 lines of implementation and 150 lines of tests.

Variant metrics should be counted whenever get_variant is called:

if a feature doesn't have variants, register the disabled variant
if a feature is disabled, register the disabled variant
if a feature doesn't exist, it should still register the disabled variant.
otherwise, register the variant that was returned

Discussion points

I've implemented the variant metrics on each feature instance as a DashMap (and thus also added a new crate). Currently, each feature will get a new DashMap initialized. It gets initialized with a capacity of 0. It may be worth making the DashMap an Option and only initializing it when it's requested, but that might require a fair bit of trickery (and suggestions on how to do it would be very welcome).

Tests

I've added tests to ensure that metrics are being counted and that the conversion from CachedFeature to ToggleMetrics works as expected. I've added new tests for these, but the tests for variant counting could be added to the existing get_variant tests. However, I felt that would muddy the tests, so I preferred creating new tests.

I have not tested the serde serialization output, because I don't think that would add much value. But I'd be happy to put in a test for that too if you think it would be useful.

Benches

Of course, benchmarks are a fickle thing, but I figured they'd be useful to make sure there's no giant blunders. Running it locally on my machine against main, I found that most benchmarks have no change or it is within the noise threshold. Three benches indicated performance improved, and three indicated that it had regressed.

Most of the improvements and regressions are =< ~6%, so I don't know how much stock to put in them. However, one regression, batch/parallel unknown-features(str) reported a throughput decrease of 24%, so that may be significant.

Performance improvements

While I don't see how this would improve performance in any way, the following benches were reported as improved:

single_call/single_call(str)
                        time:   [253.58 ns 254.46 ns 255.51 ns]
                        thrpt:  [3.9137 Melem/s 3.9299 Melem/s 3.9435 Melem/s]
                 change:
                        time:   [-3.5225% -2.9598% -2.3842%] (p = 0.00 < 0.05)
                        thrpt:  [+2.4425% +3.0501% +3.6511%]
                        Performance has improved.

batch/parallel distinct-features(enum)
                        time:   [24.728 ms 24.762 ms 24.838 ms]
                        thrpt:  [16.104 Melem/s 16.154 Melem/s 16.176 Melem/s]
                 change:
                        time:   [-6.5260% -5.8273% -5.1205%] (p = 0.00 < 0.05)
                        thrpt:  [+5.3969% +6.1878% +6.9816%]
                        Performance has improved.

batch/parallel unknown-features(enum)
                        time:   [2.9772 ms 2.9899 ms 3.0029 ms]
                        thrpt:  [133.20 Melem/s 133.79 Melem/s 134.35 Melem/s]
                 change:
                        time:   [-6.5641% -6.0584% -5.5674%] (p = 0.00 < 0.05)
                        thrpt:  [+5.8956% +6.4491% +7.0252%]
                        Performance has improved.

Performance regressions

The following regressions were reported:

batch/single thread(str)
                        time:   [12.815 ms 12.892 ms 12.974 ms]
                        thrpt:  [3.8537 Melem/s 3.8783 Melem/s 3.9016 Melem/s]
                 change:
                        time:   [+1.3055% +2.0763% +2.8471%] (p = 0.00 < 0.05)
                        thrpt:  [-2.7683% -2.0341% -1.2887%]
                        Performance has regressed.

batch/parallel same-feature(str)
                        time:   [28.019 ms 28.085 ms 28.170 ms]
                        thrpt:  [14.199 Melem/s 14.242 Melem/s 14.276 Melem/s]
                 change:
                        time:   [+1.2655% +1.8355% +2.4361%] (p = 0.00 < 0.05)
                        thrpt:  [-2.3782% -1.8025% -1.2497%]
                        Performance has regressed.

batch/parallel unknown-features(str)
                        time:   [4.2516 ms 4.2622 ms 4.2794 ms]
                        thrpt:  [93.470 Melem/s 93.849 Melem/s 94.082 Melem/s]
                 change:
                        time:   [+30.189% +31.927% +33.541%] (p = 0.00 < 0.05)
                        thrpt:  [-25.116% -24.200% -23.188%]
                        Performance has regressed.

sighphyre commented 1 year ago

Clippy is being your frenemy here, pretty sure it's picking up things that previous versions of Clippy didn't. Might have to fix those before we merge but I like it!

thomasheartman commented 1 year ago

@sighphyre As mentioned in the comment, I do think that the one significant regression may be related to dashmap initialization. I'll see if I can figure a way around it and if it changes anything.

As for clippy: yeah, I got those lints locally as well, but I didn't want to touch them because those lines were unrelated to the PR. Happy to fix them up, though!

thomasheartman commented 1 year ago

Edit: I was slightly off earlier. When you check is_enabled_str for a feature that doesn't exist, that's when we initialize a dashmap. That's gonna be harder to get around.

sighphyre commented 1 year ago

@sighphyre As mentioned in the comment, I do think that the one significant regression may be related to dashmap initialization. I'll see if I can figure a way around it and if it changes anything.

As for clippy: yeah, I got those lints locally as well, but I didn't want to touch them because those lines were unrelated to the PR. Happy to fix them up, though!

I think you might have to - we block the build on this one if Clippy complains

rbtcollins commented 1 year ago

Hey, so a few thoughts.

Firstly, great to see work coming in! Forgive my commenting just on the description, its all the time I have today.

I presume that this is not running conformance tests - we kind of need to fix up https://github.com/Unleash/unleash-client-rust/pull/42 and merge it and then repeat for the other commits to the conformance test suite.

Being up to date with that suite is really important to me because it lets us see that the official API works with new features, and I'd really encourage work on that before new features - because we have to manually test all new features and can't lean on the conformance test suite as much as we might otherwise.

Secondly, on dashmap - I need to look at the code I think. I think a rwlock in every get_variant call is a bit much - the basic ideal I think we're aiming for is 'smart if's', which means (to me) that it shouldn't matter how often unleash is called into, its not meaningfully worse than using 'if some_manual_feature_constant'. In particular the cold path concerns me - if the API is down, or someone deletes the toggle from the API, we shouldn't suddenly slow things down. Admittedly this SDK has a lot of headroom, but thats only because we've paid attention to overheads all the way :>.

In terms of code alternatives, I think generating an efficient lock-free structure from the API document, putting in the RCU ARC, and then indexing into it should be very efficient, lock tree, and quite easy to code.

thomasheartman commented 1 year ago

Closing this in favor of #60

Unleash / unleash-client-rust