proposal: Add tracking latencies and flamegraphs in CI

oschaaf commented 4 years ago

Filing this issue to gauge interest in the idea.

Goal:

Add a means to track and persist latency numbers and perf visualizations like flamegraphs over time in CI. This would allow us to track how we're doing over time as well as have perf information at hand when a latency regression is observed.

Description:

Nighthawk uses a lightweight python-based framework for integration testing. This framework serves as a basis for writing NH's own benchmarks.

With a small amount of modification, this could be extended to:

make consumption very low friction in foreign code bases (like Envoy)
allow it to inject proxies between the client and the test server (for example, Envoy at a certain SHA)
scavenge tests from external locations

More details, and some concrete scripts for getting an idea of what this would look like, can be found here.

/cc @danzh2010 @htuch

htuch commented 4 years ago

I think even redline QPS would be an amazing contribution here, everything else proposed seems like gravy. +1000.

mattklein123 commented 4 years ago

See https://github.com/envoyproxy/envoy/issues/961. I desperately want this. This will require a lot of thought in terms of how to structure repeatable tests, but yes, we really need to do this.

antoniovicente commented 4 years ago

I don't know if this exists yet, but it would also be good to have a few relatively small benchmark scenarios that can be used for A/B comparison of performance after changes to data plane components, especially in cases where we expect some performance impact. Tracking performance data for the small benchmarks over time in a calibrated environment would be great.

oschaaf commented 4 years ago

I have started exploring this. Tracking progress here. @antoniovicente It might be good to take a look at test_benchmarks.py, to see if that allows enough flexibility. The idea is that consumers can specify their own locations where the suite should scavenge tests, which in turn can supply custom fixtures with custom Envoy configurations.

mattklein123 commented 4 years ago

cc @marcomagdy who is also interested in helping with this effort.

snowp commented 4 years ago

We'd also be interested in this, so let me know how I can help move this forward.

oschaaf commented 4 years ago

Update: a good part of this is in review over at https://github.com/envoyproxy/nighthawk/pull/337.

Nighthawk is eating its own dog food via a new CI task, and is dropping simple visualizations per test (example).

CPU profiles are also collected, but flamegraphing needs more work, as we need to consider the binaries and libraries involved in generating the profile to get sensible output.

antoniovicente commented 4 years ago

Any updates?

oschaaf commented 4 years ago

Well, I got sidetracked for a bit, but this has been happily test-driving in NH's own CI. So far so good.

For example, see the .html files in the artefacts of a recent PR. We could consider wiring up the current state in Envoy's CI as an MVP based on the docker-based flow (Nighthawk's CI runs with its locally produced binaries). This should be pretty doable, but I would appreciate help/guidance there.

Some important improvements that others have expressed interest in tackling are:

The current UI is limited to a direct listing of artefacts as offered by the CI env (CircleCI in NH).
There's no regression analysis / detection.

For more detailed status, see https://github.com/envoyproxy/nighthawk/tree/master/benchmarks#todos
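As a rough illustration of what the missing regression detection could look like, the sketch below compares a tail-latency percentile of a candidate run against a baseline run. This is an assumption-laden example and not part of Nighthawk: the function names, the nearest-rank percentile, and the 10% tolerance are all hypothetical choices.

```python
def percentile(samples, pct):
    """Nearest-rank percentile (integer pct in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    # Ceiling division with integers avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def detect_regression(baseline, candidate, pct=99, tolerance=0.10):
    """Flag a regression when the candidate percentile exceeds the baseline
    percentile by more than the given relative tolerance (hypothetical rule)."""
    base = percentile(baseline, pct)
    cand = percentile(candidate, pct)
    return {
        "baseline": base,
        "candidate": cand,
        "regressed": cand > base * (1 + tolerance),
    }
```

In practice a single threshold like this would be noisy on shared CI machines, so repeated runs or a calibrated environment would likely be needed before failing a build on the result.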

abaptiste commented 4 years ago

Hello Folks. We have a design doc for a framework that we'd like your comments on:

https://docs.google.com/document/d/14Iz8j--Mvb06QFB8RurtYlwmy657YbAVfqDr-jKgtaQ/edit#heading=h.grkfe6onmtgv

htuch commented 4 years ago

@abaptiste thanks. My super high-level comment is that as a developer and performance engineer (user story), I'd like to be able to have control over the benchmark execution environment. So, any framework should be capable of running 100% locally. It's fine to make it also available as a SaaS via buckets or e-mail, but I think we're limiting applicability if those are the only options.

mattklein123 commented 4 years ago

+1 I left a bunch of comments around this. I also want to make sure we have a clear post-MVP path for CI integration as IMO this is the thing we really want to unlock ASAP. Thank you for working on this!

abaptiste commented 4 years ago

Thank you for the comments. These are the major themes I've captured:

Define all JSON schemas using proto3 messages (this will be done as part of the MVP)
We need a better authentication mechanism
CI integration so that builds run nightly or upon master check-in
Long term storage of results so that we can chart the performance of prior builds
Ability to do performance runs complementing local development

If there are additional items I may have inadvertently missed or misunderstood, please let me know.

mattklein123 commented 4 years ago

@abaptiste that list LGTM and is also similar to our offline conversation. Thanks for working on this! This will be awesome.

abaptiste commented 3 years ago

I posted a separate doc based on the feedback from the initial review. Please feel free to take a look and comment.

htuch commented 3 years ago

@abaptiste the new doc LGTM, tagging @oschaaf @mattklein123 @antoniovicente @mum4k @pamorgan @snowp for comments/sign-off.

mattklein123 commented 3 years ago

I looked at the doc and at a high level it looks great to me. Very excited for this work!

oschaaf commented 3 years ago

Looks good to me!

gyohuangxin commented 2 years ago

Any updates? We are interested in the integration. Is there any help I can offer?

mum4k commented 2 years ago

Hi @gyohuangxin, we will gladly accept help. We expect to be able to staff this work in about 6 months, but I would gladly work with you in the meantime if you have the cycles. If you are able to help, it would be good to get in touch and discuss priorities and the direction. Are you on Envoy's Slack by any chance?

gyohuangxin commented 2 years ago

@mum4k Thank you! Yes, let's discuss on Slack.

keithmattix commented 1 month ago

What's the latest on this effort? This would be extremely beneficial.

mum4k commented 1 month ago

This effort has been de-staffed temporarily. If anyone wants to pick it up in the meantime, I will gladly transfer the latest state and guide and/or review code as desired.

envoyproxy / envoy

proposal: Add tracking latencies and flamegraphs in CI #11266