plombardi89 commented 6 years ago

The current E2E test harness has some issues we would like to address:

It does not execute in parallel.
The test harness itself is a hard to debug assembly of shell scripts.
Has its own share of flakiness issues (e.g. recently we noticed it failing but still reporting status code 0 and ultimately succeeding).
It's slow enough that it doesn't run on anything except RC builds.

With the E2E suite we want to design and implement something that will:

Execute quickly enough not to cause agony (understanding that E2E tests are slower than fail-fast system tests)
Easy to debug and extend the harness itself.
Easy to add new positive and negative test cases as needed.
Be mostly or entirely reusable when / if we decide to upgrade Ambassador to use Envoy v2 config or the Envoy ADS.

Design CoS

[ ] Discuss and document the current issues with the Ambassador E2E suite. Documentation can be appended to this issue.
[ ] Write up thoughts around a replacement or alternative to the E2E approach.
[ ] Design and prototype a new E2E suite. The person designing should at a minimum discuss the design with one other party before further implementation occurs.
[ ] Ensure the design has clear answers to the above four bullet points.

Implementation CoS

[ ] The new E2E suite should execute tests in parallel.
[ ] The new E2E suite should have all the current E2E tests ported into it.
[ ] The new E2E suite should be able to be run locally or in CI in the same way.
[ ] Documentation for how to extend or debug the harness.
[ ] Documentation for how to implement additional test cases.
[ ] Should be possible to selectively run some E2E tests.

Additional Nice To Haves

[ ] It would be nice if the E2E provisioning machinery could use Forge so we continue to eat our own dogfood.

kflynn commented 6 years ago

OK, here we go!

Ambassador Testing

Like most real-world software, Ambassador comprises multiple functional units that work together to get a job done, and we need to test all of them to have confidence that development is improving the product, rather than breaking it. Of course, not all the units can be tested the same way, which makes covering them entertainingly tricky.

Functional Units

At a very high level, Ambassador's functional units are:

The translation engine takes an Ambassador configuration and spits out an Envoy configuration. It mostly lives in ambassador/ambassador/config.py.
The Kubernetes interface watches Kubernetes resources to find an Ambassador configuration to hand to the translation engine. It mostly lives in ambassador/kubewatch.py, except for some bits that live in ambassador_diag/ambassador_diag/diagd.py.
The Envoy interface keeps track of whether Envoy is happy (ambassador/ambassador_diag/diagd.py) and manages restarting Envoy on config changes (ambassador/hot-restarter.py, ambassador/start-envoy.sh, and ambassador/kubewatch.py).
Finally, the diagnostic service shows Ambassador users a nice human-readable view into what Ambassador thinks is going on (ambassador/ambassador_diag/diagd.py, but it relies heavily on ambassador/ambassador/config.py).

Current Testing

Ambassador currently has two different kinds of tests:

Tests in ambassador/tests test the translation engine, which is pretty straightforward:
- load an Ambassador config from the filesystem
- have the translation engine write an Envoy config out to the filesystem
- see if it matches what we got last time
These tests use pytest; the drivers are in ambassador/tests/ambassador_test.py and ambassador/tests/corner_case_test.py.
Tests in end-to-end currently try to test the Kubernetes interface and the Envoy interface. More about these in a moment.
The diagnostic service is partly covered by the end-to-end tests, but its UI is basically untested right now. This is a problem, but not our immediate concern.

End-To-End: Current State

Right now we actually spin up a cluster and pump traffic through the Envoy that Ambassador has configured, but this conflates multiple kinds of tests:

We are actually testing Envoy's functionality.

This is actually kind of silly. We need to test our understanding of how to configure Envoy for a given task while we bring up new features, but once that's done, we probably can rely on Lyft for confidence that Envoy works.
We are testing Ambassador Kubernetes interface.

This isn't silly at all. It's not that hard to break kubewatch, for example, and we need to know if Ambassador suddenly starts taking twice as long to notice a change.
We are testing Ambassador's Envoy interface.

This isn't exactly silly (we do need to know that the restarter will restart Envoy when appropriate) but this isn't code that changes often, either: it's probably overkill to test it on every commit.
We are testing Ambassador's runtime stability (e.g. is it filling up the disk? leaking memory? spamming the logs?).

This also isn't silly, but it's probably not necessary on every commit, either. There's a difference between a soak test and a functional test.

An additional problem with the end-to-end tests is that they're written in this awful mix of shell and Python, using a bunch of shell utilities that are, shall we say, perhaps not the most graceful bits of code around. They end up duplicating a certain amount of Forge's functionality, so really we should just use Forge for those bits.

End-To-End: Looking Forward

It's worth calling out a few things that are important now, or will become important in the near future:

Spinning up a cluster is really slow.

Whatever we do with Kubernetes, we need to be able to do most or all of it without deleting and recreating whole clusters.
ADS is coming.

Whatever we do with Kubernetes, it needs to be amenable to a world where we're just feeding Kube events into something very different from what we have now. The initial ADS cut will almost certainly be a process that accepts a stream of incoming Ambassador configuration elements and performs ADS calls into a running envoy.
Hey wait a minute, we don't have to wait for ADS to shift to that model.

If you look at kubewatch.py and squint, you see that the bits that accept Kube events and translate them into Ambassador configuration elements are already orthogonal to the bits that manage generating the Envoy config and performing the hot restart. Formalizing that separation is a small step.

End-to-End Plan of Attack

Separate kubewatch explicitly into two pieces (I'm thinking of these as processes right now, but they could as easily be threads in a single process):
- A new configmgr process will:
  - start with an empty Ambassador configuration
  - accept requests to either
    - update a config element (argument is an Ambassador config resource)
    - delete a config element
  - generate a new Envoy config and restart envoy as appropriate
- kubewatch will change to:
  - watch various Kubernetes resources
  - generate Ambassador config resources from them
    - usually this will be pulling the resource from an annotation
    - might be a bit more involved for e.g. secrets
  - send updates to configmgr
Once that's done, we can test kubewatch pretty easily:
- start a Kubernetes cluster
- start kubewatch
- make a lot of service changes, etc.
- observe that kubewatch saw them all and gave them to configmgr correctly
(This is probably most simply done by starting real Ambassador, but writing some sort of dump request into configmgr.)
We can also test configmgr pretty easily, without Kubernetes:
- just run configmgr standalone
- force-feed it a bunch of events
  - we'll need to be able to capture these from kubewatch
- see if configmgr generates the IR that we expect

Adding new kubewatch tests will probably be a touch annoying, still, but it's not clear how often we'll need to do that: there isn't a combinatorial effect here, since kubewatch has a really simple job.

Adding new configmgr tests will require generating a new event stream, which could kind of suck unless we build some reasonable UX into Ambassador to grab real-world stuff. More discussion here is likely going to be relevant. The actual mechanics of adding the test, though, will be a matter of "dump a couple of text files into a new directory", and that will hopefully simplify things.

kflynn commented 6 years ago

@plombardi89 @rhs We should look over this next week and make sure that we all agree that the CoS have been met.

rhs commented 6 years ago

Phil, Flynn, and I had a meeting to discuss this issue and came up with the following items and areas to explore:

tactical speedup of the existing e2e test suite via parallelization
make it easy to plug a different version of envoy into the test suite (move envoy into it's own container)
drive translation unit tests in a way that captures behavior in a manner that can be tested both e2e and with diffs
streamline developer experience around above item
separate dynamic tests (behavior is correct around configuration changes) from static tests (behavior for a given input is correct in a steady state)
explore a way to run static tests super fast (e.g. not much longer than the number of curls needed to poke the mapping permutations)

We agreed that the above items represent a good summary/breakdown of what is discussed in this issue. The plan is to create separate issues for each of these items, better define them in their own issues, and then close off this issue when the new issues exist.

@plombardi89, @kflynn please shout if this summary is inaccurate or incomplete.

plombardi89 commented 6 years ago

Looks accurate to me. Attaching Epic label.

kflynn commented 6 years ago

Thinking a bit about how we actually get this done, note that we have four major things we want to do that are all interrelated:

make E2E faster and more easily extensible
make the IR easier to work with when adding new features
switch to V2 config
switch to ADS

There are some more minor things that play into this, too:

use arbitrary secrets for TLS certificates
watch changing secrets and update TLS certificates without needing a restart

I'm not going to recap the reasons why we want all these things, but the dependencies between bits are very relevant:

The IR changes are pretty much independent of everything else.
There is tactical E2E work that doesn't depend on anything else.
ADS, TLS-secret work, and the longer-term E2E work all depend on splitting up kubewatch as described my earlier comment.
ADS depends on V2.
V2 will be very hard without some work on E2E (and probably unit tests too).

So my suggestion for implementation:

Tactical E2E stuff. This is already mostly done: it splits E2E into a serial section and a parallel section, and runs the tests in the parallel section in parallel. This should actually help quite a bit.
Split kubewatch and configmgr as described in my earlier comment:
- in addition to the obvious code changes, this involves some significant test changes:
  - for every E2E test, we can generate one or more snapshots by feeding the K8s resources into kubewatch
  - for each snapshot, we can generate an Envoy config and an IR dump by feeding the snapshot into configmgr
  - for each Envoy config and IR dump, we can verify that it matches the gold files the E2E tests already have
  - we can vet this entire process by running the actual functional E2E test
- This is a great point to formalize the process of taking K8s resources and capturing them for testing.
  - we may need some diagd help for this.
After the kubewatch/configmgr split, there are a few different things we can do:
- Split E2E and Envoy tests:
  - E2E tests can run often, and work by running the K8s resource => snapshot => Envoy config pipeline above, without a Kube cluster or Envoy involved.
  - Envoy tests act like our current E2E tests.
- More TLS-secret work:
  - move secret handling entirely into kubewatch, and let configmgr look only at snapshots that just include certs as synthesized resources
    - in theory, the existing snapshots/gold file should provide 100% validation of this change
  - allow kubewatch to watch for updates, and update snapshots with them
    - this will require adding new tests, of course
- V2:
  - teach config.py how to translate snapshots into either V1 or V2
    - in theory, this is easy, since it's all post-IR work
  - validate V2 configs by:
    - comparing them to translated V1 configs
    - in theory, the V1->V2 config can be done purely mechanically, by code that is not part of config.py
    - comparing them to config_dump output from Envoy
    - at present, config_dump only covers part of the config, so this isn't good enough by itself, but it's a valuable assist
    - running the Envoy tests with the generated V2 configs
  Note that we could do V2 support without the E2E work, but having the E2E work done will make it quite a bit easier to test the V2 work, since we'll be able to focus on individual areas more effectively.
IR happens... somewhere? maybe in my copious spare time?
- The important thing here is that it should be able to happen on a branch by itself, without messing up everyone else.

plombardi89 commented 6 years ago

The debugability of the E2E suite is brutal. I have no idea why at times something failed or what it failed for. The mixture of error logging and debug logging that doesn't indicate which is which doesn't help.

For example is this an error, warning or debug? Some pods have yet to start?

kflynn commented 6 years ago

@plombardi89 Without disagreeing with you, the exit code is the gold standard. Some pods have yet to start? should be followed immediately by exiting with status 1, as it is indeed an error. Is that not what happens?

plombardi89 commented 6 years ago

try 21: 1 not running
try 20: 1 not running
try 19: 1 not running
try 18: 1 not running
try 17: 1 not running
try 16: 1 not running
try 15: 1 not running
try 14: 1 not running
try 13: 1 not running
try 12: 1 not running
try 11: 1 not running
try 10: 1 not running
try 09: 1 not running
try 08: 1 not running
try 07: 1 not running
try 06: 1 not running
try 05: 1 not running
try 04: 1 not running
try 03: 1 not running
try 02: 1 not running
try 01: 1 not running
Some pods have yet to start?
================ end captured output

================================================================
1: 005-single-namespace...

plombardi89 commented 6 years ago

and it keeps going and going...

kflynn commented 6 years ago

Notes from the meeting we're in the middle of:

Part of the development process for Ambassador is, of course, verifying that the Envoy configurations produced by Ambassador for a new feature or bugfix actually cause Envoy to behave as intended. An important part of the the Brave New World described here is that the actual Ambassador configurations (as they appear in Kubernetes resources) used for this functional testing become unit tests, in every case, so that revalidating the behavior when we change versions of Envoy is automated.
Note also that the split between kubewatch and configmgr is meant to speed up tests (by not requiring us to run everything in Kubernetes all the time) and therefore to permit broader coverage (since we can run more input permutations in the same amount of time).
We're planning to do V2 before ADS, because
- V2 + hot restart still unblocks us on V2-only features;
- We probably want to test ADS quite a lot;
- We need to put some more thought into ADS architecture.
- We may be able to start testing a simple ADS implementation with a hand-coded V2 config before the configmgr can generate V2.
  - (Flynn needs to think about how to mock stuff to permit this.)
Running kubewatch and configmgr as two processes is probably less robust than splitting them into classes and having a single process wrangle things.
- (Realistically, we'll probably have classes for the Kube interface, the config management, and the Envoy interface.)

Next steps

Flynn to generate interface specs
- Data paths and state definition, rather than threads or processes
- Major components:
  - Kube interface
    - Watch changes in K8s
    - Extract Ambassador config objects
    - Publish Ambassador config objects
    - How exactly this happens is probably up to the consumer
  - Config to IR compiler
    - Handed a set of Ambassador config objects
    - Incremental or all at once
    - Maintains an IR
  - IR to Envoy compiler
    - Handed an IR
    - Builds an Envoy config
    - Must support V1 and V2, eventually
  - IR to Diagnostics
    - Handed an IR
    - Builds diag report
  - Envoy interface
    - Handed an Envoy config
    - Makes Envoy use that config
    - Must support both hot restart and ADS, eventually
  - Probably worth talking about entry points too:
    - Some long-running process in Kubernetes
    - consumes all pieces
    - separate diag process? Probably no point now
    - CLIs for testing
    - definitely for config-to-IR, IR-to-diag, and IR-to-Envoy
    - probably for Kube i/f and Envoy i/f too
    - Others?
Shubham to generate example DevEx for working through dev process and having it magically be a regression test

concaf commented 6 years ago

I am putting some thoughts around DevEx at https://docs.google.com/document/d/1tynG8yldeIUdFXP80_1Jd1HHh39El2mwyLNvGVDQLDw/edit

Feel free to review and comment

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

emissary-ingress / emissary

Document and speed up E2E test harness #454

Design CoS

Implementation CoS

Additional Nice To Haves

Ambassador Testing

Functional Units

Current Testing

End-To-End: Current State

End-To-End: Looking Forward

End-to-End Plan of Attack

Next steps