Closed cevich closed 5 years ago
I've seen nothing to suggest it is parallel testing. I have seen network flakes, registry flakes, and the like. Can we collect test failures to show this?
One thing I will agree on; parallelization has made bad tests stick out like a pimple. I stumbled on four tests today that never actually ran :(
@baude I suspected there were other issues, and am glad to hear you're on top of them. Unfortunately that taints the data-set :confused:
So it's an open issue for discussion. I think there are several distinct aspects:

- Gathering data on parallelism-induced failures is particularly challenging.
- Long-term maintenance of tests that run in parallel brings a heavy cost.

I'm not saying "no, we must never do this"; I just want to raise these concerns in the open. Those are the main things from my POV / experience, and I'm sure there are other downsides. We also have to weigh all that against the drastic speed improvement (multiplied by the amount of work the speedup enables).
Also, to be perfectly transparent: I'm heavily biased here: I'm fighting a cold and loath to go trawling through the data myself.
Some examples (my analysis could be wrong):
- `podman run -d mount cleanup test` (`test/e2e/run_cleanup_test.go`, `out1` and `out2`) will race with parallel tests that alter the host mount namespace.
- `podman system info json output`: racing the clock; the 1-second timeout is not always sufficient.
(will add more as I find them)
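For the timeout-flavored flakes, one general fix (a minimal sketch; `wait_for` is an invented helper, not something from the suite) is a bounded retry loop instead of a fixed 1-second wait, so a loaded parallel host doesn't turn slowness into a spurious failure:

```shell
# Hypothetical helper: retry a command until it succeeds or a deadline
# passes, instead of sleeping a fixed 1 second and hoping.
wait_for() {
    # wait_for <seconds> <command...>: re-run <command> until it succeeds;
    # give up and return 1 once <seconds> have elapsed.
    deadline=$(( $(date +%s) + $1 ))
    shift
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
}

# Illustrative use (not a real test line):
#   wait_for 5 podman system info --format json >/dev/null
```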
The second one is fixed upstream; in fact, both @mheon and I pushed fixes for it ;) Bad test.
@baude ok good. I couldn't find it yesterday, but I remember seeing one race with iptables. One test was modifying/checking them while a background test was manipulating them.
I guess my main concerns here are about the additional burden required to debug and maintain parallel tests. I've rarely seen a case of more complexity leading to more reliability, without also adding a lot more brain-strain along the way.
IMHO, it's advantageous on many levels, to keep tests as simple and idiot-proof as possible. I'd much rather increase the complexity of the automation, if sacrifices need to be made.
Along those lines, another thought I just had: it's "less" of a pain to have long-running tests as part of a final-draft, ready-for-review PR. What if I added a test-control mechanism, whereby during WIP/Draft status the author can select and focus on a specific set of tests for speed?
It could work like this:
Some ideas of magic:
eh?
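To make the "magic" slightly more concrete, here's a minimal sketch of what such a control mechanism might look like. Everything here is invented for illustration: the `[CI:ONLY=<pattern>]` marker, the `parse_focus` helper, and the commented-out ginkgo invocations are assumptions, not existing features.

```shell
# Hypothetical sketch: a CI script scans the HEAD commit message for an
# invented "[CI:ONLY=<pattern>]" marker; if present (i.e. the PR author
# opted in during WIP/Draft), only matching specs are run.
parse_focus() {
    # Print the pattern inside "[CI:ONLY=<pattern>]", or nothing at all.
    printf '%s\n' "$1" | sed -n 's/.*\[CI:ONLY=\([^]]*\)\].*/\1/p'
}

msg=$(git log -1 --format=%B 2>/dev/null || true)
focus=$(parse_focus "$msg")

if [ -n "$focus" ]; then
    echo "WIP mode: running only specs matching '$focus'"
    # ginkgo -focus "$focus" ./test/e2e/   # illustrative invocation
else
    echo "Running the full test suite"
    # ginkgo ./test/e2e/                   # illustrative invocation
fi
```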
Here's another iptables one (not sure if it's the same one I saw before)
`running [/usr/sbin/iptables -t filter -N CNI-FORWARD --wait]: exit status 1: iptables: Chain already exists.`
That sounds like it could be a concurrency thing, though out of CNI, not us.
Thought: did anyone ask CNI if they're capable of being as multiprocess as we want in Podman? If their locking isn't up to par it could be problematic. On the other hand, it hasn't bitten us so far...
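If CNI's locking can't be relied on, one stopgap on our side (purely a sketch; `with_iptables_lock` and `ensure_chain` are invented names) is to serialize the iptables-touching parts of tests ourselves with a file lock, and to tolerate the non-idempotent `-N`:

```shell
# Hypothetical sketch: serialize access to host-global resources
# (iptables chains, CNI state) across parallel test processes using
# flock(1), so only one process manipulates them at a time.
LOCKFILE="${TMPDIR:-/tmp}/e2e-iptables.lock"

with_iptables_lock() {
    # Block until the lock file is free, then run the given command
    # while holding the lock. flock creates the file if needed.
    flock "$LOCKFILE" "$@"
}

# "iptables -N" fails if the chain already exists, so a second creator
# racing us must not be treated as an error.
ensure_chain() {
    with_iptables_lock iptables --wait -t filter -N "$1" 2>/dev/null || true
}
```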
[BeforeEach]
Seeing error: `Error: error adding firewall rules for container. exit status 1: iptables: Chain already exists.` Similar to `podman run passwd` (above).
Closing, but will refer back to this issue for related "I told you so" opportunities :grin:
/kind bug
From a semi-related pull-request @mheon said:
FWIW: I'm pretty sure this is due to races on critical shared resources (iptables, CNI, etc.) which aren't properly locked for parallel testing. It loosely coincides with enabling 3-way parallel testing (among other problems), so it's been hard for me to gather hard evidence :frowning_face: Let's call it a "strong hunch" based on my years of experience in dangerous situations :grin:
I'm sensitive that slowing down testing is also undesirable, and I know @baude worked hard on enabling this (among fixing many other test problems).
My only solution is to remove just the `-nodes 3` part, until we can work out a way of splitting the tests across a few new VMs instead. That's guaranteed never to race, and we can be lazy about resource-locking (making tests easier to write and maintain).
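A sketch of what that VM split might look like. All the names here are invented for illustration; a real CI setup would set `VM_INDEX` per machine:

```shell
# Hypothetical sketch: instead of "ginkgo -nodes 3" on one host, give each
# of N VMs a deterministic, disjoint slice of the test files. VMs share no
# host-global state, so iptables/CNI races disappear by construction.
VM_COUNT=3                 # illustrative: total number of VMs
VM_INDEX="${VM_INDEX:-0}"  # 0-based index of this VM, set by the CI system

select_my_tests() {
    # Round-robin the given test files across VM_COUNT machines and
    # print only the ones assigned to this VM_INDEX.
    i=0
    for f in "$@"; do
        if [ $(( i % VM_COUNT )) -eq "$VM_INDEX" ]; then
            printf '%s\n' "$f"
        fi
        i=$(( i + 1 ))
    done
}

# Illustrative use: select_my_tests test/e2e/*_test.go | xargs ginkgo
```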