containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0
23.35k stars 2.38k forks source link

Unreliable Integration tests #2649

Closed cevich closed 5 years ago

cevich commented 5 years ago

/kind bug

From a semi-related pull-request @mheon said:

a lot more flakes popping up recently

FWIW: I'm pretty sure this is due to races on critical shared resources (iptables, CNI, etc.) which aren't properly locked for parallel testing. It loosely coincides with enabling 3-way parallel testing (among other problems) - so it's been hard for me to gather hard-evidence :frowning_face: Let's call it a "strong hunch" based on my years of experience in dangerous situations :grin:

I'm sensitive that slowing down testing is also undesirable, and I know @baude worked hard on enabling this (among fixing many other test problems).

My only solution is to remove just the -nodes 3 part, until we can work out a way of splitting up tests across a few new VMs instead. That's guaranteed to never race, and we can be lazy about resource-locking (making test writing/maintaining easier).

baude commented 5 years ago

I've seen nothing to suggest it is parallel testing. I have seen network flakes, registry flakes, and the like. Can we collect test failures to show this?

One thing I will agree on; parallelization has made bad tests stick out like a pimple. I stumbled on four tests today that never actually ran :(

cevich commented 5 years ago

@baude I suspected there were other issues, and am glad to hear your on top of them. Unfortunately that taints the data-set :confused:

So it's an open issue for discussion, I think there are several distinct aspects:

I'm not saying 'no' we must not do this, just want to raise concerns in the open. Those are the main things from my POV / experience, I'm sure there are other downsides. We also have to weigh all that vs the drastic speed improvement (multiplied by the amount of work speedup it enables).

cevich commented 5 years ago

Also to be perfectly transparent: I'm heavily bias: fighting a cold and loath to go trolling through the data myself.

cevich commented 5 years ago

Some examples (my analysis could be wrong):

(will add more as I find them)

baude commented 5 years ago

the second one is fixed upstream; in fact, both @mheon and i pushed fixes for it ;) bad test

cevich commented 5 years ago

@baude ok good. I couldn't find it yesterday, but I remember seeing one race with iptables. One test was modifying/checking them while a background test was manipulating them.

I guess my main concerns here are about the additional burden required to debug and maintain parallel tests. I've rarely seen a case of more complexity leading to more reliability, without also adding a lot more brain-strain along the way.

IMHO, it's advantageous on many levels, to keep tests as simple and idiot-proof as possible. I'd much rather increase the complexity of the automation, if sacrifices need to be made.

Along those lines: Another thought I just had: It's "less" of a pain to have long-running tests as part of a final-draft, ready-for-review PR. What if I added a test-control mechanism, whereby during WIP/Draft status, the author can select and focus on a specific set of tests for speedyness?

It could work like this:

Some ideas of magic:

eh?

cevich commented 5 years ago

Here's another iptables one (not sure if it's the same one I saw before)

mheon commented 5 years ago

That sounds like it could be a concurrency thing, though out of CNI, not us.

Thought: did anyone ask CNI if they're capable of being as multiprocess as we want in Podman? If their locking isn't up to par it could be problematic. On the other hand, it hasn't bitten us so far...

cevich commented 5 years ago

Similar to Podman run passwd (above)

cevich commented 5 years ago

---resented---

cevich commented 5 years ago

Closing, but will refer back to this issue in lieu of related "I told you so" opotrunities :grin: