Closed cevich closed 5 years ago
I've seen nothing to suggest it is parallel testing. I have seen network flakes, registry flakes, and the like. Can we collect test failures to show this?
One thing I will agree on; parallelization has made bad tests stick out like a pimple. I stumbled on four tests today that never actually ran :(
@baude I suspected there were other issues, and am glad to hear you're on top of them. Unfortunately that taints the data-set :confused:
So it's an open issue for discussion. I think there are several distinct aspects:

- Gathering data on parallelism-induced failures is particularly challenging.
- Long-term maintenance of tests that run in parallel brings a heavy cost.

I'm not saying "no, we must never do this"; I just want to raise these concerns in the open. Those are the main things from my POV / experience, and I'm sure there are other downsides. We also have to weigh all that against the drastic speed improvement (multiplied by the amount of work the speedup enables).
Also, to be perfectly transparent: I'm heavily biased here: I'm fighting a cold and loath to go trawling through the data myself.
Some examples (my analysis could be wrong):
- `podman run -d mount cleanup test` (`test/e2e/run_cleanup_test.go`, `out1` and `out2`) will race with parallel tests that alter the host mount namespace.
- `podman system info json output`: racing the clock; the 1-second timeout is not always sufficient.
(will add more as I find them)
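For the timeout-flavored flakes, one general fix (a minimal sketch; `wait_for` is an invented helper, not something from the suite) is a bounded retry loop instead of a fixed 1-second wait, so a loaded parallel host doesn't turn slowness into a spurious failure:

```shell
# Hypothetical helper: retry a command until it succeeds or a deadline
# passes, instead of sleeping a fixed 1 second and hoping.
wait_for() {
    # wait_for <seconds> <command...>: re-run <command> until it succeeds;
    # give up and return 1 once <seconds> have elapsed.
    deadline=$(( $(date +%s) + $1 ))
    shift
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
}

# Illustrative use (not a real test line):
#   wait_for 5 podman system info --format json >/dev/null
```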
The second one is fixed upstream; in fact, both @mheon and I pushed fixes for it ;) Bad test.
@baude ok good. I couldn't find it yesterday, but I remember seeing one race with iptables. One test was modifying/checking them while a background test was manipulating them.
I guess my main concerns here are about the additional burden required to debug and maintain parallel tests. I've rarely seen a case of more complexity leading to more reliability, without also adding a lot more brain-strain along the way.
IMHO, it's advantageous on many levels, to keep tests as simple and idiot-proof as possible. I'd much rather increase the complexity of the automation, if sacrifices need to be made.
Along those lines, another thought I just had: it's "less" of a pain to have long-running tests as part of a final-draft, ready-for-review PR. What if I added a test-control mechanism, whereby during WIP/Draft status the author can select and focus on a specific set of tests for speed?
It could work like this:
Some ideas of magic:
eh?
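To make the "magic" slightly more concrete, here's a minimal sketch of what such a control mechanism might look like. Everything here is invented for illustration: the `[CI:ONLY=<pattern>]` marker, the `parse_focus` helper, and the commented-out ginkgo invocations are assumptions, not existing features.

```shell
# Hypothetical sketch: a CI script scans the HEAD commit message for an
# invented "[CI:ONLY=<pattern>]" marker; if present (i.e. the PR author
# opted in during WIP/Draft), only matching specs are run.
parse_focus() {
    # Print the pattern inside "[CI:ONLY=<pattern>]", or nothing at all.
    printf '%s\n' "$1" | sed -n 's/.*\[CI:ONLY=\([^]]*\)\].*/\1/p'
}

msg=$(git log -1 --format=%B 2>/dev/null || true)
focus=$(parse_focus "$msg")

if [ -n "$focus" ]; then
    echo "WIP mode: running only specs matching '$focus'"
    # ginkgo -focus "$focus" ./test/e2e/   # illustrative invocation
else
    echo "Running the full test suite"
    # ginkgo ./test/e2e/                   # illustrative invocation
fi
```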
Here's another iptables one (not sure if it's the same one I saw before)
`running [/usr/sbin/iptables -t filter -N CNI-FORWARD --wait]: exit status 1: iptables: Chain already exists.`
That sounds like it could be a concurrency thing, though out of CNI, not us.
Thought: did anyone ask CNI if they're capable of being as multiprocess as we want in Podman? If their locking isn't up to par it could be problematic. On the other hand, it hasn't bitten us so far...
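If CNI's locking can't be relied on, one stopgap on our side (purely a sketch; `with_iptables_lock` and `ensure_chain` are invented names) is to serialize the iptables-touching parts of tests ourselves with a file lock, and to tolerate the non-idempotent `-N`:

```shell
# Hypothetical sketch: serialize access to host-global resources
# (iptables chains, CNI state) across parallel test processes using
# flock(1), so only one process manipulates them at a time.
LOCKFILE="${TMPDIR:-/tmp}/e2e-iptables.lock"

with_iptables_lock() {
    # Block until the lock file is free, then run the given command
    # while holding the lock. flock creates the file if needed.
    flock "$LOCKFILE" "$@"
}

# "iptables -N" fails if the chain already exists, so a second creator
# racing us must not be treated as an error.
ensure_chain() {
    with_iptables_lock iptables --wait -t filter -N "$1" 2>/dev/null || true
}
```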
[BeforeEach]
Seeing error: `Error: error adding firewall rules for container. exit status 1: iptables: Chain already exists.` Similar to `podman run passwd` (above).
Closing, but will refer back to this issue for related "I told you so" opportunities :grin:
/kind bug
From a semi-related pull-request @mheon said:
FWIW: I'm pretty sure this is due to races on critical shared resources (iptables, CNI, etc.) which aren't properly locked for parallel testing. It loosely coincides with enabling 3-way parallel testing (among other problems), so it's been hard for me to gather hard evidence :frowning_face: Let's call it a "strong hunch" based on my years of experience in dangerous situations :grin:
I'm sensitive that slowing down testing is also undesirable, and I know @baude worked hard on enabling this (among fixing many other test problems).
My only solution is to remove just the `-nodes 3` part, until we can work out a way of splitting the tests across a few new VMs instead. That's guaranteed never to race, and we can be lazy about resource-locking (making tests easier to write and maintain).
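A sketch of what that VM split might look like. All the names here are invented for illustration; a real CI setup would set `VM_INDEX` per machine:

```shell
# Hypothetical sketch: instead of "ginkgo -nodes 3" on one host, give each
# of N VMs a deterministic, disjoint slice of the test files. VMs share no
# host-global state, so iptables/CNI races disappear by construction.
VM_COUNT=3                 # illustrative: total number of VMs
VM_INDEX="${VM_INDEX:-0}"  # 0-based index of this VM, set by the CI system

select_my_tests() {
    # Round-robin the given test files across VM_COUNT machines and
    # print only the ones assigned to this VM_INDEX.
    i=0
    for f in "$@"; do
        if [ $(( i % VM_COUNT )) -eq "$VM_INDEX" ]; then
            printf '%s\n' "$f"
        fi
        i=$(( i + 1 ))
    done
}

# Illustrative use: select_my_tests test/e2e/*_test.go | xargs ginkgo
```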