Spurious failures in CI

frozencemetery commented 5 years ago

To my eye these look mostly to be on the rawhide builders, but I'm told that's accidental. There's a port collision when setting up the test suite - we may try to use an already-in-use port and get sporadic failures.

npmccallum commented 5 years ago

Let me explain how this works and then the reason for the problem should be made clear.

Meson parallelizes all tests. This means that each test must run on a different port to prevent port collisions.

My initial (naive) attempt at working around this problem was to select a random port using the PID of the process as entropy. Obviously, we are sporadically getting PID reuse by the kernel. This causes the new test to reuse the port from an old test before it is cleared. We have several options to fix this.

1. If Meson can provide us a "test number" in the execution environment, we could use this to guarantee each test runs on a separate port.
2. We could use additional input to srand() to further qualify the entropy. One option would be a timestamp. Note that this wouldn't guarantee no port collision, since rand() could still produce the same results with different inputs to srand().
3. We could use setsockopt(SO_REUSEPORT) to allow the processes to reuse the port. But I'm not sure if this might cause other unwanted behavior.
4. We could use the NETLINK_SOCK_DIAG family from netlink(7) to get the list of ports in use and call rand() in a loop until we get an unused port. This option seems like a lot of work and also likely to cause race conditions.
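
The second option above (extra input to srand()) could look roughly like the sketch below. It is only an illustration, not the code in tlssock: the names make_seed and pick_test_port, the port range, and the choice of XOR to combine the inputs are all assumptions.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical port range for tests; not taken from the tlssock source. */
#define TEST_PORT_MIN 1024
#define TEST_PORT_MAX 65535

/* Mix the PID with a nanosecond timestamp, so that a reused PID alone
 * no longer produces the same seed. */
static unsigned int
make_seed(pid_t pid, const struct timespec *ts)
{
    return (unsigned int) pid
         ^ (unsigned int) ts->tv_sec
         ^ (unsigned int) ts->tv_nsec;
}

/* Seed rand() and map one draw into [TEST_PORT_MIN, TEST_PORT_MAX]. */
static uint16_t
pick_test_port(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    srand(make_seed(getpid(), &ts));

    return (uint16_t) (TEST_PORT_MIN
        + rand() % (TEST_PORT_MAX - TEST_PORT_MIN + 1));
}
```

As noted in the option itself, two different seeds can still map to the same port, so this lowers the collision rate without eliminating it.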

npmccallum commented 5 years ago

@yingxiongraomingzk We can do option number 2 as a short-term fix to make collisions less likely. It is easy to implement and provides a lot of value while we research other options.

frozencemetery commented 5 years ago

> If Meson can provide us a "test number" in the execution environment, we could use this to guarantee each test runs on a separate port.

I'm not aware of any meson facility to do this.

> We could use additional input to srand() to further qualify the entropy. One option would be a timestamp. Note that this wouldn't guarantee no port collision, since rand() could still produce the same results with different inputs to srand().

This is probably fine.

> We could use setsockopt(SO_REUSEPORT) to allow the processes to reuse the port. But I'm not sure if this might cause other unwanted behavior.

If they're spawned in parallel, won't this cause them to step on each other while potentially active?

> We could use the NETLINK_SOCK_DIAG family from netlink(7) to get the list of ports in use and call rand() in a loop until we get an unused port. This option seems like a lot of work and also likely to cause race conditions.

Agreed.

One other suggestion: remove the parallelization entirely. I think this is doable with a file lock: have each test instance take the lock at the start and release it when finished.
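
A rough sketch of that idea using flock(2) follows. The helper names and the lock file path are made up for illustration, and this assumes every test binary agrees on the same path.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open (creating if needed) the shared lock file and block until this
 * process holds an exclusive lock on it. Returns the descriptor to pass
 * to test_lock_release(), or -1 on failure. */
static int
test_lock_acquire(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);

    if (fd < 0)
        return -1;

    /* Retry if a signal interrupts the blocking wait. */
    while (flock(fd, LOCK_EX) < 0) {
        if (errno != EINTR) {
            close(fd);
            return -1;
        }
    }

    return fd;
}

/* Drop the lock and close the descriptor. The kernel also drops the
 * lock if the test crashes, since the descriptor is closed on exit. */
static void
test_lock_release(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

Each test would call test_lock_acquire() before picking a port and test_lock_release() just before exiting, so only one test touches the network at a time. The cost is that the suite effectively runs serially.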

npmccallum commented 5 years ago

I merged @yingxiongraomingzk's commit. Hopefully it will help. If the spurious failures don't return, we can close this bug.

enarx-archive / tlssock

Spurious failures in CI #15