Open frozencemetery opened 5 years ago
Let me explain how this works and then the reason for the problem should be made clear.
Meson parallelizes all tests. This means that each test must run on a different port to prevent port collisions.
My initial (naive) attempt at working around this problem was to select a random port, seeding the generator with the PID of the process. Evidently the kernel sporadically reuses PIDs, so a new test can pick the same port as an earlier test before that port has been released. We have several options to fix this.
- If Meson can provide us a "test number" in the execution environment, we could use this to guarantee each test runs on a separate port.
- We could use additional input to srand() to further qualify the entropy. One option would be a timestamp. Note that this wouldn't guarantee no port collision, since rand() could still produce the same results with different inputs to srand().
- We could use setsockopt(SO_REUSEPORT) to allow the processes to reuse the port. But I'm not sure if this might cause other unwanted behavior.
- We could use the NETLINK_SOCK_DIAG family from netlink(7) to get the list of ports in use and call rand() in a loop until we get an unused port. This option seems like a lot of work and also likely to cause race conditions.
@yingxiongraomingzk We can do option number 2 as a short-term fix to make collision less likely. It is easy to implement and provides a lot of value while we research other options.
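For reference, a minimal sketch of what the timestamp option could look like. This is not the code from the test suite: the function names and the port range are made up for illustration, and mixing in the time only lowers the chance of a collision rather than removing it.

```c
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical port range for test servers; not taken from the project. */
#define TEST_PORT_MIN 20000
#define TEST_PORT_SPAN 20000

/* Combine the PID with the current time so that a reused PID
 * no longer implies a reused seed. */
static unsigned int make_seed(void)
{
    unsigned int seed = (unsigned int)getpid();

    seed = seed * 31u + (unsigned int)time(NULL);
    return seed;
}

/* Pick a pseudo-random port in
 * [TEST_PORT_MIN, TEST_PORT_MIN + TEST_PORT_SPAN). */
int pick_test_port(void)
{
    srand(make_seed());
    return TEST_PORT_MIN + rand() % TEST_PORT_SPAN;
}
```

Two different seeds can still map to the same port, so the caller should still treat a failed bind() as a possible outcome.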
- If Meson can provide us a "test number" in the execution environment, we could use this to guarantee each test runs on a separate port.
I'm not aware of any meson facility to do this.
- We could use additional input to srand() to further qualify the entropy. One option would be a timestamp. Note that this wouldn't guarantee no port collision, since rand() could still produce the same results with different inputs to srand().
This is probably fine.
- We could use setsockopt(SO_REUSEPORT) to allow the processes to reuse the port. But I'm not sure if this might cause other unwanted behavior.
If they're spawned in parallel, won't this cause them to step on each other while potentially active?
- We could use the NETLINK_SOCK_DIAG family from netlink(7) to get the list of ports in use and call rand() in a loop until we get an unused port. This option seems like a lot of work and also likely to cause race conditions.
Agreed.
One other suggestion: remove the parallelization entirely. I think this is doable with a file lock: have each test instance take the lock at the start and release it when finished.
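A minimal sketch of the file lock idea, assuming flock(2) is acceptable on the platforms we test on. The function names and the lock file path passed in are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open the lock file (creating it if needed) and block until an
 * exclusive lock is held. Returns the fd holding the lock, or -1. */
int test_lock_acquire(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;

    /* Retry if the blocking call is interrupted by a signal. */
    while (flock(fd, LOCK_EX) < 0) {
        if (errno != EINTR) {
            close(fd);
            return -1;
        }
    }
    return fd;
}

/* Drop the lock and close the lock file. */
void test_lock_release(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

Each test would call test_lock_acquire() before starting its server and test_lock_release() once it is done. The kernel drops the lock when the process exits, so a test that crashes should not block the ones queued behind it.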
I merged @yingxiongraomingzk's commit. Hopefully it will help. If the spurious failures don't return, we can close this bug.
To my eye these appear to be mostly on the rawhide builders, but I'm told that's coincidental. There's a port collision when setting up the test suite: we may try to use an already-in-use port and get sporadic failures.