pasta: IPv4 large transfer loopback: socat waitpid no child process

edsantiago commented 4 months ago

         not ok 187 |505| UDP%2FIPv4 large transfer, loopback in 3178ms
...
<+371ms> # $ podman run --rm --name=c-socat-t187-j170t3a5 --net=pasta -p [127.0.0.1]:5298:5298/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5298 5298); do                              socat -u UDP4-LISTEN:${port},bind=[127.0.0.1],null-eof EXEC:md5sum &                          done; wait
<+2.11s> # 70aef73d20b2fa83e5f3f6b04419c2c6  -
         # 2024/08/01 17:02:29 socat[3] W waitpid(0, 0x7ffc317a676c, 0): No child process
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: Mismatch between data sent and received
         # #| expected: = 70aef73d20b2fa83e5f3f6b04419c2c6  -
         # #|   actual:   70aef73d20b2fa83e5f3f6b04419c2c6  -
         # #|         >   2024/08/01 17:02:29 socat\[3\] W waitpid\(0\, 0x7ffc317a676c\, 0\): No child process
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Only in my parallel-systest PR

fedora-39 : sys podman fedora-39 rootless host boltdb
- PR #23275
- 08-01 13:03 in [sys] |505| UDP%2FIPv4 large transfer, loopback
fedora-40 : sys podman fedora-40 rootless host sqlite
- PR #23275
- 07-29 10:31 in [sys] |505| UDP%2FIPv4 large transfer, loopback

x	x	x	x	x	x
sys(2)	podman(2)	fedora-39(1)	rootless(2)	host(2)	boltdb(1)
		fedora-40(1)			sqlite(1)

sbrivio-rh commented 4 months ago

The test actually succeeds, but we include socat's stderr in the output string. That waitpid() doing nothing appears harmless to me, so I wonder if we should just drop socat's stderr here.

Redirecting socat's stderr to /dev/null is not great, though, because if there's an actual error we would have a hard time finding it.

I wonder if we should use another variable other than run_podman()'s $output, or tweak run_podman() to avoid merging stderr and stdout (my understanding of it is that it does that, at the moment).

Luap99 commented 4 months ago

the merging of $output is done by bats by default although newer bats version have --separate-stderr for run that can be used which would need to be added to run_podman

But in general the question is why do we get this warning in the first place? Is this a bug in socat?

sbrivio-rh commented 4 months ago

But in general the question is why do we get this warning in the first place? Is this a bug in socat?

I didn't reproduce this, so it's all a bit tentative, but I think xioshutdown() (other waitpid() warnings are printed differently) is called "too late", once the receiving child process (in the container) already terminated.

Which is fine, as xioshutdown() just needs to ensure that the child process terminates... but yes, it prints a warning. Perhaps we could fix this in socat by omitting that warning if the error is ECHILD, say:

diff --git a/xioshutdown.c b/xioshutdown.c
index c9fef8d..9290be6 100644
--- a/xioshutdown.c
+++ b/xioshutdown.c
@@ -134,8 +134,10 @@ int xioshutdown(xiofile_t *sock, int how) {
     Alarm(1 /*! sock->stream.para.exec.waitdie */);
 #endif /* !HAVE_SETITIMER */
     if (Waitpid(sock->stream.para.exec.pid, &status, 0) < 0) {
-       Warn3("waitpid("F_pid", %p, 0): %s",
-         sock->stream.para.exec.pid, &status, strerror(errno));
+       if (errno != ECHILD) {
+          Warn3("waitpid("F_pid", %p, 0): %s",
+                sock->stream.para.exec.pid, &status, strerror(errno));
+            }
     }
     Alarm(0);
       }

but I would need to find out how to reproduce this, before being confident enough that the patch is correct.

Luap99 commented 4 months ago

right but in order to fail with ECHILD it means socat must have called waitpid more often than it has child processes which sounds like the real bug to me. OF course hiding the warning on ECHILD is simple.

As far as reproducing goes check out PR https://github.com/containers/podman/pull/23275 and run hack/bats --rootless --tag=ci:parallel -T 505

I see https://github.com/containers/podman/issues/23471 (maybe directly related?) on basically every run locally on my 12 thread laptop. I think I might have seen this issue as well but not in the few runs I tried today so this one might be a bit more difficult to trigger.

And @edsantiago I am also seeing a ton of Failed to bind port 5328 (Address already in use) flakes so I don't think the random_free_port logic based on job slot is working...

dgibson commented 4 months ago

As far as reproducing goes check out PR #23275 and run hack/bats --rootless --tag=ci:parallel -T 505

I see #23471 (maybe directly related?) on basically every run locally on my 12 thread laptop. I think I might have seen this issue as well but not in the few runs I tried today so this one might be a bit more difficult to trigger.

Tried this (on commit 67ebec1ae19141f56b390df334a2f0b104283b35 specifically), but was not able to reproduce (on my 20 thread laptop). I did once see what I think is not the original bug from #23471, but the TCP problem described later.

So, I don't really enough information to guess what's going wrong here.

@Luap99 any chance you could try to get an strace -f of the failing testcase? There's a pretty high chance that will change the timing enough that it won't reproduce any more, but it's worth a shot.

edsantiago commented 3 months ago

Still happening with pasta 08-21

Luap99 commented 3 months ago

Yeah I don't think this has anything to do with the podman or pasta version. To me this sounds like a bug in socat where it tries to reap the child process twice. Looking at the socat code there is both a signal handler for SIGCHLD that reaps childs and then the waitpid call on shutdown for the exec pid which reaps again. Thus I would assume if the handler runs before the waipid call on shutdown we see this warning. So I agree with @sbrivio-rh patch above as this is harmless and we likely should just work around that for now in the tests.

containers / podman

pasta: IPv4 large transfer loopback: socat waitpid no child process #23482