Open edsantiago opened 4 months ago
The test actually succeeds, but we include socat's stderr in the output string. That waitpid() doing nothing appears harmless to me, so I wonder if we should just drop socat's stderr here.
Redirecting socat's stderr to /dev/null is not great, though, because if there's an actual error we would have a hard time finding it.
I wonder if we should use a variable other than run_podman()'s $output, or tweak run_podman() to avoid merging stderr and stdout (my understanding is that it does that at the moment).
The merging into $output is done by bats by default, although newer bats versions have a --separate-stderr option for run that could be used; it would need to be added to run_podman.
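For illustration, a minimal sketch of what --separate-stderr does at the bats level (assuming bats-core >= 1.5.0; this is not the actual run_podman change, just the underlying mechanism it would rely on):

```bash
#!/usr/bin/env bats

# Passing flags to `run` requires declaring a minimum bats version.
bats_require_minimum_version 1.5.0

@test "stderr is kept separate from stdout" {
    # Without --separate-stderr, bats merges stdout and stderr into $output.
    # With it, $output holds only stdout; stderr lands in $stderr/$stderr_lines.
    run --separate-stderr sh -c 'echo out-line; echo err-line >&2'

    [ "$status" -eq 0 ]
    [ "$output" = "out-line" ]
    [ "$stderr" = "err-line" ]
}
```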
But in general the question is why do we get this warning in the first place? Is this a bug in socat?
I didn't reproduce this, so it's all a bit tentative, but I think xioshutdown() (other waitpid() warnings are printed differently) is called "too late", after the receiving child process (in the container) has already terminated.
Which is fine, as xioshutdown() just needs to ensure that the child process terminates... but yes, it prints a warning. Perhaps we could fix this in socat by omitting that warning if the error is ECHILD, say:
```diff
diff --git a/xioshutdown.c b/xioshutdown.c
index c9fef8d..9290be6 100644
--- a/xioshutdown.c
+++ b/xioshutdown.c
@@ -134,8 +134,10 @@ int xioshutdown(xiofile_t *sock, int how) {
       Alarm(1 /*! sock->stream.para.exec.waitdie */);
 #endif /* !HAVE_SETITIMER */
       if (Waitpid(sock->stream.para.exec.pid, &status, 0) < 0) {
-         Warn3("waitpid("F_pid", %p, 0): %s",
-               sock->stream.para.exec.pid, &status, strerror(errno));
+         if (errno != ECHILD) {
+            Warn3("waitpid("F_pid", %p, 0): %s",
+                  sock->stream.para.exec.pid, &status, strerror(errno));
+         }
       }
       Alarm(0);
    }
```
but I would need to find out how to reproduce this, before being confident enough that the patch is correct.
Right, but in order to fail with ECHILD, socat must have called waitpid() more often than it has child processes, which sounds like the real bug to me. Of course, hiding the warning on ECHILD is simple.
As far as reproducing goes, check out PR https://github.com/containers/podman/pull/23275 and run
hack/bats --rootless --tag=ci:parallel -T 505
I see https://github.com/containers/podman/issues/23471 (maybe directly related?) on basically every run locally on my 12-thread laptop. I think I might have seen this issue as well, but not in the few runs I tried today, so this one might be a bit more difficult to trigger.
And @edsantiago, I am also seeing a ton of "Failed to bind port 5328 (Address already in use)" flakes, so I don't think the random_free_port logic based on job slot is working...
Tried this (checked out #23275 at commit 67ebec1ae19141f56b390df334a2f0b104283b35 and ran hack/bats --rootless --tag=ci:parallel -T 505), but was not able to reproduce it on my 20-thread laptop. I did once see what I think is not the original bug from #23471, but the TCP problem described later.
So I don't really have enough information to guess what's going wrong here.
@Luap99, any chance you could try to get an strace -f of the failing test case? There's a pretty high chance that will change the timing enough that it won't reproduce any more, but it's worth a shot.
Still happening with pasta 08-21
Yeah, I don't think this has anything to do with the podman or pasta version. To me this sounds like a bug in socat where it tries to reap the child process twice. Looking at the socat code, there is both a SIGCHLD signal handler that reaps children and the waitpid() call on shutdown for the exec pid, which reaps again. Thus I would assume that if the handler runs before the waitpid() call on shutdown, we see this warning. So I agree with @sbrivio-rh's patch above: this is harmless, and we likely should just work around it in the tests for now.
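To illustrate the suspected race (a standalone sketch, not socat's actual code): once a SIGCHLD handler has already reaped the child, a later blocking waitpid() on that pid fails with ECHILD, which is exactly the warning that ends up in the test output.

```c
/* Sketch of the suspected race, not socat's code: a SIGCHLD handler reaps
 * the child first, so the waitpid() in the shutdown path gets ECHILD. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void sigchld_handler(int sig) {
    (void)sig;
    /* Reap any exited children, analogous to socat's SIGCHLD handler. */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = sigchld_handler;
    sigaction(SIGCHLD, &sa, NULL);

    pid_t pid = fork();
    if (pid == 0)
        _exit(0);          /* child terminates immediately */

    sleep(1);              /* handler runs and reaps the child meanwhile */

    /* "Shutdown" path: the pid was already reaped, so this fails. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        fprintf(stderr, "waitpid(%d, %p, 0): %s\n",
                (int)pid, (void *)&status, strerror(errno)); /* ECHILD */
    return 0;
}
```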
Only in my parallel-systest PR