Closed edsantiago closed 9 months ago
I wonder if this could be a different manifestation of the same bug? Basically, podman runs a systemd container, spins waiting for "Reached target multi-user.target
", and test fails because the string never appears in podman logs. This is podman-remote
, though, so it could be a different issue.
I see some weird output formatting in the json-file case:
...
line 424
line 425
line 426
...
However I also see this in tests which pass so I am not sure if this is caused by ginkgo or the logformatter maybe? I also counted the lines in the log and see 800 so why does the matcher say 799???
The journal case is more concerning, we are actually missing 261 lines in the log.
In any case each line also prints a \r
because we start the container with -t
should we drop that?
As a general rule I like removing -t
from tests; but ISTR that podman logs
can behave differently with -t
, and I don't know if there's a good reason for the -t
in this test. (From a quick tig blame
session, I don't think so, but I can't be sure). I think it's worth looking into removing -t
here.
yeah with -t
we only have one stream instead of separate stdout/err streams. However for the 800 line test, which I added a long time ago, the point was to test logs --tail
with a size greater than the default pagesize (4096 bytes usually) read size. https://github.com/containers/podman/pull/7232
Repurposing this issue as a gathering place for all "podman logs missing output" flakes.
Hey, is anyone looking into this? It's a serious problem, hitting a lot of different tests (suggesting it's not a badly written test). I would hate for customers to start hitting this in the field.
I am flooded with upstream issues and bugs at the moment but totally agree with you Ed.
@mheon, can we schedule a bug week soon-ish?
It's bookmarked, so I will take a look when I have some cycles (no ETA).
Some tests are definitely broken (missing wait or should not use run -d, i.e. podman pod logs -l
)
The 800 lines thing is interesting because the output is screwed up with the \r
bytes somehow: https://api.cirrus-ci.com/v1/artifact/task/6317242757939200/html/int-podman-debian-12-root-host-sqlite.log.html#t--tail-800-lines--json-file--1
But it does the same thing in tests that pass so I don't think that is the cause. And even more weird it actually shows the full 800 lines output in the log so why does ginkgo say it is only 799 lines?!!!!!!
This one is odd, looks like the linker is broken: https://api.cirrus-ci.com/v1/artifact/task/6365092854366208/html/int-podman-debian-12-rootless-host-sqlite.log.html#t--podman-remote-pod-logs-test--1 But then again other tests in this run pass so why does loading a shared lib flake with ENOENT????????
Oh, I didn't look closely enough at that one. It's probably #17042, another scary-nasty flake. I've reclassified it.
Here's one from yesterday, 800-line test, journald. It's a humongous failure, expecting 800 lines, getting 551. Looking at the journal, search in-page for "line 551
", we find:
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 551
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 552
Jun 26 16:46:47 cirrus-task-6755306202464256 systemd-journald[557]: Data hash table of /var/log/journal/eedc8417f8cb3f847c826e088dfd67ab/system.journal has a fill level at 75.0 (174763 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
Jun 26 16:46:47 cirrus-task-6755306202464256 systemd-journald[557]: /var/log/journal/eedc8417f8cb3f847c826e088dfd67ab/system.journal: Journal header limits reached or header out-of-date, rotating.
........
Jun 26 16:46:46 cirrus-task-6755306202464256 rsyslogd[666]: imjournal: journal files changed, reloading... [v8.2210.0-4.fc38 try https://www.rsyslog.com/e/0 ]
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 553
Jun 26 16:46:47 cirrus-task-6755306202464256 rsyslogd[666]: imjournal: journal files changed, reloading... [v8.2210.0-4.fc38 try https://www.rsyslog.com/e/0 ]
Looks like a smoking gun to me.
Yeah that is definitely suspicious and in fact we miss the logs after that line but there are also other test failures were we actually saw all lines in the log so I am not sure it is the only bug.
The new logformatter adds direct links to journal, making it much easier to look into these. So like here's one of those 800-line failures, k8s-file
(not journald), and it looks like all the lines are there (I didn't count) ... but journal shows a conmon error:
Jun 15 16:51:11 cirrus-task-6083635929939968 conmon[93450]: conmon 6a82177ee9e67f96f12c <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/machine.slice/libpod-6a82177ee9e67f96f12c938169b73c16537188fa8d712e151d2f7a2505ee21aa.scope/container/memory.events
Jun 15 16:51:11 cirrus-task-6083635929939968 peaceful_bhaskara[93450]: line 488
Jun 15 16:51:11 cirrus-task-6083635929939968 conmon[93450]: conmon 6a82177ee9e67f96f12c <nwarn>: stdio_input read failed Input/output error
Ok I guess the first step is to actually make ginkgo print the full matcher output so we can at least know which element is missing. Because AFAICS podman logs printed all lines we would expect.
Here's an interesting variation: remote f38 root. The interesting thing about it is journalctl
being run, and not seeing any output (vs podman logs
).
A friendly reminder that this issue had no activity for 30 days.
Oops, I thought I had posted a ha-ha-very-funny-github-actions-bot comment last night. Guess I failed to press the button. Anyhow, ha ha, very funny github-actions-bot:
This is an interesting variation on the theme:
[+1297s] not ok 375 quadlet - volume path using systemd %T specifier
...
# # journalctl -u get-percent-t.0h8sI9L7lp.service
# Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: get-percent-T unit ran to completion
# #| expected: =~ --==.\*==--
# #| actual: Aug 02 05:00:17 cirrus-task-6648908303040512 systemd\[1\]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Basically: systemctl --wait start UNIT
, then journalctl -u UNIT
, and the output from UNIT is not seen. Podman is not in the picture anywhere, suggesting that maybe the bug is in systemd/journald?
well on the good side the ones you linked look all journald related so journald bugs are very much possible.
From https://api.cirrus-ci.com/v1/task/6648908303040512/logs/journal.log:
Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
Aug 02 05:00:17 cirrus-task-6648908303040512 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=get-percent-t.0h8sI9L7lp comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: get-percent-t.0h8sI9L7lp.service: Deactivated successfully.
Aug 02 05:00:17 cirrus-task-6648908303040512 bash[212322]: --==/tmp==--
The line is there but only after the service is logged as Deactivated successfully
which seem weird. So assuming systemctl --wait start
returns before we have the line appear in journald it seems entirely possible that there are race conditions in systemd/journald regarding logging.
Yep, that's it. Reproduces in O(minutes) with this trivial reproducer:
# ./foo
.......................**FAILED foo-iiEcv4ZQKXkSceKxRuZI.service
-- No entries --
retrying after 2 seconds:
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com systemd[1]: Started foo-iiEcv4ZQKXkSceKxRuZI.service - foo.
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com systemd[1]: foo-iiEcv4ZQKXkSceKxRuZI.service: Deactivated successfully.
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com bash[81070]: GOT HERE
I will now go look into systemd bugs and see what I can find. Wish me luck.
:crossed_fingers:
Update: journald does not make any guarantees; I feel so stupid for thinking it would. I will start working on fixing the tests that rely on journalctl.
So, it's not just journald: tests fail in json-file
too: f38 rootless, rawhide rootless. We've seen json-file
flakes here before, but most failures are journald
. Two json
flakes in one run seems unusual.
Does this help chase down the bug?
I have a reproducer with json-file
, nothing special needed, just a laptop (mine is f38):
$ nl=" <----- basically doublequote, enter, doublequote, enter
"
$ cr="^M" <-------- control-V for quote-next-character, control-M for CR
$ while :;do cid=$(bin/podman run --log-driver json-file -dt quay.io/libpod/testimage:20221018 sh -c "echo podman;echo podman;echo podman");bin/podman wait $cid;logs=$(bin/podman logs --tail 2 $cid); test "$logs" = "podman${cr}${nl}podman${cr}" || break;bin/podman rm $cid;done
...
0
$ echo "$logs"
podman
$ echo "$logs" | cat -vET
podman^Mpodman$ <<<---------- Ooooh, isn't this interesting!
^M$
$ bin/podman logs $cid | cat -vET
podman^M$
podman^M$
podman^M$
Another failure (it doesn't take long). More info:
$ echo "$logs"|cat -vET
^M$
podman^M$
$ bin/podman logs $cid | cat -vET
podman^M$
podman^M$
podman^M$
$ bin/podman logs --tail 2 $cid | cat -vET
^M$
podman^M$ <<<<---- OK, at least it's consistent
$ cat -vET ~/.local/share/containers/storage/overlay-containers/$cid/userdata/ctr.log
2023-08-07T19:56:34.223758260-06:00 stdout F podman^M$
2023-08-07T19:56:34.223758260-06:00 stdout P podman$
2023-08-07T19:56:34.223816701-06:00 stdout F ^M$
2023-08-07T19:56:34.223816701-06:00 stdout F podman^M$
I doubt that this is connected in any way to the journald bug, since that one reproduces easily without podman or conmon anywhere (or even installed), but this is still a bug.
Thanks for the reproducer, yes this seems to be a bug in our log reader. Basically with --tail we read last N lines from the file which breaks when it contains partial (P) lines. If we read the file backwards we see two full (F) lines so it looks good but F basically only means append newline, the correct way would be to continue reading backwards until the next F line and we must keep all partial lines in between as well.
Yes, precisely. Filed #19545.
Hi, this is still a big source of frustration. Any new insights?
Seen in: int+sys podman+remote debian-13+fedora-37+fedora-38 root+rootless host boltdb+sqlite
Different CI runs on different days. root/rootless, f37/38, journald/json-file
Echoes of #14362, except these failures are on
main
and include thewait
fix.