containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0
23.71k stars 2.41k forks source link

podman logs: missing output #18501

Closed edsantiago closed 9 months ago

edsantiago commented 1 year ago
 [It] tail 800 lines
...
podman [options] logs --tail 800 CID
...
[FAILED] Expected <LONG STRING> to have length 800

Different CI runs on different days. root/rootless, f37/38, journald/json-file

Echoes of #14362, except these failures are on main and include the wait fix.

edsantiago commented 1 year ago

I wonder if this could be a different manifestation of the same bug? Basically, podman runs a systemd container, spins waiting for "Reached target multi-user.target", and test fails because the string never appears in podman logs. This is podman-remote, though, so it could be a different issue.

Luap99 commented 1 year ago

I see some weird output formatting in the json-file case:

...
           line 424
line 425

           line 426
...

However I also see this in tests which pass so I am not sure if this is caused by ginkgo or the logformatter maybe? I also counted the lines in the log and see 800 so why does the matcher say 799???

The journal case is more concerning, we are actually missing 261 lines in the log.


In any case each line also prints a \r because we start the container with -t should we drop that?

edsantiago commented 1 year ago

As a general rule I like removing -t from tests; but ISTR that podman logs can behave differently with -t, and I don't know if there's a good reason for the -t in this test. (From a quick tig blame session, I don't think so, but I can't be sure). I think it's worth looking into removing -t here.

Luap99 commented 1 year ago

yeah with -t we only have one stream instead of separate stdout/err streams. However for the 800 line test, which I added a long time ago, the point was to test logs --tail with a size greater than the default pagesize (4096 bytes usually) read size. https://github.com/containers/podman/pull/7232

edsantiago commented 1 year ago

Repurposing this issue as a gathering place for all "podman logs missing output" flakes.

edsantiago commented 1 year ago

Hey, is anyone looking into this? It's a serious problem, hitting a lot of different tests (suggesting it's not a badly written test). I would hate for customers to start hitting this in the field.

vrothberg commented 1 year ago

I am flooded with upstream issues and bugs at the moment but totally agree with you Ed.

@mheon, can we schedule a bug week soon-ish?

vrothberg commented 1 year ago

It's bookmarked, so I will take a look when I have some cycles (no ETA).

Luap99 commented 1 year ago

Some tests are definitely broken (missing wait or should not use run -d, i.e. podman pod logs -l)


The 800 lines thing is interesting because the output is screwed up with the \r bytes somehow: https://api.cirrus-ci.com/v1/artifact/task/6317242757939200/html/int-podman-debian-12-root-host-sqlite.log.html#t--tail-800-lines--json-file--1 But it does the same thing in tests that pass so I don't think that is the cause. And even more weird it actually shows the full 800 lines output in the log so why does ginkgo say it is only 799 lines?!!!!!!


This one is odd, looks like the linker is broken: https://api.cirrus-ci.com/v1/artifact/task/6365092854366208/html/int-podman-debian-12-rootless-host-sqlite.log.html#t--podman-remote-pod-logs-test--1 But then again other tests in this run pass so why does loading a shared lib flake with ENOENT????????

edsantiago commented 1 year ago

Oh, I didn't look closely enough at that one. It's probably #17042, another scary-nasty flake. I've reclassified it.

edsantiago commented 1 year ago

Here's one from yesterday, 800-line test, journald. It's a humongous failure, expecting 800 lines, getting 551. Looking at the journal, search in-page for "line 551", we find:

Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 551
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 552
Jun 26 16:46:47 cirrus-task-6755306202464256 systemd-journald[557]: Data hash table of /var/log/journal/eedc8417f8cb3f847c826e088dfd67ab/system.journal has a fill level at 75.0 (174763 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
Jun 26 16:46:47 cirrus-task-6755306202464256 systemd-journald[557]: /var/log/journal/eedc8417f8cb3f847c826e088dfd67ab/system.journal: Journal header limits reached or header out-of-date, rotating.
........
Jun 26 16:46:46 cirrus-task-6755306202464256 rsyslogd[666]: imjournal: journal files changed, reloading...  [v8.2210.0-4.fc38 try https://www.rsyslog.com/e/0 ]
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 553
Jun 26 16:46:47 cirrus-task-6755306202464256 rsyslogd[666]: imjournal: journal files changed, reloading...  [v8.2210.0-4.fc38 try https://www.rsyslog.com/e/0 ]

Looks like a smoking gun to me.

Luap99 commented 1 year ago

Yeah that is definitely suspicious and in fact we miss the logs after that line but there are also other test failures were we actually saw all lines in the log so I am not sure it is the only bug.

edsantiago commented 1 year ago

The new logformatter adds direct links to journal, making it much easier to look into these. So like here's one of those 800-line failures, k8s-file (not journald), and it looks like all the lines are there (I didn't count) ... but journal shows a conmon error:

Jun 15 16:51:11 cirrus-task-6083635929939968 conmon[93450]: conmon 6a82177ee9e67f96f12c <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/machine.slice/libpod-6a82177ee9e67f96f12c938169b73c16537188fa8d712e151d2f7a2505ee21aa.scope/container/memory.events
Jun 15 16:51:11 cirrus-task-6083635929939968 peaceful_bhaskara[93450]: line 488
Jun 15 16:51:11 cirrus-task-6083635929939968 conmon[93450]: conmon 6a82177ee9e67f96f12c <nwarn>: stdio_input read failed Input/output error
Luap99 commented 1 year ago

Ok I guess the first step is to actually make ginkgo print the full matcher output so we can at least know which element is missing. Because AFAICS podman logs printed all lines we would expect.

edsantiago commented 1 year ago

Here's an interesting variation: remote f38 root. The interesting thing about it is journalctl being run, and not seeing any output (vs podman logs).

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 1 year ago

Oops, I thought I had posted a ha-ha-very-funny-github-actions-bot comment last night. Guess I failed to press the button. Anyhow, ha ha, very funny github-actions-bot:

edsantiago commented 1 year ago

This is an interesting variation on the theme:

[+1297s] not ok 375 quadlet - volume path using systemd %T specifier
...
         # # journalctl -u get-percent-t.0h8sI9L7lp.service
         # Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: get-percent-T unit ran to completion
         # #| expected: =~  --==.\*==--
         # #|   actual:    Aug 02 05:00:17 cirrus-task-6648908303040512 systemd\[1\]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Basically: systemctl --wait start UNIT, then journalctl -u UNIT, and the output from UNIT is not seen. Podman is not in the picture anywhere, suggesting that maybe the bug is in systemd/journald?

Luap99 commented 1 year ago

well on the good side the ones you linked look all journald related so journald bugs are very much possible.

From https://api.cirrus-ci.com/v1/task/6648908303040512/logs/journal.log:

Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
Aug 02 05:00:17 cirrus-task-6648908303040512 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=get-percent-t.0h8sI9L7lp comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: get-percent-t.0h8sI9L7lp.service: Deactivated successfully.
Aug 02 05:00:17 cirrus-task-6648908303040512 bash[212322]: --==/tmp==--

The line is there but only after the service is logged as Deactivated successfully which seem weird. So assuming systemctl --wait start returns before we have the line appear in journald it seems entirely possible that there are race conditions in systemd/journald regarding logging.

edsantiago commented 1 year ago

Yep, that's it. Reproduces in O(minutes) with this trivial reproducer:

# ./foo
.......................**FAILED foo-iiEcv4ZQKXkSceKxRuZI.service

-- No entries --
retrying after 2 seconds:
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com systemd[1]: Started foo-iiEcv4ZQKXkSceKxRuZI.service - foo.
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com systemd[1]: foo-iiEcv4ZQKXkSceKxRuZI.service: Deactivated successfully.
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com bash[81070]: GOT HERE

I will now go look into systemd bugs and see what I can find. Wish me luck.

vrothberg commented 1 year ago

:crossed_fingers:

edsantiago commented 1 year ago

Filed https://github.com/systemd/systemd/issues/28650

edsantiago commented 1 year ago

Update: journald does not make any guarantees; I feel so stupid for thinking it would. I will start working on fixing the tests that rely on journalctl.

edsantiago commented 1 year ago

So, it's not just journald: tests fail in json-file too: f38 rootless, rawhide rootless. We've seen json-file flakes here before, but most failures are journald. Two json flakes in one run seems unusual.

Does this help chase down the bug?

edsantiago commented 1 year ago

I have a reproducer with json-file, nothing special needed, just a laptop (mine is f38):

$ nl="        <----- basically doublequote, enter, doublequote, enter
"
$ cr="^M"      <-------- control-V for quote-next-character, control-M for CR
$ while :;do cid=$(bin/podman run --log-driver json-file -dt quay.io/libpod/testimage:20221018 sh -c "echo podman;echo podman;echo podman");bin/podman wait $cid;logs=$(bin/podman logs --tail 2 $cid); test "$logs" = "podman${cr}${nl}podman${cr}" || break;bin/podman rm $cid;done
...
0
$ echo "$logs"
podman

$ echo "$logs" | cat -vET
podman^Mpodman$         <<<---------- Ooooh, isn't this interesting!
^M$
$ bin/podman logs $cid | cat -vET
podman^M$
podman^M$
podman^M$
edsantiago commented 1 year ago

Another failure (it doesn't take long). More info:

$ echo "$logs"|cat -vET
^M$
podman^M$
$ bin/podman logs $cid | cat -vET
podman^M$
podman^M$
podman^M$
$ bin/podman logs --tail 2 $cid | cat -vET
^M$
podman^M$                        <<<<---- OK, at least it's consistent
$ cat -vET ~/.local/share/containers/storage/overlay-containers/$cid/userdata/ctr.log
2023-08-07T19:56:34.223758260-06:00 stdout F podman^M$
2023-08-07T19:56:34.223758260-06:00 stdout P podman$
2023-08-07T19:56:34.223816701-06:00 stdout F ^M$
2023-08-07T19:56:34.223816701-06:00 stdout F podman^M$

I doubt that this is connected in any way to the journald bug, since that one reproduces easily without podman or conmon anywhere (or even installed), but this is still a bug.

Luap99 commented 1 year ago

Thanks for the reproducer, yes this seems to be a bug in our log reader. Basically with --tail we read last N lines from the file which breaks when it contains partial (P) lines. If we read the file backwards we see two full (F) lines so it looks good but F basically only means append newline, the correct way would be to continue reading backwards until the next F line and we must keep all partial lines in between as well.

edsantiago commented 1 year ago

Yes, precisely. Filed #19545.

edsantiago commented 1 year ago

Hi, this is still a big source of frustration. Any new insights?

Seen in: int+sys podman+remote debian-13+fedora-37+fedora-38 root+rootless host boltdb+sqlite