podman logs: missing output

edsantiago commented 1 year ago

 [It] tail 800 lines
...
podman [options] logs --tail 800 CID
...
[FAILED] Expected <LONG STRING> to have length 800

Different CI runs on different days. root/rootless, f37/38, journald/json-file

fedora-37 : int podman fedora-37 root host sqlite
- 05-06 10:27 in tail 800 lines: journald
fedora-38 : int podman fedora-38 rootless host sqlite
- 05-05 15:51 in tail 800 lines: json-file

Echoes of #14362, except these failures are on main and include the wait fix.

edsantiago commented 1 year ago

I wonder if this could be a different manifestation of the same bug? Basically, podman runs a systemd container, spins waiting for "Reached target multi-user.target", and test fails because the string never appears in podman logs. This is podman-remote, though, so it could be a different issue.

Luap99 commented 1 year ago

I see some weird output formatting in the json-file case:

...
           line 424
line 425

           line 426
...

However I also see this in tests which pass so I am not sure if this is caused by ginkgo or the logformatter maybe? I also counted the lines in the log and see 800 so why does the matcher say 799???

The journal case is more concerning, we are actually missing 261 lines in the log.

In any case each line also prints a \r because we start the container with -t should we drop that?

edsantiago commented 1 year ago

As a general rule I like removing -t from tests; but ISTR that podman logs can behave differently with -t, and I don't know if there's a good reason for the -t in this test. (From a quick tig blame session, I don't think so, but I can't be sure). I think it's worth looking into removing -t here.

Luap99 commented 1 year ago

yeah with -t we only have one stream instead of separate stdout/err streams. However for the 800 line test, which I added a long time ago, the point was to test logs --tail with a size greater than the default pagesize (4096 bytes usually) read size. https://github.com/containers/podman/pull/7232

edsantiago commented 1 year ago

Repurposing this issue as a gathering place for all "podman logs missing output" flakes.

debian-12 : int podman debian-12 root host sqlite
- PR #17831
- 05-18 21:16 in tail 800 lines: json-file
debian-12 : int podman debian-12 rootless host sqlite
- PR #17831
- 05-18 14:08 in podman-remote pod logs test
debian-12 : int remote debian-12 root host sqlite [remote]
- PR #17831
- 05-05 15:57 in podman run container with systemd PID1
fedora-37 : int podman fedora-37 root host sqlite
- PR #17831
- 05-18 21:19 in podman pod logs -l: journald
- 05-17 11:58 in podman play kube seccomp container level
- 05-17 11:58 in podman play kube seccomp pod level
- 05-16 14:30 in using journald container name
- 05-11 09:45 in since duration 10m: journald
- 05-06 10:27 in tail 800 lines: journald
fedora-37 : int remote fedora-37 root host sqlite [remote]
- PR #17831
- 05-22 12:12 in since time 2017-08-07: journald
fedora-38 : int podman fedora-38 root host boltdb
- PR #17831
- 05-21 22:16 in podman pod logs -l: json-file
fedora-38 : int podman fedora-38 root host sqlite
- PR #17831
- 05-31 07:43 in podman pod logs -l: k8s-file
fedora-38 : int podman fedora-38 rootless host sqlite
- PR #17831
- 05-05 15:51 in tail 800 lines: json-file
rawhide : int podman rawhide root host sqlite
- PR #17831
- 05-27 09:52 in using container with container log-size: journald

edsantiago commented 1 year ago

Hey, is anyone looking into this? It's a serious problem, hitting a lot of different tests (suggesting it's not a badly written test). I would hate for customers to start hitting this in the field.

debian-12 : int podman debian-12 root host sqlite
- 05-18 21:16 in tail 800 lines: json-file
debian-12 : int podman debian-12 rootless host sqlite
- 05-18 14:08 in podman-remote pod logs test
debian-12 : int remote debian-12 root host sqlite [remote]
- 05-05 15:57 in podman run container with systemd PID1
fedora-37 : int podman fedora-37 root host sqlite
- 05-18 21:19 in podman pod logs -l: journald
- 05-17 11:58 in podman play kube seccomp container level
- 05-17 11:58 in podman play kube seccomp pod level
- 05-16 14:30 in using journald container name
- 05-11 09:45 in since duration 10m: journald
- 05-06 10:27 in tail 800 lines: journald
fedora-37 : int remote fedora-37 root host sqlite [remote]
- 05-22 12:12 in since time 2017-08-07: journald
fedora-38 : int podman fedora-38 root host boltdb
- 05-21 22:16 in podman pod logs -l: json-file
fedora-38 : int podman fedora-38 root host sqlite
- 05-31 07:43 in podman pod logs -l: k8s-file
fedora-38 : int podman fedora-38 rootless host sqlite
- 05-05 15:51 in tail 800 lines: json-file
rawhide : int podman rawhide root host sqlite
- 05-27 09:52 in using container with container log-size: journald

vrothberg commented 1 year ago

I am flooded with upstream issues and bugs at the moment but totally agree with you Ed.

@mheon, can we schedule a bug week soon-ish?

vrothberg commented 1 year ago

It's bookmarked, so I will take a look when I have some cycles (no ETA).

Luap99 commented 1 year ago

Some tests are definitely broken (missing wait or should not use run -d, i.e. podman pod logs -l)

The 800 lines thing is interesting because the output is screwed up with the \r bytes somehow: https://api.cirrus-ci.com/v1/artifact/task/6317242757939200/html/int-podman-debian-12-root-host-sqlite.log.html#t--tail-800-lines--json-file--1 But it does the same thing in tests that pass so I don't think that is the cause. And even more weird it actually shows the full 800 lines output in the log so why does ginkgo say it is only 799 lines?!!!!!!

This one is odd, looks like the linker is broken: https://api.cirrus-ci.com/v1/artifact/task/6365092854366208/html/int-podman-debian-12-rootless-host-sqlite.log.html#t--podman-remote-pod-logs-test--1 But then again other tests in this run pass so why does loading a shared lib flake with ENOENT????????

edsantiago commented 1 year ago

Oh, I didn't look closely enough at that one. It's probably #17042, another scary-nasty flake. I've reclassified it.

edsantiago commented 1 year ago

Here's one from yesterday, 800-line test, journald. It's a humongous failure, expecting 800 lines, getting 551. Looking at the journal, search in-page for "line 551", we find:

Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 551
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 552
Jun 26 16:46:47 cirrus-task-6755306202464256 systemd-journald[557]: Data hash table of /var/log/journal/eedc8417f8cb3f847c826e088dfd67ab/system.journal has a fill level at 75.0 (174763 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
Jun 26 16:46:47 cirrus-task-6755306202464256 systemd-journald[557]: /var/log/journal/eedc8417f8cb3f847c826e088dfd67ab/system.journal: Journal header limits reached or header out-of-date, rotating.
........
Jun 26 16:46:46 cirrus-task-6755306202464256 rsyslogd[666]: imjournal: journal files changed, reloading...  [v8.2210.0-4.fc38 try https://www.rsyslog.com/e/0 ]
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 553
Jun 26 16:46:47 cirrus-task-6755306202464256 rsyslogd[666]: imjournal: journal files changed, reloading...  [v8.2210.0-4.fc38 try https://www.rsyslog.com/e/0 ]

Looks like a smoking gun to me.

Luap99 commented 1 year ago

Yeah that is definitely suspicious and in fact we miss the logs after that line but there are also other test failures were we actually saw all lines in the log so I am not sure it is the only bug.

edsantiago commented 1 year ago

The new logformatter adds direct links to journal, making it much easier to look into these. So like here's one of those 800-line failures, k8s-file (not journald), and it looks like all the lines are there (I didn't count) ... but journal shows a conmon error:

Jun 15 16:51:11 cirrus-task-6083635929939968 conmon[93450]: conmon 6a82177ee9e67f96f12c <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/machine.slice/libpod-6a82177ee9e67f96f12c938169b73c16537188fa8d712e151d2f7a2505ee21aa.scope/container/memory.events
Jun 15 16:51:11 cirrus-task-6083635929939968 peaceful_bhaskara[93450]: line 488
Jun 15 16:51:11 cirrus-task-6083635929939968 conmon[93450]: conmon 6a82177ee9e67f96f12c <nwarn>: stdio_input read failed Input/output error

Luap99 commented 1 year ago

Ok I guess the first step is to actually make ginkgo print the full matcher output so we can at least know which element is missing. Because AFAICS podman logs printed all lines we would expect.

edsantiago commented 1 year ago

Here's an interesting variation: remote f38 root. The interesting thing about it is journalctl being run, and not seeing any output (vs podman logs).

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 1 year ago

Oops, I thought I had posted a ha-ha-very-funny-github-actions-bot comment last night. Guess I failed to press the button. Anyhow, ha ha, very funny github-actions-bot:

debian-12 : int remote debian-12 root host sqlite [remote]
- PR #17831
- 07-09 10:51 in Podman play kube podman play kube test correct output
- 07-08 20:54 in Podman logs tail 2 lines with timestamps: journald
fedora-37 : int remote fedora-37 root host sqlite [remote]
- PR #17831
- 07-12 18:38 in Podman play kube podman play kube test with hostIPC
- 07-10 18:43 in Podman logs podman logs test stdout and stderr: journald
- 07-10 14:03 in Podman play kube podman play kube test correct output
- 07-09 10:42 in Podman logs podman logs test stdout and stderr: journald
- 06-29-2023 19:42 in Podman logs tail 99 lines: journald
- 06-29-2023 07:53 in Podman logs tail 2 lines with timestamps: journald
- 06-28-2023 23:09 in Podman play kube podman play kube test with HostNetwork
fedora-37 : sys podman fedora-37 root host boltdb
- PR #19469
- 08-02 06:12 in [sys] quadlet - volume path using systemd %T specifier
fedora-37 : sys remote fedora-37 root host boltdb [remote]
- PR #19455
- 07-31 16:49 in [sys] podman logs - --since --follow journald
fedora-38 : int remote fedora-38 root host boltdb [remote]
- PR #17831
- 07-11 07:37 in Podman logs tail 800 lines: journald
- 07-10 15:09 in Podman logs podman logs test stdout and stderr: journald
- 07-06 23:20 in Podman play kube podman play kube test with hostIPC
- 06-29-2023 07:48 in Verify podman containers.conf usage using journald for container with container log_tag
fedora-38 : int remote fedora-38 root host sqlite [remote]
- PR #17831
- 07-10 18:43 in Verify podman containers.conf usage using journald for container with container log_tag
- 07-09 10:42 in Podman logs podman logs partial log lines: journald
- 07-07 19:14 in Podman play kube podman play kube test with init containers and annotation set
- 07-06 20:36 in Podman logs using container with container log-size: journald
- 07-06 12:51 in Podman play kube podman play kube should be able to run image where workdir is a symlink
- 07-05 20:03 in Podman logs tail 2 lines with timestamps: journald
rawhide : int remote rawhide root host sqlite [remote]
- PR #17831
- 07-10 16:28 in Verify podman containers.conf usage using journald for container with container log_tag
- 07-09 09:00 in Verify podman containers.conf usage using journald for container with container log_tag
- 07-07 20:55 in Podman logs Make sure logs match expected length: journald
rawhide : sys podman rawhide rootless host sqlite
- PR #19385
- 07-27 10:56 in [sys] events - container inspect data - journald

edsantiago commented 1 year ago

This is an interesting variation on the theme:

[+1297s] not ok 375 quadlet - volume path using systemd %T specifier
...
         # # journalctl -u get-percent-t.0h8sI9L7lp.service
         # Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: get-percent-T unit ran to completion
         # #| expected: =~  --==.\*==--
         # #|   actual:    Aug 02 05:00:17 cirrus-task-6648908303040512 systemd\[1\]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Basically: systemctl --wait start UNIT, then journalctl -u UNIT, and the output from UNIT is not seen. Podman is not in the picture anywhere, suggesting that maybe the bug is in systemd/journald?

Luap99 commented 1 year ago

well on the good side the ones you linked look all journald related so journald bugs are very much possible.

From https://api.cirrus-ci.com/v1/task/6648908303040512/logs/journal.log:

Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: Started get-percent-t.0h8sI9L7lp.service - Get the value of percent T.
Aug 02 05:00:17 cirrus-task-6648908303040512 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=get-percent-t.0h8sI9L7lp comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 02 05:00:17 cirrus-task-6648908303040512 systemd[1]: get-percent-t.0h8sI9L7lp.service: Deactivated successfully.
Aug 02 05:00:17 cirrus-task-6648908303040512 bash[212322]: --==/tmp==--

The line is there but only after the service is logged as Deactivated successfully which seem weird. So assuming systemctl --wait start returns before we have the line appear in journald it seems entirely possible that there are race conditions in systemd/journald regarding logging.

edsantiago commented 1 year ago

Yep, that's it. Reproduces in O(minutes) with this trivial reproducer:

# ./foo
.......................**FAILED foo-iiEcv4ZQKXkSceKxRuZI.service

-- No entries --
retrying after 2 seconds:
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com systemd[1]: Started foo-iiEcv4ZQKXkSceKxRuZI.service - foo.
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com systemd[1]: foo-iiEcv4ZQKXkSceKxRuZI.service: Deactivated successfully.
Aug 02 10:23:48 vm-10-0-186-124.hosted.upshift.rdu2.redhat.com bash[81070]: GOT HERE

I will now go look into systemd bugs and see what I can find. Wish me luck.

vrothberg commented 1 year ago

:crossed_fingers:

edsantiago commented 1 year ago

Filed https://github.com/systemd/systemd/issues/28650

edsantiago commented 1 year ago

Update: journald does not make any guarantees; I feel so stupid for thinking it would. I will start working on fixing the tests that rely on journalctl.

edsantiago commented 1 year ago

So, it's not just journald: tests fail in json-file too: f38 rootless, rawhide rootless. We've seen json-file flakes here before, but most failures are journald. Two json flakes in one run seems unusual.

Does this help chase down the bug?

edsantiago commented 1 year ago

I have a reproducer with json-file, nothing special needed, just a laptop (mine is f38):

$ nl="        <----- basically doublequote, enter, doublequote, enter
"
$ cr="^M"      <-------- control-V for quote-next-character, control-M for CR
$ while :;do cid=$(bin/podman run --log-driver json-file -dt quay.io/libpod/testimage:20221018 sh -c "echo podman;echo podman;echo podman");bin/podman wait $cid;logs=$(bin/podman logs --tail 2 $cid); test "$logs" = "podman${cr}${nl}podman${cr}" || break;bin/podman rm $cid;done
...
0
$ echo "$logs"
podman

$ echo "$logs" | cat -vET
podman^Mpodman$         <<<---------- Ooooh, isn't this interesting!
^M$
$ bin/podman logs $cid | cat -vET
podman^M$
podman^M$
podman^M$

edsantiago commented 1 year ago

Another failure (it doesn't take long). More info:

$ echo "$logs"|cat -vET
^M$
podman^M$
$ bin/podman logs $cid | cat -vET
podman^M$
podman^M$
podman^M$
$ bin/podman logs --tail 2 $cid | cat -vET
^M$
podman^M$                        <<<<---- OK, at least it's consistent
$ cat -vET ~/.local/share/containers/storage/overlay-containers/$cid/userdata/ctr.log
2023-08-07T19:56:34.223758260-06:00 stdout F podman^M$
2023-08-07T19:56:34.223758260-06:00 stdout P podman$
2023-08-07T19:56:34.223816701-06:00 stdout F ^M$
2023-08-07T19:56:34.223816701-06:00 stdout F podman^M$

I doubt that this is connected in any way to the journald bug, since that one reproduces easily without podman or conmon anywhere (or even installed), but this is still a bug.

Luap99 commented 1 year ago

Thanks for the reproducer, yes this seems to be a bug in our log reader. Basically with --tail we read last N lines from the file which breaks when it contains partial (P) lines. If we read the file backwards we see two full (F) lines so it looks good but F basically only means append newline, the correct way would be to continue reading backwards until the next F line and we must keep all partial lines in between as well.

edsantiago commented 1 year ago

Yes, precisely. Filed #19545.

edsantiago commented 1 year ago

Hi, this is still a big source of frustration. Any new insights?

debian-13 : sys podman debian-13 rootless host sqlite
- PR #17831
- 09-27-2023 10:38 in [sys] podman logs - --since --follow journald
fedora-37 : sys podman fedora-37 root host boltdb
- PR #20190
- 09-28 15:08 in [sys] podman kube play with umask from containers.conf
fedora-38 : int remote fedora-38 root host sqlite [remote]
- PR #17831
- 09-14-2023 21:34 in Podman play kube podman play kube test with sysctl defined
fedora-38 : sys podman fedora-38 root host boltdb
- PR #20451
- 10-23 17:42 in [sys] podman kube play with configmaps
- PR #20385
- 10-19 15:25 in [sys] podman kube play with umask from containers.conf
- 10-19 14:27 in [sys] podman kube play with configmaps
- 10-19 13:33 in [sys] podman kube play with configmaps
- PR #20357
- 10-18 18:22 in [sys] podman kube play with configmaps
- 10-18 16:46 in [sys] podman kube play with umask from containers.conf
- PR #20161
- 10-17 15:44 in [sys] sdnotify : play kube - with policies
fedora-38 : sys podman fedora-38 root host sqlite
- PR #17831
- 10-25 11:30 in [sys] sdnotify : play kube - with policies
- 10-05 18:34 in [sys] sdnotify : play kube - with policies
fedora-38 : sys podman fedora-38 rootless host boltdb
- PR #20161
- 10-17 15:43 in [sys] sdnotify : play kube - with policies
- 10-10 15:23 in [sys] sdnotify : play kube - with policies
fedora-38 : sys podman fedora-38 rootless host sqlite
- PR #17831
- 10-09 17:47 in [sys] sdnotify : play kube - with policies

Seen in: int+sys podman+remote debian-13+fedora-37+fedora-38 root+rootless host boltdb+sqlite

containers / podman

podman logs: missing output #18501