containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0
23.83k stars 2.42k forks source link

Long k8s-file logs are truncated on 'podman logs --follow' output of stopped container #21914

Open UniversalSuperBox opened 8 months ago

UniversalSuperBox commented 8 months ago

Issue Description

When a container using the k8s-file log driver outputs a lot of logs, retrieving them with podman logs --follow is unreliable. If you're lucky, the entire log file is output successfully. If you're not lucky, the log is truncated at some unspecified place depending on the speed of your hardware.

Retrieving logs without --follow is unaffected. Retrieving logs from a running container is unaffected.

Steps to reproduce the issue

Steps to reproduce the issue

  1. Run a container that outputs a lot of logs with the k8s-file log driver. Allow it to stop completely.
  2. Run podman logs --follow on the container

I've provided a reproducer at https://gist.github.com/UniversalSuperBox/031f34ecb51c5bd04cb559542d9ba519. You can run 1test.sh and have it run podman logs until it fails, or just run podman logs --follow --latest | tail -n 1:

core@localhost:~/stuff$ podman logs --follow strange_shockley | tail -n 1
scelerisque semper tortor. Sed commodo libero id posuere facilisis. Cras
core@localhost:~/stuff$ podman logs --follow strange_shockley | tail -n 1

core@localhost:~/stuff$ podman logs --follow strange_shockley | tail -n 1

core@localhost:~/stuff$ podman logs --follow strange_shockley | tail -n 1
torquent per conubia nostra, per inceptos himenaeos. Nullam elementum
core@localhost:~/stuff$ podman logs --follow strange_shockley | tail -n 1
Morbi id elit ultrices, pulvinar odio sit amet, elementum lectus.
core@localhost:~/stuff$ podman logs --follow strange_shockley | tail -n 1
leo ac rutrum ullamcorper. Phasellus ut ultricies ligula. Phasellus

Describe the results you received

The length of the log differs on every run of podman logs --follow

Describe the results you expected

The entire log is output.

podman info output

host:
  arch: arm64
  buildahVersion: 1.33.5
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.10-1.fc39.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: '
  cpuUtilization:
    idlePercent: 98.99
    systemPercent: 0.45
    userPercent: 0.56
  cpus: 8
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: coreos
    version: "39"
  eventLogger: journald
  freeLocks: 2017
  hostname: localhost.localdomain
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
    uidmap:
    - container_id: 0
      host_id: 502
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
  kernel: 6.7.5-200.fc39.aarch64
  linkmode: dynamic
  logDriver: journald
  memFree: 2676797440
  memTotal: 4084744192
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.10.0-1.fc39.aarch64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.10.0
    package: netavark-1.10.3-1.fc39.aarch64
    path: /usr/libexec/podman/netavark
    version: netavark 1.10.3
  ociRuntime:
    name: crun
    package: crun-1.14.3-1.fc39.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.14.3
      commit: 1961d211ba98f532ea52d2e80f4c20359f241a98
      rundir: /run/user/502/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240220.g1e6f92b-1.fc39.aarch64
    version: |
      pasta 0^20240220.g1e6f92b-1.fc39.aarch64-pasta
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/502/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-1.fc39.aarch64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 0h 50m 30.00s
  variant: v8
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/core/.config/containers/storage.conf
  containerStore:
    number: 5
    paused: 0
    running: 0
    stopped: 5
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/core/.local/share/containers/storage
  graphRootAllocated: 106769133568
  graphRootUsed: 13598543872
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 62
  runRoot: /run/user/502/containers
  transientStore: false
  volumePath: /var/home/core/.local/share/containers/storage/volumes
version:
  APIVersion: 4.9.3
  Built: 1708357248
  BuiltTime: Mon Feb 19 09:40:48 2024
  GitCommit: ""
  GoVersion: go1.21.7
  Os: linux
  OsArch: linux/arm64
  Version: 4.9.3

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

The issue occurs regardless of rootfulness.

This issue occurs whenever the log size gets above a certain point, that point seems dependent on hardware. Seems to be a race condition where the stop point is based on a certain time spent in the log reader, not an amount of data output. The issue occurs over the API, too, so you can use curl with the --limit-rate command to test that. When the rate is smaller, the amount of text downloaded is also smaller.

Command for flavor, overwrites binary-output2.bin in the current directory:

curl \
    --output 'binary-output2.bin' \
    --no-buffer \
    -XGET \
    --limit-rate=1000K
    --unix-socket "/run/user/${uid}/podman/podman.sock" \
    "http://localhost/v5.0.0/libpod/containers/${container}/logs?follow=True&stderr=True&stdout=True"
github-actions[bot] commented 7 months ago

A friendly reminder that this issue had no activity for 30 days.

pspacek commented 5 months ago

I confirm this observation. Observed on podman 5.1.1 running on Arch Linux x86_64.

Luap99 commented 4 months ago

@UniversalSuperBox Thanks for the script, I managed to hit it after 1000 iterations.

Looking at our code I cannot see anything in particular wrong but we use github.com/nxadm/tail for the tailing functionality and there is https://github.com/nxadm/tail/issues/67 reported so I assume this is a bug over there.

Luap99 commented 4 months ago

Actually running more tests it seem to be releated to https://github.com/nxadm/tail/issues/37 I think

Luap99 commented 4 months ago

To a lot longer than it should have been but https://github.com/nxadm/tail/pull/71 should fix the issue. Locally I am at 40000 iterations now without seeing the bug. Without the patch I was hitting it after 100-1000 iterations

rhatdan commented 4 months ago

Hopefully nxadm/tail is still active, last commit was 8 months ago.

pspacek commented 2 months ago

Is it time yet to fork nxadm/tail? Unreliable tail breaks automation scripts left and right.