containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Throughput reported by 'podman push' is too high #20634

Open syedriko opened 7 months ago

syedriko commented 7 months ago

Issue Description

When pushing an image to quay.io, podman reports unrealistic throughput numbers:

$ podman push quay.io/foo/bar:zoo
Getting image source signatures
Copying blob 40d0b5be75be done   |
Copying blob e2604a9fd705 [==================================>---] 86.0MiB / 92.9MiB | 350.0 MiB/s
Copying blob 2e29575f0513 [==========================>-----------] 94.0MiB / 130.6MiB | 384.9 MiB/s

I've seen it report up to 1.1 GiB/s, but I failed to capture that. My upload bandwidth is 20 Mbps as reported by https://speedtest.xfinity.com/results, so these numbers are not realistic; something is off with the arithmetic there.
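
For scale, 20 Mbps is roughly 2.4 MiB/s, so a reported 350 MiB/s would be on the order of 150 times the available upstream bandwidth.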

Steps to reproduce the issue

  1. Measure your upload bandwidth.
  2. Push a somewhat large (> 100 MB) image to an image registry and watch the throughput reported by podman.
  3. Compare the throughput reported by podman to the available upload bandwidth.

Describe the results you received

podman reported an image upload speed of 350.0 MiB/s

Describe the results you expected

podman reports an image upload speed within my cable modem's upload bandwidth

podman info output

$ podman info
host:
  arch: amd64
  buildahVersion: 1.32.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-3.fc39.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 95.32
    systemPercent: 0.66
    userPercent: 4.02
  cpus: 16
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: workstation
    version: "39"
  eventLogger: journald
  freeLocks: 2048
  hostname: fedora
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.5.10-300.fc39.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 10782474240
  memTotal: 67100966912
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.8.0-1.fc39.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.8.0
    package: netavark-1.8.0-2.fc39.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.8.0
  ociRuntime:
    name: crun
    package: crun-1.11.1-1.fc39.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.11.1
      commit: 1084f9527c143699b593b44c23555fb3cc4ff2f3
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20231004.gf851084-1.fc39.x86_64
    version: |
      pasta 0^20231004.gf851084-1.fc39.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-1.fc39.x86_64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8587571200
  swapTotal: 8589930496
  uptime: 4h 52m 31.00s (Approximately 0.17 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/syedriko/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/syedriko/.local/share/containers/storage
  graphRootAllocated: 1022488477696
  graphRootUsed: 346458644480
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 67
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/syedriko/.local/share/containers/storage/volumes
version:
  APIVersion: 4.7.0
  Built: 1695838680
  BuiltTime: Wed Sep 27 14:18:00 2023
  GitCommit: ""
  GoVersion: go1.21.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.7.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

Additional information


giuseppe commented 7 months ago

Is the value constantly off, or is it a temporary peak?

syedriko commented 7 months ago

Is the value constantly off, or is it a temporary peak?

The values change; I've seen them fluctuate between 300 MiB/s and 1.1 GiB/s, but the entire range is unrealistic.

saschagrunert commented 7 months ago

I see the same issue with Podman 4.7:

Copying blob 21b98c2cb85e [========>-----------------------------] 25.0MiB / 100.0MiB | 1.3 GiB/s

I think we introduced the change in https://github.com/containers/image/pull/2015. Since pull stats are fine, I assume that we calculate something different when doing an image push.

Buildah is affected as well, so I think this one is a more general issue.

jmontleon commented 6 months ago

I see the same. It's reporting over 100 MiB/s for a large image; my internet connection is only ~40 MB/s for upload.

saschagrunert commented 6 months ago

Unfortunately I haven't had any further time to spend on this. @mtrmac do you have a rough guess where the issue may come from?

mtrmac commented 6 months ago

I didn’t investigate in detail now, just some quick points:

vbauerster commented 4 months ago

@mtrmac both of your points are correct. In short, this is happening because an EWMA-based decorator is being used, whose goal is to measure “instantaneous” throughput. You just hit an edge case: I'm pretty sure it works OK with pull (download) and not OK with push (upload). Most of the time a download operation involves a slow reader and an upload a slow writer. It means:

You have several options to fix this. The quickest and simplest is to use average-based decorators instead. But if you'd like to keep an EWMA-based one, here are the options:

You don't have to pick just one approach. I would handle the push case with ProxyWriter and use 90 as the EWMA weight value instead of 30 (but you'd better check the end result and tune accordingly). The underlying ewma lib treats 30 as a special case, which defaults to the variant that doesn't need WARMUP_SAMPLES.

Last but not least, manual bar increments require the use of the Ewma-prefixed variants in order for EWMA-based decorators to work correctly (ProxyReader and ProxyWriter handle this automatically).
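
To make those options concrete, here is a minimal, illustrative sketch of how the mpb pieces mentioned above fit together; the function, the decorator arguments, and the weight of 90 are assumptions for illustration, not the actual containers/image code:

package progress // illustrative package name

import (
    "io"

    "github.com/vbauerster/mpb/v8"
    "github.com/vbauerster/mpb/v8/decor"
)

// copyWithProgress copies src to dst for an upload of a known size,
// reporting progress on a single bar. Sketch only, not containers/image code.
func copyWithProgress(dst io.Writer, src io.Reader, size int64) error {
    p := mpb.New()
    bar := p.AddBar(size,
        mpb.PrependDecorators(decor.Name("Copying blob ")),
        mpb.AppendDecorators(
            // EWMA-based "instantaneous" speed, using 90 as the weight
            // suggested above (30 is the library's special default case).
            decor.EwmaSpeed(decor.SizeB1024(0), "% .1f", 90),
            // Simpler alternative: an average-based decorator that just
            // divides bytes copied by elapsed time:
            // decor.AverageSpeed(decor.SizeB1024(0), "% .1f"),
        ),
    )

    // For a push (upload), wrap the writer, so the time spent in Write()
    // -- the slow side -- is what feeds the EWMA decorator.
    pw := bar.ProxyWriter(dst)
    defer pw.Close()

    n, err := io.Copy(pw, src)
    if err != nil || n < size {
        bar.Abort(false) // don't let p.Wait() block on an incomplete bar
    }
    p.Wait()
    return err
}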

mtrmac commented 4 months ago

@vbauerster Thank you very much for the advice!

To reformulate in simple terms: reading the ProgressReader implementation, the time being recorded for throughput is only the time spent in Read(). So, in the “slow consumer” situation, we spend a lot of time between each Read call and very little time inside the Read call.

In that case, the throughput computed is not just “within a factor of 2 or so”, but it can be many times the bandwidth of the link. I think that’s the full explanation, with no mysteries left.
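
For instance, if each 1 MiB Read() returns in 1 ms but the consumer then takes 500 ms to compress and upload that chunk, a decorator fed only the Read() time reports roughly 1 GiB/s while the pipeline is actually moving about 2 MiB/s.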

Our pipeline design is such that the pipeline is actually executed “inside the destination”, and ultimately the slow write calls are only observed by that destination. In particular, pushes use the Go HTTP stack, where the input is provided as an io.Reader; we don’t have access to the Write-like operations that happen on the network socket deep inside http.Transport. That seems pretty hard to integrate with the ProxyWriter code measuring time spent.

One possible ~general solution for us might be to replace ProxyReader with a custom implementation which does not measure just the time inside Read, but all of the time since the previous Read finished. That would, I think, end up reporting the effective throughput of the pipeline — but still in terms of the uncompressed input, so within a factor of ~2 of the network throughput.
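
A minimal sketch of what such a replacement could look like; wallClockReader and reportFunc are illustrative names, not anything that exists in containers/image:

package progress // illustrative package name

import (
    "io"
    "time"
)

// reportFunc is an illustrative callback; in practice the byte count and
// duration would be fed to the progress bar (e.g. via an Ewma-aware increment).
type reportFunc func(n int, elapsed time.Duration)

// wallClockReader attributes to each Read() all of the wall-clock time since
// the previous Read() returned, not just the time spent inside Read(), so the
// derived rate is the effective throughput of the whole pipeline (still in
// terms of the uncompressed input).
type wallClockReader struct {
    inner  io.Reader
    report reportFunc
    last   time.Time
}

func (r *wallClockReader) Read(p []byte) (int, error) {
    if r.last.IsZero() {
        r.last = time.Now() // first call: start the clock now
    }
    n, err := r.inner.Read(p)
    now := time.Now()
    // Includes the time the consumer (compression + HTTP upload) spent
    // between the previous Read() returning and this call finishing.
    r.report(n, now.Sub(r.last))
    r.last = now
    return n, err
}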


An additional complication for push is that we know the size of the input, but we don’t know the size of the to-be-compressed output. So we would, ideally, want to report the progress percentage in terms of the uncompressed data, but throughput in terms of the compressed data — both on one progress bar line.

I’m sure that can be built (completely from scratch, or maybe by somehow providing a custom BarFiller), but it’s not something I plan to immediately work on.


Last but not least, manual bar increments require the use of the Ewma-prefixed variants in order for EWMA-based decorators to work correctly (ProxyReader and ProxyWriter handle this automatically).

Thanks! https://github.com/containers/image/pull/2308 should fix that.
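
For the manual-increment case, the shape of that change is roughly the following; an illustrative sketch, not the actual diff in containers/image#2308:

package progress // illustrative package name

import (
    "io"
    "time"

    "github.com/vbauerster/mpb/v8"
)

// ewmaCopy feeds manual increments to a bar in a way that keeps EWMA-based
// decorators meaningful. Illustrative sketch only.
func ewmaCopy(bar *mpb.Bar, dst io.Writer, src io.Reader) error {
    buf := make([]byte, 1<<20)
    for {
        start := time.Now()
        n, rerr := src.Read(buf)
        if n > 0 {
            if _, werr := dst.Write(buf[:n]); werr != nil {
                return werr
            }
            // EwmaIncrBy records the byte count together with how long this
            // iteration took; a plain bar.IncrBy(n) would leave the EWMA
            // speed decorator without any timing data.
            bar.EwmaIncrBy(n, time.Since(start))
        }
        if rerr == io.EOF {
            return nil
        }
        if rerr != nil {
            return rerr
        }
    }
}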

vbauerster commented 4 months ago

To reformulate in simple terms: reading the ProgressReader implementation, the time being recorded for throughput is only the time spent in Read(). So, in the “slow consumer” situation, we spend a lot of time between each Read call and very little time inside the Read call.

Correct; the opposite is true for the ProxyWriter implementation, where only the time spent in Write([]byte) is recorded.

One possible ~general solution for us might be to replace ProxyReader with a custom implementation which does not measure just the time inside Read, but all of the time since the previous Read finished. That would, I think, end up reporting the effective throughput of the pipeline — but still in terms of the uncompressed input, so within a factor of ~2 of the network throughput.

If you don't have easy access to a writer, then such a ~general solution is very plausible. It's worth mentioning one more time that the average decorators just work as expected most of the time.

mtrmac commented 4 months ago

I think if we spend 10 minutes uploading a layer, and then the upload stalls, we would want to converge to showing a 0 throughput reasonably quickly; not reporting half the throughput after another 10 minutes, quarter after another 20 minutes.

But then again, in most situations pushes complete well within a minute, or maybe within a few; so that might not be worth worrying about too much.

stevenschlansker commented 4 months ago

I think if we spend 10 minutes uploading a layer, and then the upload stalls, we would want to converge to showing a 0 throughput reasonably quickly; not reporting half the throughput after another 10 minutes, quarter after another 20 minutes. But then again, in most situations pushes complete well within a minute, or maybe within a few; so that might not be worth worrying about too much.

From the perspective of a mildly-informed onlooker who noticed the currently busted progress indications: this is indeed an edge case, but it's also an important one. When the upload pace is consistent, the algorithm hardly matters; EWMA, average, etc. will all look the same for constant inputs.

The biggest value of a progress bar is for longer operations (many minutes), where it should reflect an overall change in the shape of the operation ("the first half went at 10 MB/s and the second half is now at 200 KB/s"). So while I appreciate that there is only finite time to spend on niceties like this, I don't think you're worrying about it too much at all, since these edge cases are where an updating display is most useful :)

TrevorBenson commented 4 months ago

In that case, the throughput computed is not just “within a factor of 2 or so”, but it can be many times the bandwidth of the link. I think that’s the full explanation, with no mysteries left.

Agreed. I see a fairly constant 800 MiB/s - 1.2 GiB/s value for a 272 MiB layer that has been pushing for 3 minutes before it even breaks 150 MiB transferred.

Glad I found this ticket, I was starting to think I was going a little stir crazy :smile:

damaestro commented 3 months ago

I'm seeing order-of-magnitude issues: 1.2 GiB/s is reported when in reality the network stack is running at 12 Mbps (~1.1 MiB/s).

podman-4.9.3-1.fc39.x86_64

Frankkkkk commented 2 months ago

(screenshot: scrot-2024-04-13_22:41:21)

Looks to be off by a factor of ~1024; built from b8a684b64d77fc97b3b539dd5891319756c15c39.

frittentheke commented 1 week ago

The same issue (wrong and fluctuating throughput) is also present for pulls. Fixing one will likely also fix the other, but I wanted to make a note of it.