containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Pasta: "ASSERTION FAILED in flow_defer_handler (flow.c:315): !flow_new_entry" #22925

Closed: Jojonintendo closed this issue 5 months ago

Jojonintendo commented 5 months ago

Issue Description

I have a homelab with more than 70 podman containers, all behind a caddy reverse proxy that's also containerized. All of them are run with the following network config:

--net=pasta:--ipv4-only,-a,10.0.2.100,-n,24,-g,10.0.2.2,--dns-forward,10.0.2.3,-m,1500,--no-ndp,--no-dhcpv6,--no-dhcp \
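
For illustration, one such container would be started roughly like the sketch below; the container name and image are only placeholders, and only the --net=pasta:... string is taken from this setup:

  # Hypothetical invocation; name and image are placeholders:
  podman run -d --name adguard \
    --net=pasta:--ipv4-only,-a,10.0.2.100,-n,24,-g,10.0.2.2,--dns-forward,10.0.2.3,-m,1500,--no-ndp,--no-dhcpv6,--no-dhcp \
    docker.io/adguard/adguardhome:latest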

With pasta version 2024_05_23.765eb0b-1 I get the following error in caddy's logs:

ASSERTION FAILED in flow_defer_handler (flow.c:315): !flow_new_entry

This doesn't always happen, and seems to be triggered by the steps detailed below:

  1. Stop a container, for example AdGuard Home
  2. Try to access AdGuard's webui a few times, or have Uptime Kuma try to ping it
  3. The error appears in the logs, and none of the containers are accessible from outside

All the reverse proxying stops working; however, containers can still reach each other via the host's IP, 192.168.x.x (I have seen Home Assistant still able to talk to Zigbee2MQTT, ESPHome, etc.).

When this error appears, it's always the last line in the logs. Caddy stays up but no more logs are shown and no other service is reachable. Upon restarting caddy and the other container that was down (AdGuard in this example), everything works again.

This is troublesome as I have many containers self-updating daily with podman-auto-update.timer. If a container stays down too long the error is more likely to appear, and it never recovers by itself.
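
For reference, that auto-update mechanism boils down to a label on the container plus a systemd user timer; a minimal sketch (container name and image are placeholders, and podman auto-update also expects the container to be managed by a systemd unit):

  # Container opts in to auto-updates via a label:
  podman run -d --name adguard --label io.containers.autoupdate=registry docker.io/adguard/adguardhome:latest
  # Rootless user timer that periodically runs "podman auto-update":
  systemctl --user enable --now podman-auto-update.timer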

Reverting to pasta version 2024_05_10.7288448-1 fixes the issue completely.

Steps to reproduce the issue

  1. Stop a container, for example AdGuard Home
  2. Try to access AdGuard's webui a few times, or have Uptime Kuma try to ping it
  3. The error appears in the logs, and none of the containers are accessible from outside
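
Roughly, as a shell sketch (the container names and the proxied URL are hypothetical):

  podman stop adguard                                # 1. stop one backend container
  for i in 1 2 3; do                                 # 2. hit it through the reverse proxy a few times
    curl -m 5 -sS https://adguard.example.lan/ || true
  done
  podman logs --tail 5 caddy                         # 3. the assertion shows up as caddy's last log line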

Describe the results you received

The Caddy reverse proxy shows the pasta error in its logs and doesn't go any further, as if it were frozen. The container stays up and looks fine, though, and its health check doesn't trigger a restart. No service is proxied, so all services appear to be down from the outside.
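
For context, a health check that could trigger a restart on failure would be configured along these lines; the check command, image, and the choice of --health-on-failure here are assumptions, and in this case it would not fire because the container itself keeps reporting healthy:

  podman run -d --name caddy \
    --health-cmd 'wget -q -O /dev/null http://127.0.0.1:2019/config/ || exit 1' \
    --health-interval 30s \
    --health-on-failure restart \
    docker.io/library/caddy:latest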

Describe the results you expected

With pasta version 2024_05_10.7288448-1 this error doesn't show up, and caddy keeps working no matter how many services might be temporarily down.

podman info output

host:                                                                                                                  
  arch: amd64                     
  buildahVersion: 1.36.0    
  cgroupControllers:
  - cpu       
  - memory                                         
  - pids               
  cgroupManager: systemd
  cgroupVersion: v2    
  conmon:                                             
    package: /usr/bin/conmon is owned by conmon 1:2.1.12-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: e8896631295ccb0bfdda4284f1751be19b483264'
  cpuUtilization:
    idlePercent: 89.1
    systemPercent: 4.46       
    userPercent: 6.43                                                                                                  
  cpus: 8     
  databaseBackend: sqlite     
  distribution:        
    distribution: arch                              
    version: unknown                                                                                                   
  eventLogger: journald                                                                                                
  freeLocks: 1863                                                                                                      
  hostname: wall-e             
  idMappings:             
    gidmap:                                
    - container_id: 0      
      host_id: 984
      size: 1                            
    - container_id: 1                                                                                                                                                                                                                         
      host_id: 100000                               
      size: 65536       
    uidmap:                                         
    - container_id: 0    
      host_id: 1000                  
      size: 1                                                                                                          
    - container_id: 1               
      host_id: 100000                                                                                                  
      size: 65536    
  kernel: 6.9.3-arch1-1        
  linkmode: dynamic                                   
  logDriver: journald
  memFree: 35417407488
  memTotal: 67295506432
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: /usr/lib/podman/aardvark-dns is owned by aardvark-dns 1.11.0-1
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.11.0
    package: /usr/lib/podman/netavark is owned by netavark 1.11.0-1
    path: /usr/lib/podman/netavark
    version: netavark 1.11.0
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 1.15-1
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: /usr/bin/pasta is owned by passt 2024_05_10.7288448-1
    version: |
      pasta 2024_05_10.7288448
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law. 
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.3.1-1
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.8.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.5
  swapFree: 16807874560
  swapTotal: 17179865088
  uptime: 0h 38m 36.00s
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/jonas/.config/containers/storage.conf
  containerStore:
    number: 74
    paused: 0
    running: 74
    stopped: 0
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /home/jonas/SSD/podman/storage
  graphRootAllocated: 1000204886016
  graphRootUsed: 45313536000
  graphStatus:
    Build Version: Btrfs v6.8.1
    Library Version: "102"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 73
  runRoot: /home/jonas/SSD/podman/storage
  transientStore: false
  volumePath: /home/jonas/SSD/podman/storage/volumes
version:
  APIVersion: 5.1.1
  Built: 1717539130
  BuiltTime: Wed Jun  5 00:12:10 2024
  GitCommit: bda6eb03dcbcf12a5b7ae004c1240e38dd056d24-dirty 
  GoVersion: go1.22.3
  Os: linux
  OsArch: linux/amd64
  Version: 5.1.1

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

Pasta version 2024_05_23.765eb0b-1 is the latest in the repos, so there is no newer version I could try; I have pasta locked at 2024_05_10.7288448-1 instead.

The rest of the system uses the latest versions, including podman 5.1.1.
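
For anyone wanting to hold pasta back the same way, one way to do it on Arch (the exact cache filename is an assumption):

  # Reinstall the older package from the local pacman cache:
  sudo pacman -U /var/cache/pacman/pkg/passt-2024_05_10.7288448-1-x86_64.pkg.tar.zst
  # Then keep it from being upgraded by adding this to /etc/pacman.conf:
  #   IgnorePkg = passt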

Additional information

No response

baude commented 5 months ago

In the meantime, confirming the problem exists in the latest version (or even main branch) would be helpful.

Luap99 commented 5 months ago

cc @sbrivio-rh @dgibson

sbrivio-rh commented 5 months ago

In the meantime, confirming the problem exists in the latest version (or even main branch) would be helpful.

Generally a good idea, thanks :) but there's no need here: 2024_05_23.765eb0b-1 is already the latest version, and the only difference from current HEAD is four commits that have nothing to do with this issue.

sbrivio-rh commented 5 months ago

--net=pasta:--ipv4-only,-a,10.0.2.100,-n,24,-g,10.0.2.2,--dns-forward,10.0.2.3,-m,1500,--no-ndp,--no-dhcpv6,--no-dhcp \

It would be helpful to add debug options to this, but there's currently an issue with those versions of pasta and Podman: Podman passes --quiet by default, and pasta doesn't like --debug (or --trace) together with --quiet. :( I think we should eventually drop that check in pasta.

If you have the chance to rebuild pasta (dropping that check) or Podman (not passing --quiet), could you please add something like --trace,--log-file,/tmp/pasta_caddy.log,--log-size,1000000000 to that, and share the contents of the log file? Dropping IP addresses is fine if they're confidential.
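
For example, the network option from the issue description with those flags appended would become (just combining the two lines above):

  --net=pasta:--ipv4-only,-a,10.0.2.100,-n,24,-g,10.0.2.2,--dns-forward,10.0.2.3,-m,1500,--no-ndp,--no-dhcpv6,--no-dhcp,--trace,--log-file,/tmp/pasta_caddy.log,--log-size,1000000000 \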

dgibson commented 5 months ago

Yeah, I think "last option wins" is the usual convention for conflicting logging options. We should definitely switch to that for exactly this sort of reason.

The specific assertion indicates we have an internal bug where we're returning to the main loop having only partially constructed a new flow. Unfortunately, without the debug logging it's a bit hard to tell the sequence of events that lead up to it.

dgibson commented 5 months ago

I don't know for certain if it's the cause of this problem, but I did spot by inspection some error paths where we might not be cleaning up properly after ourselves, in a way that could cause this assertion to trip. Patch coming soon.

dgibson commented 5 months ago

Ok, I've now made a draft patch which should fix the problems I saw - again these plausibly could be the cause of this bug, but I can't be certain. I've posted it to the passt dev list, and I've also pushed it to a gitlab branch. @Jojonintendo if you're able to test that branch, that would be great.

Jojonintendo commented 5 months ago

I'm not very literate with git, but I think it's working. Here's how I've tried the new branch:

  1. git clone repo (the link in previous comment)
  2. git switch podman22925
  3. make && make install
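
Spelled out as shell (the clone URL is only given in the comment above, and the directory name is assumed):

  git clone <repository URL from the previous comment>
  cd passt
  git switch podman22925
  make && make install    # puts the pasta binary under /usr/local/bin by default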

I now have the pasta binary in /usr/local/bin, which is the one in use by the system. After a reboot everything seems fine, and I couldn't trigger the error. I'll leave it as is for a few days, but to me it sounds like it's fixed.

Previously this error would appear on every boot, while all the containers were also starting. Now it just boots fine and I can't trigger the failure, which was very easy to do before.

How long should I let this run for before considering it healthy?

Anyway, many thanks for the blazing fast support, amazing job everyone!

sbrivio-rh commented 5 months ago

How long should I let this run for before considering it healthy?

Looking at David's patch, I came up with an embarrassingly simple reproducer, which I think is similar to what happened in your case: the container would try to access an unreachable port using a local, but not loopback, address, and we end up in the first path that the patch fixes.

The reproducer is: ./pasta --config-net -t none -u none -T none -U none -- sh -c 'nc $(ip -j -4 route show | jq -rM ".[] | select(.dst == \"default\").gateway") 12345', assuming that port 12345 is not bound on the host.
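
The same command, split across lines with comments for readability (unchanged otherwise; it assumes jq is installed and pasta is run from the build tree):

  # --config-net copies the host's addresses and routes into the namespace;
  # nc then connects to the default gateway (a local, non-loopback address)
  # on a port nothing listens on:
  ./pasta --config-net -t none -u none -T none -U none -- sh -c \
    'nc $(ip -j -4 route show | jq -rM ".[] | select(.dst == \"default\").gateway") 12345'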

It's pretty bad, so I just made a new release with the fix. I guess it will reach Arch Linux soon.

Jojonintendo commented 5 months ago

It's already in the repos, I've installed it and rebooted the machine. Everything is fine, thanks for the information and quick fix!