containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Podman randomly stops forwarding traffic with pasta networking #19405

Closed: waffshappen closed this issue 1 year ago

waffshappen commented 1 year ago

Issue Description

When binding a port with podman run -p 8080:8080 --network pasta $otherargs, at some random point in time after the container starts, external traffic can no longer reach services bound inside the container. There are no log entries in journald and no network changes; it happens completely at random, on multiple systems.

This works if the port is bound to a specific IP (127.0.0.1, for example) instead.
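
For illustration, a minimal reproduction along these lines might look like the following (the image, container names and the Python test server are placeholders; any long-running listener should behave the same):

# forward bound to all host addresses -- the case that eventually stops working
podman run -d --rm --name pasta-test --network pasta -p 8080:8080 \
    registry.fedoraproject.org/fedora:latest python3 -m http.server 8080

# forward bound to a specific address -- reported to keep working
podman run -d --rm --name pasta-test-lo --network pasta -p 127.0.0.1:8080:8080 \
    registry.fedoraproject.org/fedora:latest python3 -m http.server 8080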

If I remember correctly, this started happening around or shortly after the 4.5.0 release, but it still occurs after multiple passt and podman updates on Fedora since then.

Steps to reproduce the issue

  1. Start a podman container (rootless) with --network pasta and a bound -p 8080:8080 port
  2. Access the working service (e.g. Mumble, gitea) and then wait (usually takes a day)
  3. Try to access the service again - Service now doesn't respond

Describe the results you received

Nothing inside the container can respond to traffic on external IPs anymore

Describe the results you expected

Services should still work

podman info output

host:
  arch: amd64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 99.2
    systemPercent: 0.38
    userPercent: 0.43
  cpus: 8
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    version: "38"
  eventLogger: journald
  hostname: [...]
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.4.4-200.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 238510080
  memTotal: 8256086016
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.5-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.5
      commit: b6f80f766c9a89eb7b1440c0a70ab287434b17ed
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 7763914752
  swapTotal: 8255434752
  uptime: 309h 34m 23.00s (Approximately 12.88 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/tobias/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/tobias/.local/share/containers/storage
  graphRootAllocated: 123086045184
  graphRootUsed: 8763510784
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 5
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/tobias/.local/share/containers/storage/volumes
version:
  APIVersion: 4.5.1
  Built: 1685123928
  BuiltTime: Fri May 26 19:58:48 2023
  GitCommit: ""
  GoVersion: go1.20.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.5.1

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

No response

Additional information

No response

Luap99 commented 1 year ago

Please provide your pasta version. Does it only affect the port forwarding, or are outgoing connections also affected? Is the pasta process still running when this happens?

cc @sbrivio-rh @dgibson

waffshappen commented 1 year ago

Please provide your pasta version.

pasta 0^20230625.g32660ce-1.fc38.x86_64

Does it only affect the port forwarding, or are outgoing connections also affected?

Good point, I didn't think of that. Indeed, the container cannot access anything outside once this occurs:

curl -v http://1.1.1.1
*   Trying 1.1.1.1:80...
* Immediate connect fail for 1.1.1.1: Network is unreachable

This works on the host of course.

Is the pasta process still running when this happens?

Yes. Here is an example for a Jellyfin container (port 8096 exposed) that is currently in the unreachable state:

ps -top | grep pas
tobias   2132678  0.0  0.1  76144 10168 ?        Ss   Jul27   0:35 /usr/bin/pasta --config-net -t 8096-8096:8096-8096 -u none -T none -U none --no-map-gw --netns /run/user/1000/netns/netns-e1e14b0a-c0a7-fc4c-1e75-b31778702fe1
dgibson commented 1 year ago

Thanks for the report, there's not a lot to go on here but there are a few clues.

* Immediate connect fail for 1.1.1.1: Network is unreachable

The fact that we're getting a network unreachable error suggests one of two things is happening:

  1. Somehow the container's network configuration has been altered so that it doesn't see the pasta provided network
  2. pasta isn't responding to ARP requests

@waffshappen to pin down which of these it is, could you provide the output for ip link, ip addr and ip route from within an affected container?
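
If the image ships iproute2, one way to collect that without entering the container interactively is podman exec (the container name here is a placeholder):

podman exec pasta-test ip link
podman exec pasta-test ip addr
podman exec pasta-test ip route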

The pasta process you show doesn't appear to be busy (it has used almost no CPU time), which suggests the problem is not that we've somehow got into an infinite loop doing nothing useful.

waffshappen commented 1 year ago

@waffshappen to pin down which of these it is, could you provide the output for ip link, ip addr and ip route from within an affected container?

Sure! Sorry for the delay; I had to wait for a newly spawned container with ip installed to go into the unresponsive state, because the other containers do not have it available by default, and without working connectivity I can't exactly add it.

From a Fedora 38 container with a simple webserver that is now unresponsive again after ~24 hours:

ip link

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc cake state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 32:8f:4f:df:b3:44 brd ff:ff:ff:ff:ff:ff

ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc cake state UNKNOWN group default qlen 1000
    link/ether 32:8f:4f:df:b3:44 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::308f:4fff:fedf:b344/64 scope link 
       valid_lft forever preferred_lft forever

ip route

No output

The pasta process you show doesn't appear to be busy (it has used almost no CPU time), which suggests the problem is not that we've somehow got into an infinite loop doing nothing useful.

That was what made me switch to pasta initially: it was rather easy for me to get slirp4netns into 100% usage situations (UDP packets sent to a host to check it wasn't reachable (nftables DROP), piling up in slirp forever), whereas pasta handled that setup perfectly.

sbrivio-rh commented 1 year ago

From a Fedora 38 container with a simple webserver that is now unresponsive again after ~24 hours:

[...]

ip route

No output

This might sound a bit absurd, but... do you happen to have a DHCP client (possibly NetworkManager) running in the container? I can't explain why routes would disappear after ~24 hours otherwise.

dgibson commented 1 year ago

Right, the container is losing all its addresses and routes. That certainly explains why it loses connectivity.

I can't see how pasta would touch those other than once during startup, and indeed pasta has no code at all to delete addresses or routes, only add them. So, I think it has to be something within the container actually doing the damage - maybe a DHCP client as @sbrivio-rh suggests. That still leaves the question of why it's doing that with pasta but not with slirp4netns (assuming that's the case, anyway).

I think the first stop is to look for any obvious DHCP clients in the container. If that doesn't lead to anything, I think we need to look for ways to monitor netlink activity within the container.

waffshappen commented 1 year ago

This might sound a bit absurd, but... do you happen to have a DHCP client (possibly NetworkManager) running in the container? I can't explain why routes would disappear after ~24 hours otherwise.

No, there are no active DHCP servers, clients or anything like that inside the containers. In fact, the specific Fedora container has a single running binary: bash. (Ditto for Jellyfin (unless they do something cursed for network sharing) and Mumble.)

I think the first stop is to look for any obvious DHCP clients in the container.

And to my knowledge none are running.

If that doesn't lead to anything, I think we need to look for ways to monitor netlink activity within the container.

If that is tcpdump-able, I might be able to just tcpdump until it loses connectivity and store that to a volume? But doing so would require re-creating the container to install it, then waiting until it happens again.

dgibson commented 1 year ago

This might sound a bit absurd, but... do you happen to have a DHCP client (possibly NetworkManager) running in the container? I can't explain why routes would disappear after ~24 hours otherwise.

No, there are no active DHCP servers, clients or anything like that inside the containers. In fact, the specific Fedora container has a single running binary: bash. (Ditto for Jellyfin (unless they do something cursed for network sharing) and Mumble.)

I think the first stop is to look for any obvious DHCP clients in the container.

And to my knowledge none are running.

Drat. So much for an easy answer.

If that doesn't lead to anything, I think we need to look for ways to monitor netlink activity within the container.

If that is tcpdump-able, I might be able to just tcpdump until it loses connectivity and store that to a volume? But doing so would require re-creating the container to install it, then waiting until it happens again.

It is possible to use tcpdump here, but there are dedicated tools (rtmon and ip monitor) that are probably more useful. However, those primarily show what netlink events occur, whereas we're more concerned with who is performing the netlink operations. For finding the latter systemtap might be more useful. All of these options will require recreating the container and waiting for the problem to reproduce, as you note.
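
As a rough sketch (the interface name and log paths are illustrative), the monitoring could be left running inside the affected container like this:

# print netlink events with timestamps as they happen
ip -ts monitor all > /var/tmp/netlink-events.log 2>&1 &

# or record them with rtmon and replay them later with ip monitor
rtmon file /var/tmp/rtmon.log &
# later: ip monitor file /var/tmp/rtmon.log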

Let's gather a little more background information before we attempt that though.

  1. Can you give a general idea of what the container is for, and what it does? Maybe this will give some clues as to where to focus our investigation.
  2. Can you provide the output from ps afx from within the container? This is on the off chance that there's something non-obvious to you that stands out to me or my colleagues as a clue.

I don't think we can 100% rule out a DHCP client as the culprit yet - it doesn't seem like one is running persistently, but it's possible one ran transiently at the point things broke. So, I think it's worth checking what effect running a DHCP client would have:

  3. In the container in the broken state, try running dhclient -v eno1 manually. Does this give any errors? Is network connectivity restored after it completes?
  4. In a container in unbroken state (but based on the same image), try the same thing. Does this give any errors? Does it break network connectivity?
waffshappen commented 1 year ago
1. Can you give a general idea of what the container is for, and what it does?  Maybe this will give some clues as to where to focus our investigation.

The Fedora container is just a fedora:latest container running bash and a minimal webserver, specifically to reproduce this.

The Mumble container only runs https://hub.docker.com/r/mumblevoip/mumble-server - to run, well, Mumble. It listens on TCP and UDP - it was how I first became aware of this bug, as users could not connect anymore, at random. I used pasta to have the container see the real user IPs without host networking - that allowed the internal abuse limits to not apply to "all" IPs, since rootless changed them all to localhost of course.

Jellyfin is an instance of https://hub.docker.com/r/jellyfin/jellyfin - all three of these are single-purpose containers, pretty much just running a single app with no other automations, management tools etc. added beyond what's shipped. And at least these three don't even use pods or similar.

2. Can you provide the output from `ps afx` from within the container?  This is on the off chance that there's something non-obvious to you that stands out to me or my colleagues as a clue.

None of these has ps installed by default; I'll add it to the "restart, install, wait" list.

However, from the host, here is the conmon process tree for the Fedora container:

3044584 ?        Ss     0:00 /usr/bin/conmon --api-version 1 -c fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45 -u fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45 -r /usr/bin/crun -b /home/tobias/.local/share/containers/storage/overlay-containers/fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45/userdata -p /run/user/1000/containers/overlay-containers/fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45/userdata/pidfile -n fedotest --exit-dir /run/user/1000/libpod/tmp/exits --full-attach -l journald --log-level warning --syslog --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/user/1000/containers/overlay-containers/fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45/userdata/oci-log -t --conmon-pidfile /run/user/1000/containers/overlay-containers/fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/tobias/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1000/containers --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/user/1000/libpod/tmp --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /home/tobias/.local/share/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45
3044586 pts/0    Ss+    0:00  \_ /bin/bash

(Yes, that is the entire tree for that conmon)

I don't think we can 100% rule out a DHCP client as the culprit yet - it doesn't seem like one is running persistently, but it's possible one ran transiently at the point things broke. So, I think it's worth checking what effect running a DHCP client would have:

3. In the container in the broken state, try running `dhclient -v eno1` manually.  Does this give any errors?  Is network connectivity restored after it completes?

None of these has dhclient installed by default; I'll add it to the "restart, install, wait" list. But I don't think it'll do much, because:

4. In a container in unbroken state (but based on the same image), try the same thing.  Does this give any errors?  Does it break network connectivity?

As root inside the new fedora container:

RTNETLINK answers: Operation not permitted
Open a socket for LPF: Operation not permitted
rhatdan commented 1 year ago

Any chance a software update involving firewalld or iptables is happening and clears the iptables rules? Or any other tool that might muck around with them?

dgibson commented 1 year ago
1. Can you give a general idea of what the container is for, and what it does?  Maybe this will give some clues as to where to focus our investigation.

The Fedora container is just a fedora:latest container running bash and a minimal webserver, specifically to reproduce this.

Ah, ok. So have you succeeded in reproducing in this test fedora container without any particular app? It wasn't previously clear to me that this has happened with multiple different container images.

What exactly is the minimal webserver you're using?

The Mumble container only runs https://hub.docker.com/r/mumblevoip/mumble-server - to run, well, Mumble. It listens on TCP and UDP - it was how I first became aware of this bug, as users could not connect anymore, at random. I used pasta to have the container see the real user IPs without host networking - that allowed the internal abuse limits to not apply to "all" IPs, since rootless changed them all to localhost of course.

Jellyfin is an instance of https://hub.docker.com/r/jellyfin/jellyfin - all three of these are single-purpose containers, pretty much just running a single app with no other automations, management tools etc. added beyond what's shipped. And at least these three don't even use pods or similar.

Ok, understood.

2. Can you provide the output from `ps afx` from within the container?  This is on the off chance that there's something non-obvious to you that stands out to me or my colleagues as a clue.

None of these has ps installed by default; I'll add it to the "restart, install, wait" list.

If the problem has reproduced on the fedora container, then I don't think we need this info from the others.

However, from the host, here is the conmon process tree for the Fedora container:

3044584 ?        Ss     0:00 /usr/bin/conmon --api-version 1 -c fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45 -u fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45 -r /usr/bin/crun -b /home/tobias/.local/share/containers/storage/overlay-containers/fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45/userdata -p /run/user/1000/containers/overlay-containers/fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45/userdata/pidfile -n fedotest --exit-dir /run/user/1000/libpod/tmp/exits --full-attach -l journald --log-level warning --syslog --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/user/1000/containers/overlay-containers/fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45/userdata/oci-log -t --conmon-pidfile /run/user/1000/containers/overlay-containers/fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/tobias/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1000/containers --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/user/1000/libpod/tmp --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /home/tobias/.local/share/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg fc893e44d9b601b3edf1f73ad7b400b25138788d169479cc5f673e6cc3248f45
3044586 pts/0    Ss+    0:00  \_ /bin/bash

(Yes, that is the entire tree for that conmon)

Ok, thanks.

I don't think we can 100% rule out a DHCP client as the culprit yet - it doesn't seem like one is running persistently, but it's possible one ran transiently at the point things broke. So, I think it's worth checking what effect running a DHCP client would have:

3. In the container in the broken state, try running `dhclient -v eno1` manually.  Does this give any errors?  Is network connectivity restored after it completes?

None of these has dhclient installed by default; I'll add it to the "restart, install, wait" list. But I don't think it'll do much, because:

Ok. Again, as long as the problem has reproduced in the fedora container, I don't think we need to try this anywhere else.

4. In a container in unbroken state (but based on the same image), try the same thing.  Does this give any errors?  Does it break network connectivity?

As root inside the new fedora container:

RTNETLINK answers: Operation not permitted
Open a socket for LPF: Operation not permitted

Ah, drat. I forgot that's how the permissions worked with podman. Which... come to think of it rather nixes my theory that something within the container is going rogue. Given the permission errors above, anything that's doing that should hit the same permission error.

So... something outside the container is messing with its network configuration. pasta itself is the obvious candidate, but as noted above, I don't see how anything in there could cause this symptom. I think I'm going to have to come up with a systemtap script or similar that will find things touching the container's netlink interface. That will take a little research; in the meantime, here are some more things we can try:

  1. Can you give the exact version / UUID of the fedora container image you're using? (That will give us the best chance to reproduce it here, which I've started trying)
  2. Can you reproduce the problem if you don't include the -p option to podman, and simply run a shell in the fedora container? This might eliminate a few more possibilities.
  3. Can you run another fedora container, install the iproute package within it and leave running the command: ip -ts monitor dev eno0 (replace eno0 with the name of the container's external network interface if it's different). This, alas, won't show us what is messing with netlink, but it will show us what netlink operations are happening, which might provide some clues.
Luap99 commented 1 year ago

Sorry, I should have mentioned that earlier: unless the container is started with --cap-add NET_ADMIN or --privileged (which adds all caps), the container process will not be allowed to modify the net namespace.

As for monitoring the netns, the best way would be to join it with this command:

podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} <NAME>)
# replace <NAME> with your actual container name or id

This gives you full capabilities for that netns while staying on the host fs so you do not need to install ip and other utils in the container.
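
For example (using the container name fedotest from the conmon command line above), something along these lines should work:

# join the container's network namespace from the host
podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} fedotest)
# then, inside the joined namespace, use the host's tools
ip addr
ip -ts monitor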

waffshappen commented 1 year ago

Any chance a software update involving firewalld or iptables is happening, and strikes the iptables rules? Or any other tool that might muck around with them?

Not automatically on that machine. dnf-automatic is set up, but only to automatically pull the packages so they're ready when I'm ready.

Ah, ok. So have you succeeded in reproducing in this test fedora container without any particular app? It wasn't previously clear to me that this has happened with multiple different container images.

What exactly is the minimal webserver you're using?

Both apache (default page) and simply:

while true; do { echo -ne "HTTP/1.0 200 OK\r\nContent-Length: $(wc -c <index.html)\r\n\r\n"; cat index.html; } | nc -l -p 8098; done

(With a minimal index.html next to it) break this way.

1. Can you give the exact version / UUID of the `fedora` container image you're using? (That will give us the best chance to reproduce it here, which I've started trying)

The specific image ID was ad2032316c2664fe02873afaf98e6ab5323d1980d4b99d8de55848cd6ffae1f8, but it has persisted across previous pulls and entirely different OS base images (Mumble's default build on Ubuntu, for example). I can try with the new image release - but since it happened in other containers with other distros, I didn't think it'd make a difference.

2. Can you reproduce the problem if you don't include the `-p` option to podman, and simply run a shell in the fedora container?  This might eliminate a few more possibilities.

This also loses its connectivity.

3. Can you run another `fedora` container, install the `iproute` package within it and leave running the command: `ip -ts monitor dev eno0` (replace `eno0` with the name of the container's external network interface if it's different).  This, alas, won't show us _what_ is messing with netlink, but it will show us what netlink operations are happening, which might provide some clues.

I have re-pulled the :latest fedora image and tested again; it took ~9 hours. Some values are shortened:

ip -ts monitor dev eno1
[2023-08-09T10:34:43.931476] 10.x.0.1 lladdr 94:18[host] STALE 
[2023-08-09T17:27:49.851562] Deleted 2: eno1    inet 10.x.0.4/24 brd 10.x.0.255 scope global dynamic noprefixroute eno1
       valid_lft 0sec preferred_lft 0sec
[2023-08-09T17:27:49.855557] Deleted broadcast 10.x.0.255 table local proto kernel scope link src 10.x.0.4 
[2023-08-09T17:27:49.857015] Deleted local 10.x.0.4 table local proto kernel scope host src 10.x.0.4 
[2023-08-09T17:27:49.857062] Deleted 10.x.0.1 lladdr 94:18[host] STALE 
[2023-08-09T17:42:18.203511] Deleted 2: eno1    inet6 2003:ed:[prefix]/128 scope global dynamic noprefixroute 
       valid_lft 0sec preferred_lft 0sec
[2023-08-09T17:42:18.203986] Deleted local 2003:ed:[prefix] table local proto kernel metric 0 pref medium

This specific host is running behind OpenWrt; the other affected machine is a Hetzner root server - in case that changes anything.

As for monitoring the netns, the best way would be to join it with this command:

I'll do that next time, thanks!

dgibson commented 1 year ago
ip -ts monitor dev eno1
[2023-08-09T10:34:43.931476] 10.x.0.1 lladdr 94:18[host] STALE 
[2023-08-09T17:27:49.851562] Deleted 2: eno1    inet 10.x.0.4/24 brd 10.x.0.255 scope global dynamic noprefixroute eno1
       valid_lft 0sec preferred_lft 0sec
[2023-08-09T17:27:49.855557] Deleted broadcast 10.x.0.255 table local proto kernel scope link src 10.x.0.4 
[2023-08-09T17:27:49.857015] Deleted local 10.x.0.4 table local proto kernel scope host src 10.x.0.4 
[2023-08-09T17:27:49.857062] Deleted 10.x.0.1 lladdr 94:18[host] STALE 
[2023-08-09T17:42:18.203511] Deleted 2: eno1    inet6 2003:ed:[prefix]/128 scope global dynamic noprefixroute 
       valid_lft 0sec preferred_lft 0sec
[2023-08-09T17:42:18.203986] Deleted local 2003:ed:[prefix] table local proto kernel metric 0 pref medium

Well, the addresses sure are being deleted. Alas, as I feared, seeing what and when it's happening isn't providing many clues as to who's doing it and why.

I'm working on writing a systemtap script which will be able to log what is performing these address removals, unfortunately I'm having some trouble getting it working (especially since I've encountered this bug along the way).

While I'm working on that, here are some more things we can try:

  1. What distro is running on your host? What kernel version is it running? These will help me make a systemtap script that works for your system.
  2. Are there any routing daemons or VPNs running on your host? These shouldn't interfere with the container obviously, but they are at least candidates for manipulating addresses and routes.
  3. Can you try the following as another reproduction attempt:

  * Run `pasta --config-net` as the same user you run the podman containers as.
  * This will bring up a "root" shell (actually only root within a new user namespace, similar to a container).
  * Verify that you have basic network connectivity within that shell
  * Run `ip -ts monitor dev eno1` within that shell, to monitor changes to its network configuration
  * Leave running for 24-48 hours

The idea here is to see if the same problem occurs on a "bare" pasta instance, or if the additional steps of podman creating the full container are somehow triggering the problem.

My own attempts to reproduce are still running. No signs of the problem so far, after a bit under a day. I'll leave it running, but at this point I strongly suspect something different on your system is triggering the problem.

This specific host is running behind OpenWrt; the other affected machine is a Hetzner root server - in case that changes anything.

I don't think that's relevant at this stage, but good to know just in case.

As for monitoring the netns, the best way would be to join it with this command:

I'll do that next time, thanks!

Given the new working theory, I don't think this is necessary for the current steps. However, it does allow the possibility of a (poor) interim workaround: for your "real" containers that encounter this problem you could log in that way and manually reconfigure the network.
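
A rough sketch of that interim workaround, using the nsenter invocation quoted above (the address, prefix length and gateway below are examples standing in for the redacted 10.x values from the logs, and have to match what pasta originally configured):

podman unshare nsenter --net=$(podman container inspect --format {{.NetworkSettings.SandboxKey}} fedotest)
# inside the joined namespace, re-add the expired address and the default route
ip addr add 10.0.0.4/24 dev eno1
ip route add default via 10.0.0.1 dev eno1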

waffshappen commented 1 year ago

I'm working on writing a systemtap script which will be able to log what is performing these address removals, unfortunately I'm having some trouble getting it working (especially since I've encountered this bug along the way).

Ah, of course when I come across a bug everything is maximum cursed in some way; I am getting used to that. ^^

1. What distro is running on your host?  What kernel version is it running? These will help me make a systemtap script that works for your system.

Fedora 38, both 6.3.12-200.fc38.x86_64 (locally accessible only) and 6.4.8-200.fc38.x86_64. One machine I am holding back from all changes so it can be debugged, just in case the bug gets fixed by a newer kernel somehow.

2. Are there any routing daemons or VPNs running on your host?  These shouldn't interfere with the container obviously, but they are at least candidates for manipulating addresses and routes.

On the Hetzner: yes, WireGuard on the host directly (a P2P site network for sharing home access to the 10.x.0.0/24 network on each side and allowing access to my self-hosted content (like Nextcloud) over a VPN).

On the local machine: not directly, no. WireGuard is running on OpenWrt in front of it instead. However, the local machine runs libvirtd with one VM.

3. Can you try the following as another reproduction attempt:

* Run `pasta --config-net` as the same user you run the podman containers as.

* This will bring up a "root" shell (actually only root within a new user namespace, similar to a container).

Slight (selinux) issue, as user:

pasta --config-net
Couldn't create user namespace: Permission denied

And as root:

pasta --config-net   
Don't run as root. Changing to nobody...
Can't set GID to 65534: Operation not permitted
* Verify that you have basic network connectivity within that shell

* Run `ip -ts monitor dev eno1` within that shell, to monitor changes to its network configuration

* Leave running for 24-48 hours

To be fair, that was SELinux blocking it from being called directly. With SELinux disabled, spawning pasta, and then re-enabling SELinux, it runs and has connectivity; I'll leave the shell open with the monitoring running.

Given the new working theory, I don't think this is necessary for the current steps. However, it does allow the possibility of a (poor) interim workaround: for your "real" containers that encounter this problem you could log in that way and manually reconfigure the network.

I've bitten the bullet: the affected containers that need to expose ports directly are running with host networking, or falling back to slirp (while trying to avoid triggering its bugs) for those that do fine with "all access is from localhost".

dgibson commented 1 year ago

I'm working on writing a systemtap script which will be able to log what is performing these address removals, unfortunately I'm having some trouble getting it working (especially since I've encountered this bug along the way).

Ah, of course when I come across a bug everything is maximum cursed in some way; I am getting used to that. ^^

Well, based on some further developments I'll relate below, alas, I cannot but agree.

1. What distro is running on your host?  What kernel version is it running? These will help me make a systemtap script that works for your system.

Fedora 38, both 6.3.12-200.fc38.x86_64 (locally accessible only) and 6.4.8-200.fc38.x86_64. One machine I am holding back from all changes so it can be debugged, just in case the bug gets fixed by a newer kernel somehow.

The original pasta connectivity bug? Or the systemtap bug? Or something else?

The good news is that I'm also running Fedora 38 with a similar kernel, so the chances are that if I can get a systemtap script working locally it should work for you too. The bad news is that the 6.4 kernels seem to be the ones that aren't working with systemtap currently, so you're likely to encounter the same problem.

The better news is that a draft fix for the systemtap bug was posted. The worse news is that, at least for me, it now fails differently: instead of a compile error I get a kernel oops.

2. Are there any routing daemons or VPNs running on your host?  These shouldn't interfere with the container obviously, but they are at least candidates for manipulating addresses and routes.

On the Hetzner: yes, WireGuard on the host directly (a P2P site network for sharing home access to the 10.x.0.0/24 network on each side and allowing access to my self-hosted content (like Nextcloud) over a VPN).

On the local machine: not directly, no. WireGuard is running on OpenWrt in front of it instead. However, the local machine runs libvirtd with one VM.

Ok, good to know. Probably not the culprit, based on that.

3. Can you try the following as another reproduction attempt:

* Run `pasta --config-net` as the same user you run the podman containers as.

* This will bring up a "root" shell (actually only root within a new user namespace, similar to a container).

Slight (selinux) issue, as user:

pasta --config-net
Couldn't create user namespace: Permission denied

Ah, right. That's a known issue with the selinux profile and recent kernels - @sbrivio-rh is working on it, but has had to battle through some additional complications.

And as root:

pasta --config-net   
Don't run as root. Changing to nobody...
Can't set GID to 65534: Operation not permitted

Right, pasta explicitly avoids running as root.

* Verify that you have basic network connectivity within that shell

* Run `ip -ts monitor dev eno1` within that shell, to monitor changes to its network configuration

* Leave running for 24-48 hours

To be fair, that was SELinux blocking it from being called directly. With SELinux disabled, spawning pasta, and then re-enabling SELinux, it runs and has connectivity; I'll leave the shell open with the monitoring running.

Great, thanks

Given the new working theory, I don't think this is necessary for the current steps. However, it does allow the possibility of a (poor) interim workaround: for your "real" containers that encounter this problem you could log in that way and manually reconfigure the network.

I've bitten the bullet: the affected containers that need to expose ports directly are running with host networking, or falling back to slirp (while trying to avoid triggering its bugs) for those that do fine with "all access is from localhost".

sbrivio-rh commented 1 year ago

I just realised:

ip -ts monitor dev eno1
[2023-08-09T10:34:43.931476] 10.x.0.1 lladdr 94:18[host] STALE 
[2023-08-09T17:27:49.851562] Deleted 2: eno1    inet 10.x.0.4/24 brd 10.x.0.255 scope global dynamic noprefixroute eno1
       valid_lft 0sec preferred_lft 0sec

...that the valid lifetime at this point is 0. The address is dynamic in the sense that it's not permanent. When pasta adds addresses, by choice, it doesn't use the IFA_F_PERMANENT netlink flag because, strictly speaking, that means "configured by the user", and pasta is not the user... and we didn't notice any issue with that.

But maybe there's some new (unintended, I guess, I couldn't find anything relevant in recent kernel commits) behaviour implied by the kernel which makes IFA_F_PERMANENT necessary.

@waffshappen, could you try something like this snippet (patch applies on top of current git HEAD)?

diff --git a/netlink.c b/netlink.c
index 1226379..f7b2907 100644
--- a/netlink.c
+++ b/netlink.c
@@ -604,6 +604,7 @@ int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
                .ifa.ifa_index     = ifi,
                .ifa.ifa_prefixlen = prefix_len,
                .ifa.ifa_scope     = RT_SCOPE_UNIVERSE,
+               .ifa.ifa_flags     = IFA_F_PERMANENT,
        };
        ssize_t len;

@@ -611,7 +612,7 @@ int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
                size_t rta_len = RTA_LENGTH(sizeof(req.set.a6.l));

                /* By default, strictly speaking, it's duplicated */
-               req.ifa.ifa_flags = IFA_F_NODAD;
+               req.ifa.ifa_flags |= IFA_F_NODAD;

                len = offsetof(struct req_t, set.a6) + sizeof(req.set.a6);

If it's too much of a hassle for you to try building with this, I can also provide a build -- let me know.
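
For reference, building a patched pasta from source is fairly quick; roughly (assuming git, gcc and make are available):

git clone https://passt.top/passt
cd passt
# apply the diff above, then build and run the freshly built binary
make
./pasta --config-net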

waffshappen commented 1 year ago

The original pasta connectivity bug? Or the systemtap bug? Or something else?

pasta

The better news is that a draft fix for the systemtap bug was posted. The worse news is that, at least for me, it now fails differently: instead of a compile error I get a kernel oops

Yeah, maximum cursed as usual ^^

To be fair, that was SELinux blocking it from being called directly. With SELinux disabled, spawning pasta, and then re-enabling SELinux, it runs and has connectivity; I'll leave the shell open with the monitoring running.

In the shell spawned with just pasta --config-net the same bug occurs:

ip -ts monitor dev eno1
[2023-08-10T15:28:26.971513] Deleted 2: eno1    inet 10.x.0.4/24 brd 10.x.0.255 scope global dynamic noprefixroute eno1 valid_lft 0sec preferred_lft 0sec
[2023-08-10T15:28:26.971816] Deleted broadcast 10.x.0.255 table local proto kernel scope link src 10.x.0.4 
[2023-08-10T15:28:26.974232] Deleted local 10.x.0.4 table local proto kernel scope host src 10.x.0.4 
[2023-08-10T15:28:26.974250] Deleted 10.x.0.1 lladdr 94:18[host] STALE 
[2023-08-10T19:19:40.187473] Deleted 2: eno1    inet6 2003:ed:[prefix]/128 scope global dynamic noprefixroute valid_lft 0sec preferred_lft 0sec
[2023-08-10T19:19:40.187719] Deleted local 2003:ed:[prefix] table local proto kernel metric 0 pref medium
[2023-08-11T02:55:20.923517] Deleted 2: eno1    inet6 2003:ed:[prefix]/64 scope global dynamic noprefixroute valid_lft 0sec preferred_lft 0sec
[2023-08-11T02:55:20.923779] Deleted local 2003:ed:[prefix] table local proto kernel metric 0 pref medium
 diff --git a/netlink.c b/netlink.c

I'll try building that and running it; I'll let you know what happens (since I can reproduce the bug with just pasta, I'll spawn it from the build output without trying to get podman and SELinux to cooperate with the result).

Also, does this mean that pasta doesn't handle address changes on the host as well? Or are new address events handled already? My external addresses, especially IPv6 since that propagates through the entire network, change constantly at home. (And I guess my IPv4 becomes stale for pasta as the default lease time runs out?)

sbrivio-rh commented 1 year ago

Also, does this mean that pasta doesn't handle address changes on the host as well?

At the moment, address changes are handled implicitly, in the sense that when addresses change on the host, pasta will just naturally switch to NAT (assuming default options, that is, with host addresses copied to the containers). However:

Or are new address events handled already? My external addresses, especially IPv6 since that propagates through the entire network, change constantly at home. (And I guess my IPv4 becomes stale for pasta as the default lease time runs out?)

...we had feature requests to monitor IPv6 prefix changes, via netlink, and update the prefix in the container accordingly. @dgibson is working on a more flexible model for forwarding and address translation; once that part is done, we'll be able to support this.

For IPv4 we could probably support this with a netmask. At the moment, if the address on your host expires, pasta will just use the new address like any other process running there, but the address in the container should be preserved.

dgibson commented 1 year ago

Also, does this mean that pasta doesn't handle address changes on the host as well?

At the moment, address changes are handled implicitly, in the sense that when addresses change on the host, pasta will just naturally switch to NAT (assuming default options, that is, with host addresses copied to the containers). However:

To elaborate on this: at present we don't monitor for changes to the host addresses. We have considered it for various reasons, and may do so in future. That doesn't mean that a host address change will break container connectivity, though: the container won't see an address change, but it will still be able to make connections outward, and they'll be implicitly NATted. For inbound connections it depends: if pasta's forwarded ports aren't bound to a specific address, it will again implicitly NAT. If they are, and that address changes on the host, then as you'd expect you'll no longer be able to access that forwarding.
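
In podman terms, the difference is between a wildcard forward and one pinned to a host address (the addresses and image are illustrative):

# wildcard forward: inbound keeps working via implicit NAT even if the host address changes
podman run -d --network pasta -p 8080:8080 registry.fedoraproject.org/fedora python3 -m http.server 8080

# forward pinned to a host address: inbound breaks if 192.0.2.10 disappears from the host
podman run -d --network pasta -p 192.0.2.10:8080:8080 registry.fedoraproject.org/fedora python3 -m http.server 8080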

Or are new address events handled already? My external addresses, especially IPv6 since that propagates through the entire network, change constantly at home. (And I guess my IPv4 becomes stale for pasta as the default lease time runs out?)

...we had feature requests to monitor IPv6 prefix changes, via netlink, and update the prefix in the container accordingly. @dgibson is working on a more flexible model for forwarding and address translation; once that part is done, we'll be able to support this.

Actually, there's less overlap between the forwarding model and handling address updates than you might think. The new forwarding option would certainly give a lot more flexibility with how exactly we'd handle a changing host address, though.

For IPv4 we could probably support this with a netmask. At the moment, if the address on your host expires, pasta will just use the new address like any other process running there, but the address in the container should be preserved.

dgibson commented 1 year ago

So, @sbrivio-rh and I discussed this bug yesterday... and I think we cracked it. Addresses do have a lifetime, seen in the ip addr output as valid_lft and preferred_lft. If the address is set statically / manually, it will be forever, but if it is managed actively (e.g. by DHCP) then it will have a finite lifetime.

We think that when pasta copies address information from the host, it inadvertently copies the lifetimes as well. So if the host has addresses with a finite lifetime, they'll have a finite lifetime in the guest as well, and eventually expire. However, the guest or container doesn't have the DHCP client (or whatever was managing the address on the host), so the address simply goes away.

I'm currently working on confirming this and figuring out what to do about it.
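
The suspected mechanism is easy to demonstrate independently of pasta, as root on a scratch dummy interface: an address added with a finite valid_lft is removed by the kernel once it expires.

ip link add dummy0 type dummy && ip link set dummy0 up
ip addr add 192.0.2.5/24 dev dummy0 valid_lft 30 preferred_lft 30
ip addr show dev dummy0    # valid_lft counts down from 30sec
sleep 40
ip addr show dev dummy0    # the address is gone: the kernel deleted it on expiry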

dgibson commented 1 year ago

@waffshappen

I've made some changes that I think will fix the problem - essentially it just strips the lifetime information off the host address when copying it to the container. I have a branch here with the revised code. If you could try that out, that would be great.

dgibson commented 1 year ago

I've also entered this in the pasta bugzilla as bug 70 so we have a record there.

waffshappen commented 1 year ago

I've made some changes that I think will fix the problem - essentially it just strips the lifetime information off the host address when copying it to the container. I have a branch here with the revised code. If you could try that out, that would be great.

Testing that does work, and the pasta --config-net shell has not lost connectivity.

The only weird thing I can see is shortly after a ping attempt from it:

[2023-08-16T18:10:12.955496] 10.x.0.1 lladdr 94:18:[host] PROBE              
[2023-08-16T18:10:12.955625] 10.x.0.1 lladdr 94:18:[host] REACHABLE          
[2023-08-16T18:10:37.019494] 10.x.0.1 lladdr 94:18:[host] STALE

but it works just fine.

I have not tested changing the assigned DHCP IP, however.

dgibson commented 1 year ago

I've made some changes that I think will fix the problem - essentially it just strips the lifetime information off the host address when copying it to the container. I have a branch here with the revised code. If you could try that out, that would be great.

Testing that does work, and the pasta --config-net shell has not lost connectivity.

Excellent!

The only weird thing I can see is shortly after a ping attempt from it:

[2023-08-16T18:10:12.955496] 10.x.0.1 lladdr 94:18:[host] PROBE              
[2023-08-16T18:10:12.955625] 10.x.0.1 lladdr 94:18:[host] REACHABLE          
[2023-08-16T18:10:37.019494] 10.x.0.1 lladdr 94:18:[host] STALE

but it works just fine.

Right, I think that's just unrelated ARP/neighbour cache refresh activity.

I have not tested changing the assigned DHCP IP, however.

Ok. You mean the -a option to pasta I assume? By all means test this, but I don't think it will be affected. If my understanding of the cause of this problem is correct, when using -a we wouldn't have hit this problem in the first place because we simply assign that address to the guest, rather than copying it (with all attributes) from the host which is what caused the problem.
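
For completeness, a minimal sketch of what running with an explicitly assigned address looks like (the address is just an example; see pasta(1) for the full option set):

# assign a fixed address in the namespace instead of copying the host's
# address (and, before the fix, its lifetimes)
pasta --config-net -a 192.0.2.2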

Luap99 commented 1 year ago

Feel free to continue the discussion, but since the patch is applied (https://passt.top/passt/commit/?id=da0aeb9080c9d2e39b2ff600a9b2b03046ac219d), I'm closing this.