Rootless podman 5.2 with pasta now publishes processes which only listen on 127.0.0.1 in the container

containers / podman


Rootless podman 5.2 with pasta now publishes processes which only listen on 127.0.0.1 in the container #24045

Open adelton opened 6 days ago

adelton commented 6 days ago

Issue Description

With previous rootless podman setups, having a process listen on 127.0.0.1 in the container and publishing that port to the host did not expose that process to the host. Or rather, while a connection could be made, it was killed right away (Connection reset by peer when tested with curl). This was very similar to the rootful podman behaviour (Couldn't connect to server).

With podman-5.2.2-1.fc40.x86_64 with passt-0^20240906.g6b38f07-1.fc40.x86_64 I see a change of behaviour -- the process in the container is reachable on the published port on the host even if the process in the container is supposed to only listen on 127.0.0.1.

Steps to reproduce the issue

Have Dockerfile to get us some server where we can easily control where it listens:

FROM registry.fedoraproject.org/fedora
RUN dnf install -y python3-django
RUN django-admin startproject mysite
WORKDIR /mysite
ENTRYPOINT [ "python3", "manage.py", "runserver" ]

podman build -t localhost/django .
podman rm -f django ; podman run --name django -d -p 8000:8000 localhost/django 127.0.0.1:8000
curl -s http://127.0.0.1:8000/ | head
In case the above curl does not show anything, curl http://127.0.0.1:8000/

Describe the results you received

With rootless podman-5.2.2-1.fc40.x86_64 with passt-0^20240906.g6b38f07-1.fc40.x86_64 I see


<!doctype html>

<html lang="en-us" dir="ltr">
    <head>
        <meta charset="utf-8">
        <title>The install worked successfully! Congratulations!</title>
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <style>
          html {

Describe the results you expected

With rootless podman-4.9.4-1.fc39.x86_64 with rootlessport I see

curl: (56) Recv failure: Connection reset by peer

With rootful setup, both podman-4.9.4-1.fc39.x86_64 and podman-5.2.2-1.fc40.x86_64, I get

curl: (7) Failed to connect to 127.0.0.1 port 8000 after 0 ms: Couldn't connect to server
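The three outcomes above (page served, connection reset, connection refused) can also be told apart without curl. The following is only a minimal sketch: the `probe` helper is a hypothetical name, not part of podman or pasta, and whether a reset shows up as an exception or as an empty read may depend on timing.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify what a client sees when it connects and sends a tiny HTTP request."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
            data = sock.recv(1024)
            # Any bytes back mean the server behind the published port answered.
            return "reachable" if data else "closed without data"
    except ConnectionRefusedError:
        return "refused"  # comparable to curl's error (7)
    except ConnectionResetError:
        return "reset"  # comparable to curl's error (56)
    except (socket.timeout, TimeoutError):
        return "timeout"
```

Calling `probe("127.0.0.1", 8000)` on the host after the `podman run` step should then report which of the cases a given setup falls into.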

podman info output

host:
  arch: amd64
  buildahVersion: 1.37.2
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.12-2.fc40.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: '
  cpuUtilization:
    idlePercent: 97.84
    systemPercent: 0.53
    userPercent: 1.63
  cpus: 2
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    version: "40"
  eventLogger: journald
  freeLocks: 2047
  hostname: redacted.domain.com
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.10.10-200.fc40.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 1263239168
  memTotal: 3036377088
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.12.2-2.fc40.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.12.2
    package: netavark-1.12.2-1.fc40.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.12.2
  ociRuntime:
    name: crun
    package: crun-1.17-1.fc40.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.17
      commit: 000fa0d4eeed8938301f3bcf8206405315bc1017
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240906.g6b38f07-1.fc40.x86_64
    version: |
      pasta 0^20240906.g6b38f07-1.fc40.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 3035623424
  swapTotal: 3035623424
  uptime: 0h 36m 13.00s
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
store:
  configFile: /home/test/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/test/.local/share/containers/storage
  graphRootAllocated: 16039018496
  graphRootUsed: 2594983936
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 6
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/test/.local/share/containers/storage/volumes
version:
  APIVersion: 5.2.2
  Built: 1724198400
  BuiltTime: Wed Aug 21 02:00:00 2024
  GitCommit: ""
  GoVersion: go1.22.6
  Os: linux
  OsArch: linux/amd64
  Version: 5.2.2

Podman in a container

Privileged Or Rootless

Rootless

Upstream Latest Release

Additional environment details

Tested with stock Fedora packages.

Additional information

Deterministic on a fresh Fedora server installation.

Luap99 commented 5 days ago

@sbrivio-rh @dgibson PTAL

sbrivio-rh commented 5 days ago

With podman-5.2.2-1.fc40.x86_64 with passt-0^20240906.g6b38f07-1.fc40.x86_64 I see a change of behaviour -- the process in the container is reachable on the published port on the host even if the process in the container is supposed to only listen on 127.0.0.1.

If the process in the container listens on 127.0.0.1, it will only be accessible via 127.0.0.1 on the loopback interface, and not via the external interface:

$ (sleep 1; : | nc 127.0.0.1 1337) & ./pasta --config-net -- sh -c '( socat TCP-LISTEN:1337,bind=127.0.0.1 STDOUT & tshark -Pi lo)'
[2] 2036258
Running as user "root" and group "root". This could be dangerous.
Capturing on 'Loopback: lo'
 ** (tshark:2) 13:38:02.726935 [Main MESSAGE] -- Capture started.
 ** (tshark:2) 13:38:02.726976 [Main MESSAGE] -- File: "/tmp/wireshark_loDXLVU2.pcapng"
2024/09/24 13:38:03 socat[3] W address is opened in read-write mode but only supports write-only
    1 0.000000000    127.0.0.1 → 127.0.0.1    TCP 74 57870 → 1337 [SYN] Seq=0 Win=65535 Len=0 MSS=65495 SACK_PERM TSval=358694077 TSecr=0 WS=4096
    2 0.000006130    127.0.0.1 → 127.0.0.1    TCP 74 1337 → 57870 [SYN, ACK] Seq=0 Ack=1 Win=65483 Len=0 MSS=65495 SACK_PERM TSval=358694077 TSecr=358694077 WS=4096
    3 0.000013970    127.0.0.1 → 127.0.0.1    TCP 66 57870 → 1337 [ACK] Seq=1 Ack=1 Win=65536 Len=0 TSval=358694077 TSecr=358694077

and this is the expected behaviour for pasta, because it handles both the loopback path (same as the rootlesskit port forwarder) in the container as well as the non-loopback path (same as the slirp4netns port forwarder).

So, if no address is given explicitly for --publish / -p, pasta binds the port to any address (which looks rather convenient and actually correct to me), and then picks the appropriate container interface depending on where the packet comes from.

But I see now that rootlessport and slirp4netns don't actually map host loopback traffic, so this is surely inconsistent.

I thought that since rootlessport uses 127.0.0.1 and ::1 as source addresses, it would also map the connections that should have those as source addresses, but no, it binds only all the other ones.

I guess we have four options:

* document this inconsistency in both Podman and pasta documentation. Easy, but it's still inconsistent (especially compared to containers started as root)
* document this inconsistency and change the behaviour for containers started as root, so that the inconsistency will become less relevant over time. I'm not sure how the bridge thing works and if it's possible to change this though
* also accept IP address exclusions in pasta's --tcp-ports / -t and --udp-ports / -u options (not just port exclusion), and, from Podman, exclude loopback addresses by default, so that users can override this behaviour by explicitly listening to 0.0.0.0. This is slightly complicated by the fact that IPv4 has a whole /8 subnet of loopback addresses, so it's not feasible to implement this correctly, strictly speaking: pasta would bind to all available local addresses, without binding to loopback addresses at all if just one is excluded. If you have 192.0.2.1 and 198.51.100.1 configured on the host, and you exclude 127.0.0.2, only 192.0.2.1 and 198.51.100.1 will be mapped, but not the rest of 127.0.0.0/8
* implement a completely different option for loopback mappings from the host, say, --loopback-tcp-ports, and -t wouldn't imply it. Relatively easy, but it's very impractical for users who already got used to the fact that -t magically works for both local host and remote hosts
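To put a number on the "/8 subnet of loopback addresses" point in the third option, here is a small sketch using Python's standard `ipaddress` module. The `is_loopback` wrapper is a hypothetical helper written for this illustration and says nothing about how pasta itself classifies addresses.

```python
import ipaddress

loopback_v4 = ipaddress.ip_network("127.0.0.0/8")

# The IPv4 loopback block holds 2**24 addresses, far too many to bind one by one.
print(loopback_v4.num_addresses)


def is_loopback(addr: str) -> bool:
    """Return True if addr is an IPv4 or IPv6 loopback address."""
    return ipaddress.ip_address(addr).is_loopback


# Addresses mentioned in the discussion, plus the IPv6 loopback address.
for addr in ("127.0.0.1", "127.0.0.2", "::1", "192.0.2.1", "198.51.100.1"):
    print(addr, is_loopback(addr))
```

Excluding a single address such as 127.0.0.2 while still covering the rest of the block would mean enumerating the remaining addresses, which is why the option is described as not feasible to implement correctly.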

What do you all think?

Luap99 commented 5 days ago

I thought that since rootlessport uses 127.0.0.1 and ::1 as source addresses, it would also map the connections that should have those as source addresses, but no, it binds only all the other ones.

There was CVE-2021-20199 about that, as some applications somehow trust localhost (even though this is not secure at all, since all users can access localhost), so yeah, since then we always make sure the source IP is not 127.0.0.1. Of course, in that case it was bad because even remote connections appeared as 127.0.0.1. For pasta, if only 127.0.0.1 on the host maps to 127.0.0.1 in the container, then this is likely not a big deal.

document this inconsistency and change the behaviour for containers started as root, so that the inconsistency will become less relevant over time. I'm not sure how the bridge thing works and if it's possible to change this though

It is not possible to change this: if the application binds to 127.0.0.1 then there is simply no way to get packets there from another namespace AFAICT (well, without a user space proxy). As we forward via the firewall, the packets always go to the eth0 address in the container.

Overall I think it is a fair assumption that binding to 127.0.0.1 means no external connections should be made to that address, and pasta breaks this assumption by allowing connections from the host namespace 127.0.0.1. I guess it should not matter because a user still has to forward the port via podman/pasta in the first place, and if the application listens on 127.0.0.1 this is likely a misconfiguration on the user's part. The only real point here could be that one app binds to 127.0.0.1:X and another binds to the interface address (e.g. 192.168.1.2:X), but this seems rather unlikely for container scenarios.

also accept IP address exclusions in pasta's --tcp-ports / -t and --udp-ports / -u options (not just port exclusion), and, from Podman, exclude loopback addresses by default, so that users can override this behaviour by explicitly listening to 0.0.0.0. This is slightly complicated by the fact that IPv4 has a whole /8 subnet of loopback addresses, so it's not feasible to implement this correctly, strictly speaking: pasta would bind to all available local addresses, without binding to loopback addresses at all if just one is excluded. If you have 192.0.2.1 and 198.51.100.1 configured on the host, and you exclude 127.0.0.2, only 192.0.2.1 and 198.51.100.1 will be mapped, but not the rest of 127.0.0.0/8

That doesn't seem reasonable to me.

The original report mentions that this used to work so can you clarify if pasta changed this behaviour or if pasta always worked that way?

sbrivio-rh commented 5 days ago

The original report mentions that this used to work so can you clarify if pasta changed this behaviour or if pasta always worked that way?

It worked that way since the very beginning of pasta. The term of comparison is, quoting, "podman-4.9.4-1.fc39.x86_64 with rootlessport".

sbrivio-rh commented 5 days ago

Overall I think it is a fair assumption that binding to 127.0.0.1 means no external connections should be made to that address, and pasta breaks this assumption by allowing connections from the host namespace 127.0.0.1.

...for some definitions of "external", yes.

I guess it should not matter because a user still has to forward the port via podman/pasta in the first place, and if the application listens on 127.0.0.1 this is likely a misconfiguration on the user's part. The only real point here could be that one app binds to 127.0.0.1:X and another binds to the interface address (e.g. 192.168.1.2:X), but this seems rather unlikely for container scenarios.

Right. In that case, by the way, a user can still bind ports to specific interfaces (using pasta-only options at the moment).

also accept IP address exclusions in pasta's --tcp-ports / -t and --udp-ports / -u options (not just port exclusion), and, from Podman, exclude loopback addresses by default, so that users can override this behaviour by explicitly listening to 0.0.0.0. This is slightly complicated by the fact that IPv4 has a whole /8 subnet of loopback addresses, so it's not feasible to implement this correctly, strictly speaking: pasta would bind to all available local addresses, without binding to loopback addresses at all if just one is excluded. If you have 192.0.2.1 and 198.51.100.1 configured on the host, and you exclude 127.0.0.2, only 192.0.2.1 and 198.51.100.1 will be mapped, but not the rest of 127.0.0.0/8

That doesn't seem reasonable to me.

Yes another option, perhaps more reasonable, would be to implement an option disabling "spliced" inbound connections altogether (something like an explicit, reversed, -T none). That doesn't break Podman and pasta users relying on the current behaviour, but gives the possibility to keep ports private without having to specify interfaces or addresses for each one.

Luap99 commented 5 days ago

Overall I think it is a fair assumption that binding to 127.0.0.1 means no external connections should be made to that address, and pasta breaks this assumption by allowing connections from the host namespace 127.0.0.1.

...for some definitions of "external", yes.

Right, it is arbitrary what external means here: a different host or a different namespace. As long as the different-host case is covered I don't see any security issues, so I don't mind how it behaves.

I guess it should not matter because a user still has to forward the port via podman/pasta in the first place, and if the application listens on 127.0.0.1 this is likely a misconfiguration on the user's part. The only real point here could be that one app binds to 127.0.0.1:X and another binds to the interface address (e.g. 192.168.1.2:X), but this seems rather unlikely for container scenarios.

Right. In that case, by the way, a user can still bind ports to specific interfaces (using pasta-only options at the moment).

I am talking about the interface inside the container netns, that would totally depend on the application inside not on any podman/pasta options.

also accept IP address exclusions in pasta's --tcp-ports / -t and --udp-ports / -u options (not just port exclusion), and, from Podman, exclude loopback addresses by default, so that users can override this behaviour by explicitly listening to 0.0.0.0. This is slightly complicated by the fact that IPv4 has a whole /8 subnet of loopback addresses, so it's not feasible to implement this correctly, strictly speaking: pasta would bind to all available local addresses, without binding to loopback addresses at all if just one is excluded. If you have 192.0.2.1 and 198.51.100.1 configured on the host, and you exclude 127.0.0.2, only 192.0.2.1 and 198.51.100.1 will be mapped, but not the rest of 127.0.0.0/8

That doesn't seem reasonable to me.

Yes another option, perhaps more reasonable, would be to implement an option disabling "spliced" inbound connections altogether (something like an explicit, reversed, -T none). That doesn't break Podman and pasta users relying on the current behaviour, but gives the possibility to keep ports private without having to specify interfaces or addresses for each one.

I guess there is a good reason for the splice path, speed mostly? I would think most users prefer that.

@adelton I'd like to better understand your actual use case here. Why are you forwarding the port but then binding to 127.0.0.1 inside and not wanting the connection to work?

adelton commented 5 days ago

I was happily using the setup with 127.0.0.1 in the container on my Fedoras because I installed pasta a couple of releases back.

And then I spent three hours investigating why the thing which does work on my Fedoras (connect to that container from the host) does not work on GitHub Actions Ubuntu runners. Searching around and in man pages did not suggest it should be happening. It was only when I got a fresh Fedora 39 VM and tried the setup from scratch that I got the difference in behaviour demonstrated.

So it's not so much about what I want; I frankly don't mind the rootless pasta behaviour. It's mainly the inconsistency with both the rootful and rootlessport behaviour that has bitten me. And given that this could lead to some endpoints now being exposed where they previously were not, which is a potential security concern, I thought I'd report it as an issue. I guess some note in the documentation would work if functional parity with rootful setups is not desired or not practical.

sbrivio-rh commented 5 days ago

Yes another option, perhaps more reasonable, would be to implement an option disabling "spliced" inbound connections altogether (something like an explicit, reversed, -T none). That doesn't break Podman and pasta users relying on the current behaviour, but gives the possibility to keep ports private without having to specify interfaces or addresses for each one.

I guess there is a good reason for the splice path, speed mostly? I would think most users prefer that.

Yes, that, as we get pretty much host-native throughput on that path.

Maybe we'll achieve something similar with VDUSE which might make that more or less obsolete (the tap interface is quite a hurdle for performance), but it will take time.

I was happily using the setup with 127.0.0.1 in the container on my Fedoras because I installed pasta a couple of releases back.

And then I spent three hours investigating why the thing which does work on my Fedoras (connect to that container from the host) does not work on GitHub Actions Ubuntu runners. Searching around and in man pages did not suggest it should be happening. It was only when I got a fresh Fedora 39 VM and tried the setup from scratch that I got the difference in behaviour demonstrated.

Oh, so things that are working now weren't working before. The inconsistency stands and needs to be solved somehow, but this is another bit of information showing us that we need to be careful to avoid breaking things.

dgibson commented 4 days ago

[snip]

So, if no address is given explicitly for --publish / -p, pasta binds the port to any address (which looks rather convenient and actually correct to me), and then picks the appropriate container interface depending on where the packet comes from.

But I see now that rootlessport and slirp4netns don't actually map host loopback traffic, so this is surely inconsistent.

That might be true, but I think it's missing the point. The question is not about host loopback, but about container loopback. The point is that things bound to container loopback are accessible from outside the container, which is indeed surprising. It's mitigated because they're only accessible from host loopback, but it's still odd, and arguably is a security problem because it allows unrelated users on the host to access ports that the container thinks are private to itself.

However, I don't think it's as hard to fix as you outline. This is AFAICT, entirely about "spliced" connections - that's the only way we even can reach loopback bound ports within the container. So, I think all we need to do to fix it is:

* Make (inbound) spliced connections `connect()` to `addr_seen` instead of to loopback.

Because addr_seen is a local address of the container the traffic will still go over the container's lo interface, but services bound to loopback addresses will no longer respond to it.
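The underlying socket behaviour can be seen outside of pasta with a small sketch. This is an illustration under assumptions, not pasta's code: 127.0.0.2 is used purely as a stand-in for "some other local address" (in the real proposal that would be the container's `addr_seen`), and it relies on Linux treating all of 127.0.0.0/8 as local. The helper names are hypothetical.

```python
import socket


def listen_on(bind_addr: str) -> socket.socket:
    """Open a TCP listener on bind_addr with a kernel-chosen port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, 0))
    srv.listen(5)
    return srv


def can_connect(dest_addr: str, port: int) -> bool:
    """Return True if a TCP connection to dest_addr:port succeeds."""
    try:
        with socket.create_connection((dest_addr, port), timeout=1.0):
            return True
    except OSError:
        return False


# Assumption: stand-in for a second local address such as addr_seen.
OTHER_LOCAL = "127.0.0.2"

loopback_only = listen_on("127.0.0.1")  # like a service bound to 127.0.0.1
wildcard = listen_on("0.0.0.0")         # like a service bound to any address
lo_port = loopback_only.getsockname()[1]
any_port = wildcard.getsockname()[1]

print("127.0.0.1 -> loopback-only service:", can_connect("127.0.0.1", lo_port))
print("other local -> loopback-only service:", can_connect(OTHER_LOCAL, lo_port))
print("other local -> wildcard service:", can_connect(OTHER_LOCAL, any_port))
```

On Linux, the second check is expected to fail while the other two succeed: the destination address of the connection decides whether a service bound only to 127.0.0.1 answers, even though all of the traffic stays on `lo`.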

There are some real questions about access to the host loopback address via outbound spliced connections, but that's not what this issue is about.

sbrivio-rh commented 4 days ago

[snip]

So, if no address is given explicitly for --publish / -p, pasta binds the port to any address (which looks rather convenient and actually correct to me), and then picks the appropriate container interface depending on where the packet comes from. But I see now that rootlessport and slirp4netns don't actually map host loopback traffic, so this is surely inconsistent.

That might be true, but I think it's missing the point. The question is not about host loopback, but about container loopback.

Well, it's about both in the sense I meant (and thought was desirable... and maybe it even is): you connect to host's loopback, and if it's mapped, it maps to the container's loopback as well. The other way, with -T, you have a symmetric behaviour.

The point is that things bound to container loopback are accessible from outside the container, which is indeed surprising.

Not to me! We splice using the loopback interface in the container. I think it's also implied by the "Handling of local traffic in pasta" section of the man page, even though surely not explicit.

It's mitigated because they're only accessible from host loopback, but it's still odd, and arguably is a security problem because it allows unrelated users on the host to access ports that the container thinks are private to itself.

...not so clearly in my opinion: the ports are exposed with --publish.

However, I don't think it's as hard to fix as you outline. This is AFAICT, entirely about "spliced" connections - that's the only way we even can reach loopback bound ports within the container. So, I think all we need to do to fix it is:
* Make (inbound) spliced connections `connect()` to `addr_seen` instead of to loopback.

That's a nice idea, and I guess it has relatively low chances of breaking things, but they would still break for users who assumed that binding to 127.0.0.1 in the container and exposing that port would make it visible from the host (see https://github.com/containers/podman/issues/24045#issuecomment-2371470404).

Because addr_seen is a local address of the container the traffic will still go over the container's lo interface, but services bound to loopback addresses will no longer respond to it.

Right.

There are some real questions about access to the host loopback address via outbound spliced connections, but that's not what this issue is about.

Sure, that's another matter. But accessing ports bound to a loopback address in the container should be at least optional. I'm almost convinced we can make it an opt-in and it's unlikely that we'll break any usage, but we need to have a way to fix that quickly, in case.

sbrivio-rh commented 4 days ago

Patch series and related discussion at https://archives.passt.top/passt-dev/20240925065436.2064995-1-david@gibson.dropbear.id.au/ by the way.

dgibson commented 4 days ago

[snip]

So, if no address is given explicitly for --publish / -p, pasta binds the port to any address (which looks rather convenient and actually correct to me), and then picks the appropriate container interface depending on where the packet comes from. But I see now that rootlessport and slirp4netns don't actually map host loopback traffic, so this is surely inconsistent.

That might be true, but I think it's missing the point. The question is not about host loopback, but about container loopback.

Well, it's about both in the sense I meant (and thought was desirable... and maybe it even is): you connect to host's loopback, and if it's mapped, it maps to the container's loopback as well. The other way, with -T, you have a symmetric behaviour.

Well, obviously there could be use cases, but I really don't think this would be the expected behaviour. It's so completely unlike any other networking model (physical, rootful, and it seems slirp too). If you really want to share a lo with the host, that seems like a case where you don't want a network namespace.

The point is that things bound to container loopback are accessible from outside the container, which is indeed surprising.

Not to me! We splice using the loopback interface in the container. I think it's also implied by the "Handling of local traffic in pasta" section of the man page, even though surely not explicit.

I don't really think it's implied by that. As my draft patch demonstrates, it certainly need not be the case, even with traffic over lo, and again the pseudo-shared lo model is completely unlike the setups that are likely to form people's mental models.

It's mitigated because they're only accessible from host loopback, but it's still odd, and arguably is a security problem because it allows unrelated users on the host to access ports that the container thinks are private to itself.

...not so clearly in my opinion: the ports are exposed with --publish.

Yeah, that also mitigates it. The container could still have different servers running on the same port on loopback and non-loopback addresses. Or it could have a server on 0.0.0.0 that changes behaviour depending on whether getpeername() reports loopback. In those cases -t auto would expose the loopback version to the host, which seems really surprising (different from passt as well as all other networking configs).
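A server of the second kind might look roughly like the sketch below. It is a hypothetical example (the "admin view" / "public view" responses and all helper names are invented for illustration), meant only to show why the peer address reported by `getpeername()` matters when connections arrive over the container's loopback.

```python
import ipaddress
import socket
import threading


def handle(conn: socket.socket) -> None:
    """Reply differently depending on whether the peer address is loopback."""
    peer_ip = conn.getpeername()[0]
    if ipaddress.ip_address(peer_ip).is_loopback:
        conn.sendall(b"admin view\n")  # hypothetical "trusted local client" reply
    else:
        conn.sendall(b"public view\n")
    conn.close()


def serve_once(srv: socket.socket) -> None:
    """Accept a single connection and answer it."""
    conn, _ = srv.accept()
    handle(conn)


def ask(host: str, port: int) -> bytes:
    """Connect, read the short reply, and return it."""
    with socket.create_connection((host, port), timeout=2.0) as sock:
        return sock.recv(64)


srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("0.0.0.0", 0))  # listens on all addresses, like the server described
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=serve_once, args=(srv,), daemon=True).start()

# A client arriving from 127.0.0.1 is treated as a trusted local peer.
print(ask("127.0.0.1", port))
```

If spliced inbound connections reach such a server with a loopback source address, a client on the host would get the loopback-only response, which is the surprising case being described.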

However, I don't think it's as hard to fix as you outline. This is AFAICT, entirely about "spliced" connections - that's the only way we even can reach loopback bound ports within the container. So, I think all we need to do to fix it is:
* Make (inbound) spliced connections `connect()` to `addr_seen` instead of to loopback.

That's a nice idea, and I guess it has relatively low chances of breaking things, but they would still break for users who assumed that binding to 127.0.0.1 in the container and exposing that port would make it visible from the host (see #24045 (comment)).

Well, sure, but I'd argue that was a flawed assumption that just happened to work because of a pasta bug. Witness its total non-portability.

Because addr_seen is a local address of the container the traffic will still go over the container's lo interface, but services bound to loopback addresses will no longer respond to it.

Right.

There are some real questions about access to the host loopback address via outbound spliced connections, but that's not what this issue is about.

Sure, that's another matter. But accessing ports bound to a loopback address in the container should be at least optional. I'm almost convinced we can make it an opt-in and it's unlikely that we'll break any usage, but we need to have a way to fix that quickly, in case.

Sure, it's pretty easy to make it an option.

sbrivio-rh commented 4 days ago

[snip]

So, if no address is given explicitly for --publish / -p, pasta binds the port to any address (which looks rather convenient and actually correct to me), and then picks the appropriate container interface depending on where the packet comes from. But I see now that rootlessport and slirp4netns don't actually map host loopback traffic, so this is surely inconsistent.

That might be true, but I think it's missing the point. The question is not about host loopback, but about container loopback.

Well, it's about both in the sense I meant (and thought was desirable... and maybe it even is): you connect to host's loopback, and if it's mapped, it maps to the container's loopback as well. The other way, with -T, you have a symmetric behaviour.

Well, obviously there could be use cases, but I really don't think this would be the expected behaviour. It's so completely unlike any other networking model (physical, rootful, and it seems slirp too). If you really want to share a lo with the host, that seems like a case where you don't want a network namespace.

It's not shared in general, it's just one port being forwarded, for a specific Layer-4 protocol.

The point is that things bound to container loopback are accessible from outside the container, which is indeed surprising.

Not to me! We splice using the loopback interface in the container. I think it's also implied by the "Handling of local traffic in pasta" section of the man page, even though surely not explicit.

I don't really think it's implied by that. As my draft patch demonstrates, it certainly need not be the case, even with traffic over lo, and again the pseudo-shared lo model is completely unlike the setups that are likely to form people's mental models.

...unless you see the "spliced" path as a loopback bypass, which is, at least, what I had in mind when I implemented it, and how I use it sometimes. This plus https://github.com/containers/podman/issues/24045#issuecomment-2371470404 already makes two users...

It's mitigated because they're only accessible from host loopback, but it's still odd, and arguably is a security problem because it allows unrelated users on the host to access ports that the container thinks are private to itself.

...not so clearly in my opinion: the ports are exposed with --publish.

Yeah, that also mitigates it. The container could still have different servers running on the same port on loopback and non-loopback addresses. Or it could have a server on 0.0.0.0 that changes behaviour depending on whether getpeername() reports loopback. In those cases -t auto would expose the loopback version to the host, which seems really surprising (different from passt as well as all other networking configs).

True, in this case it's definitely surprising.

However, I don't think it's as hard to fix as you outline. This is AFAICT, entirely about "spliced" connections - that's the only way we even can reach loopback bound ports within the container. So, I think all we need to do to fix it is:
* Make (inbound) spliced connections `connect()` to `addr_seen` instead of to loopback.

That's a nice idea, and I guess it has relatively low chances of breaking things, but they would still break for users who assumed that binding to 127.0.0.1 in the container and exposing that port would make it visible from the host (see #24045 (comment)).

Well, sure, but I'd argue that was a flawed assumption that just happened to work because of a pasta bug. Witness its total non-portability.

It's a bug I added tests for... I'd call it a feature, really. Originally, I was thinking of adding something symmetric to -T, which would only work for the loopback bypass, separated from -t, but then I thought that three options to forward ports would be too many.

Because addr_seen is a local address of the container the traffic will still go over the container's lo interface, but services bound to loopback addresses will no longer respond to it.

Right.

There are some real questions about access to the host loopback address via outbound spliced connections, but that's not what this issue is about.

Sure, that's another matter. But accessing ports bound to a loopback address in the container should be at least optional. I'm almost convinced we can make it an opt-in and it's unlikely that we'll break any usage, but we need to have a way to fix that quickly, in case.

Sure, it's pretty easy to make it an option.

Okay, yes, I would be fine with it, and I'm convinced it's an improvement over the current situation especially given the scenario where one might bind the same port to loopback and non-loopback addresses in the container, which is not supported at the moment.

adelton commented 4 days ago

That's a nice idea, and I guess it has relatively low chances of breaking things, but they would still break for users who assumed that binding to 127.0.0.1 in the container and exposing that port would make it visible from the host (see #24045 (comment)).

Well, sure, but I'd argue that was a flawed assumption that just happened to work because of a pasta bug. Witness its total non-portability.

I confirm that in my case, rather than explicitly assuming something about the exposure of 127.0.0.1 in the container, I did not really think about it when it happened to work on my Fedora setup without modifications. 127.0.0.1 is the address which Kind uses by default to expose its API server, and in my work on https://github.com/adelton/kind-in-pod I just went with minimal changes to the defaults.