containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Default route confusion when using multiple `--network` options with `macvlan` and `bridge` networks #23984

Open codedump opened 3 weeks ago

codedump commented 3 weeks ago

Issue Description

Consider the following macvlan network bound to a physical interface:

$ podman network inspect v6pub
          ...
          "name": "v6pub",
          "driver": "macvlan",
          "network_interface": "br42s0",
          "subnets": [
               {
                    "subnet": "2a02:2f0f:2f7:1::/64",
                    "gateway": "2a02:2f0f:2f7:1::1",
                    "lease_range": {
                         "start_ip": "2a02:2f0f:2f7:1:c:0:78:1",
                         "end_ip": "2a02:2f0f:2f7:1:c:0:78:ffff"
                    }
               }
          ],
          "ipv6_enabled": true,
          "internal": false,
          ...

And the following Podman-internal bridge network:

$ podman network inspect v46bridge
          ...
          "name": "v46bridge",
          "driver": "bridge",
          "network_interface": "podman1",
          "subnets": [
               {
                    "subnet": "10.89.0.0/24",
                    "gateway": "10.89.0.1"
               },
               {
                    "subnet": "fdb9:4911:2cf5:8ab9::/64",
                    "gateway": "fdb9:4911:2cf5:8ab9::1"
               }
          ],
          "ipv6_enabled": true,
          "internal": false
          ...

And the following container:

$ podman run -ti --rm --network v6pub:ip6=2a02:2f0f:2f7:1:f::78:137 --network v46bridge alpine:latest

Then the routing table will be very confusing (we'll take a peek at the routing table using the host's "ip" utility via "nsenter", because the container's watered-down busybox-based "ip" doesn't show all the details):

$ LAST_CID=$(crun --root=/run/crun list | grep $(podman ps | tail -n 1 | cut -f1 -d\ ) | cut -f 2 -d\ )

$ nsenter -t $LAST_CID --net ip -6 r s
2a02:2f0f:2f7:1::/64 dev eth0 proto kernel metric 256 pref medium
fdb9:4911:2cf5:8ab9::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
default proto static metric 100 pref medium
        nexthop via 2a02:2f0f:2f7:1::1 dev eth0 weight 1 
        nexthop via fdb9:4911:2cf5:8ab9::1 dev eth1 weight 1 
default via fe80::e68d:8cff:feb5:42fb dev eth0 proto ra metric 1024 expires 1496sec pref medium

The problem is that I have two default routes: one via eth0, which is the one I want, and another one via eth1, whose gateway is a local (ULA) address that would usually get NAT'ed/MASQUERADE'd. (There's actually a third one, via the link-local gateway on eth0.)
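To make the ambiguity easy to spot, the number of default-route entries can be counted mechanically. A small sketch (my own helper, not part of Podman) run against the sample table above; in a live container you would pipe the output of `nsenter -t "$PID" --net ip -6 route show` instead:

```shell
# Sample 'ip -6 route show' output from the container above.
# The indented 'nexthop' lines belong to the single multipath entry,
# so they do not start a new default route.
routes='2a02:2f0f:2f7:1::/64 dev eth0 proto kernel metric 256 pref medium
fdb9:4911:2cf5:8ab9::/64 dev eth1 proto kernel metric 256 pref medium
default proto static metric 100 pref medium
        nexthop via 2a02:2f0f:2f7:1::1 dev eth0 weight 1
        nexthop via fdb9:4911:2cf5:8ab9::1 dev eth1 weight 1
default via fe80::e68d:8cff:feb5:42fb dev eth0 proto ra metric 1024 pref medium'

# Count lines that begin a default-route entry.
n=$(printf '%s\n' "$routes" | grep -c '^default')
echo "default route entries: $n"
```

More than one entry means the egress interface is ambiguous, which is exactly the situation described here.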

This is OK if the container only initiates communications to the outside (ping google.com, ...): it doesn't really matter through which interface the packets go out.

But if the container works as a server, e.g. an HTTP reverse proxy, all kinds of things go wrong: with equal-metric default routes, return traffic can leave via eth1 and get masqueraded, so connections from outside become unreliable.

The hot fix:

$ nsenter -t $LAST_CID --net ip -6 r d default dev eth1

This essentially removes all default routes except for the link-local one. The new table looks like this:

2a02:2f0f:2f7:1::/64 dev eth0 proto kernel metric 256 pref medium
fdb9:4911:2cf5:8ab9::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via fe80::e68d:8cff:feb5:42fb dev eth0 proto ra metric 1024 expires 1759sec pref medium

... which is not perfect, but the connection now works as expected.

A better solution would be this, but it's more difficult to implement manually:

$ nsenter -t $LAST_CID --net ip -6 r d default via fdb9:4911:2cf5:8ab9::1 dev eth1
$ nsenter -t $LAST_CID --net ip -6 r d default via fe80::e68d:8cff:feb5:42fb dev eth0

...which leaves a clean routing table via the public IP:

2a02:2f0f:2f7:1::/64 dev eth0 proto kernel metric 256 pref medium
fdb9:4911:2cf5:8ab9::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
default via 2a02:2f0f:2f7:1::1 dev eth0 proto static metric 100 pref medium

In essence I need to enter the container and manipulate the routing table from within, and I'm at a loss as to how this could be done elegantly when starting the container. The only solution I can come up with is to create the v46bridge network with --internal, which means that it won't contribute a default route.

But this leaves me with another problem: I also have containers that are started only within v46bridge (to be able to communicate with other containers in the same network via host-name resolution), are deliberately not included in v6pub (because they contain sensitive services not to be exposed directly), and still might need to reach the internet to download stuff -- which they then couldn't, for lack of a default route.
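One way the manual hot fix could be wrapped so it runs right after container start; `fix_default_route` is a name I made up for this sketch, and it uses `podman inspect -f '{{.State.Pid}}'` to obtain the container's init PID (which is what `nsenter -t` expects) instead of grepping crun's list. Requires root:

```shell
# Hypothetical helper (not a Podman feature): delete the unwanted
# IPv6 default route on a given interface inside a running
# container's network namespace.
fix_default_route() {
    local cid dev pid
    cid=$1
    dev=${2:-eth1}
    # Resolve the container's init PID for nsenter -t.
    pid=$(podman inspect -f '{{.State.Pid}}' "$cid") || return 1
    nsenter -t "$pid" --net ip -6 route del default dev "$dev"
}

# Usage (as root), matching the setup above:
#   cid=$(podman run -d --network v6pub --network v46bridge alpine:latest sleep 1h)
#   fix_default_route "$cid" eth1
```

This is still a workaround applied after the fact, not a clean solution at container-creation time.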

Steps to reproduce the issue

  1. Create a bridge and a macvlan network (I'm not sure whether the IPv4 part plays a role, but it's part of my setup, so I'm using it):

    podman network create v6pub \
        -d macvlan \
        --interface-name eth4 \
        --ipv6 \
        --ip-range 2a02:2f0f:2f7:1:e::/80 \
        --subnet 2a02:2f0f:2f7:1::/64 \
        --gateway 2a02:2f0f:2f7:1::1
    
    podman network create v46bridge --ipv6
  2. Start a container within both of those networks, assign a static IPv6, start a test server:
    podman run -ti --rm --net v6pub:ip6=2a02:2f0f:2f7:1:e::137 --net v46bridge alpine:latest sh -c "echo Testserver | nc -l -p 2345"
  3. Verify that you can ping the test server from the host, but not (always) from an outside machine; furthermore, you can't (always) connect to the test server from an outside machine:
    $ ping 2a02:2f0f:2f7:1:e::137
    $ telnet 2a02:2f0f:2f7:1:e::137 2345
  4. Manipulate the IP routing table to remove extra default entries:
    export LAST_CID=$(crun --root=/run/crun list | grep $(podman ps | tail -n 1 | cut -f1 -d\ ) | cut -f 2 -d\ )
    nsenter -t $LAST_CID --net ip -6 route del default dev eth1
  5. Verify that the connection now works as expected
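For steps 3 and 5, "works as expected" can be quantified. A small hypothetical probe (`probe` is my name, not from the report; it assumes an OpenBSD-style `nc` with `-z`/`-w`) that attempts a TCP connect several times and reports the success rate -- before the route fix it tends to be well below N, afterwards it should be N out of N:

```shell
# Hypothetical reliability probe: try to connect N times and report
# how many attempts succeeded.
probe() {
    host=$1; port=$2; n=${3:-10}; ok=0; i=1
    while [ "$i" -le "$n" ]; do
        # -z: scan only, don't send data; -w 2: two-second timeout.
        if nc -z -w 2 "$host" "$port" 2>/dev/null; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok/$n connections succeeded"
}

# Usage, against the test server from step 2:
#   probe 2a02:2f0f:2f7:1:e::137 2345 10
```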

Describe the results you received

Unreliable connectivity via ping, and even less reliable behaviour with "regular" TCP/IP applications, from hosts on networks other than the container host's. "tracepath", however, works flawlessly.

Describe the results you expected

Flawless connection in all cases.

podman info output

On a current Fedora CoreOS stable:

root@geryon:~# uname -a
Linux geryon 6.10.6-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 19 14:09:30 UTC 2024 x86_64 GNU/Linux
root@geryon:~# podman --version
podman version 5.2.1

Additional environment details

I did this on a machine connected to the internet; at the very least it should be something where there are multiple networks (at least two).

It's also possible that the connection issues are related to the router.

I'm using a Mikrotik router with MikroTik RouterOS 6.49.17, and, for your consideration the routing table looks like this:

> ipv6 route print 
Flags: X - disabled, A - active, D - dynamic, C - connect, S - static, r - rip, o - ospf, b - bgp, U - unreachable 
       DST-ADDRESS              GATEWAY                  DISTANCE
 0 A S  ::/0                     fe80::d161%ether1-uplink        1
 1 ADC  2a02:2f0f:2f7:1::/64     pub-bridge                      0
 2 ADC  2a02:2f0f:2f7:1043::/64  lan-bridge                      0

The pub-bridge here represents the interfaces connected to the corresponding IPv6 network. So to my understanding there really is nothing here that should prevent proper routing -- given that every single physical machine with an IP in the corresponding range, including the container host, has no connection problems whatsoever to the outside world through the MikroTik gateway.

I'd argue that even if this is a weird MikroTik IPv6 routing issue (is it?), Podman is still on the hook for its inability to control the default route when using more than one network -- which would apparently mitigate the problem in this case :-p

Additional information

Owing to the fairly complex setup, the "steps to reproduce" above are not an exact copy & paste. I've reconstructed them from my working setup as well as I could, and I apologize in advance for any problems or typos I may have inadvertently introduced.

The general idea should be clear: when using two or more --network options with podman run, Podman appears to be confused as to which network should provide the default route. With macvlan networks this leads to a catastrophic inability to reliably serve connections.

Luap99 commented 3 weeks ago

I am not sure if there is something special about IPv6, but what we do is add a default route for each network, with the same settings by default, and the kernel seems to apply some round-robin logic to route between them, which is far from perfect when macvlan and bridge are mixed. One of the reasons is that we do not know which network the user prefers. What you can do is set --opt metric=xx on the macvlan network; then the routing priority is no longer the same and traffic is routed based on the metric.

codedump commented 3 weeks ago

What you can do is set --opt metric=xx on the macvlan network; then the routing priority is no longer the same and traffic is routed based on the metric.

Thank you for the tip with the metric. It works as expected.

I understand what you're telling me about how Podman works (I assumed something like that). I am not sure whether any other options are needed, e.g. for disabling/enabling the default route on a per-container basis. But what you propose does fix my problem.

Come to think of it, it's actually documented in the podman-network-create manpage :-) I just didn't see it before.
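For reference, the metric fix could look like this when creating the macvlan network. The value 50 is arbitrary; it only needs to be below the default metric of 100 seen in the routing tables above, so that eth0's default route wins. The command is built and printed rather than executed, so the sketch stays side-effect free:

```shell
# Recreate the macvlan network with an explicit, lower route metric.
# '-o metric=' is the network option mentioned in the thread and
# documented in podman-network-create(1); values below Podman's
# default of 100 make this network's default route preferred.
cmd='podman network create v6pub -d macvlan --interface-name eth4 --ipv6 --subnet 2a02:2f0f:2f7:1::/64 --gateway 2a02:2f0f:2f7:1::1 -o metric=50'
echo "$cmd"
```

Alternatively, the bridge network could be given a metric above 100 instead; either way the two default routes end up with distinct priorities and the kernel stops round-robining between them.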

Luap99 commented 3 weeks ago

For multiple bridge networks it doesn't really matter too much, as the traffic is masqueraded on the host anyway, but for macvlan this is certainly not very nice out-of-the-box behavior. I wonder if we should use a lower metric by default for macvlan/ipvlan compared to bridge to avoid such asymmetric routing behavior. @mheon WDYT? It would still be a problem for multiple macvlan networks on the same container, but that seems much less likely to happen.

mheon commented 3 weeks ago

Having a different default weight for macvlan/ipvlan and bridge interfaces seems like a very good idea for the exact reason you mentioned - would resolve a lot of potentially awkward behavior when using both types of networks at the same time.

Luap99 commented 3 weeks ago

The problem is that if we change the default for macvlan now, it might break users that already depend on the current default, e.g. a user with two macvlan networks, one with metric 99 and the other with the default 100: if we lower the macvlan default, the routing order changes unexpectedly. So I don't think we can do this without a major version bump.