AliyunContainerService / pouch

An Efficient Enterprise-class Container Engine
https://pouchcontainer.io
Apache License 2.0

[bug report] network p0 create failed when reboot #2618

Closed elvizlai closed 2 years ago

elvizlai commented 5 years ago

Ⅰ. Issue Description

The pouch p0 network interface is missing after a reboot.

Ⅱ. Describe what happened

Root VPC: some containers (not all) fail to start because the pouch p0 network interface is missing.

After a reboot, I MUST run systemctl restart pouch to recreate p0 and then start the containers manually with pouch start.

If any container can start on its own (--restart always), p0 is not created.

example:

pouch run -td --restart always --net host alpine

I think p0 MUST be created before any vethXXXX interface.
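The manual recovery described above could be scripted roughly as follows. This is only a sketch of the workaround, not a fix: it assumes the bridge is named p0, that the ip tool from iproute2 is installed, and that the names of the containers to start are passed in by the caller. The function names bridge_exists and recover_pouch_network are made up for illustration.

```shell
#!/bin/sh
# Sketch of the manual workaround: recreate p0 by restarting pouch,
# then start the containers that failed to come up.
# Assumptions: bridge name "p0", iproute2's `ip` is installed,
# container names are supplied by the caller.

BRIDGE="${BRIDGE:-p0}"

# Succeeds if the bridge interface currently exists.
bridge_exists() {
    ip link show "$BRIDGE" >/dev/null 2>&1
}

# Restart pouch only when the bridge is missing, then start each
# container named in the arguments.
recover_pouch_network() {
    if bridge_exists; then
        echo "$BRIDGE already exists, nothing to do"
        return 0
    fi
    echo "$BRIDGE is missing, restarting pouch"
    systemctl restart pouch || return 1
    for name in "$@"; do
        pouch start "$name" || echo "failed to start $name" >&2
    done
}
```

For the reproduction case in this issue it would be called as recover_pouch_network ikev2-vpn, as root, after the host has finished booting.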

ifconfig

p0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.5.1  netmask 255.255.255.0  broadcast 192.168.5.255
        inet6 fe80::42:c0ff:fea8:501  prefixlen 64  scopeid 0x20<link>
        ether 02:42:c0:a8:05:01  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6  bytes 516 (516.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethef648d6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::64bf:24ff:fecd:d276  prefixlen 64  scopeid 0x20<link>
        ether 66:bf:24:cd:d2:76  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12  bytes 1032 (1.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Ⅲ. Describe what you expected to happen

Ⅳ. How to reproduce it (as minimally and precisely as possible)

  1. run script
    pouch run -td \
    --restart=always \
    --privileged \
    --sysctl net.core.somaxconn=1024 \
    -v /lib/modules:/lib/modules \
    -e HOST_IP='x.y.z' \
    -e VPNUSER=jack \
    -e VPNPASS="opsAdmin" \
    -p 500:500/udp -p 4500:4500/udp \
    --name=ikev2-vpn \
    sdrzlyz/ikev2:5.7.1
  2. reboot
  3. pouch ps -a

The container is not started, although it was expected to be.

Ⅴ. Anything else we need to know?

systemctl status pouch -l

● pouch.service - pouch
   Loaded: loaded (/usr/lib/systemd/system/pouch.service; enabled; vendor preset: disabled)
   Active: active (running) since 三 2018-12-26 17:30:04 CST; 25s ago
 Main PID: 2505 (pouchd)
    Tasks: 17
   Memory: 76.7M
   CGroup: /system.slice/pouch.service
           ├─2505 /usr/local/bin/pouchd
           └─2960 containerd --config /var/lib/pouch/containerd/state/pouch-containerd.toml --log-level info

12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.077139796+08:00" level=info msg="Removing stale endpoint 84c27e99 (1f76dc0ce9f8b2dd2d7be0a102e29d0e332228a409aba0f94bceba8c8efdd8a1)"
12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.089374272+08:00" level=info msg="Fixing inconsistent endpoint_cnt for network bridge. Expected=0, Actual=1"
12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.108381128+08:00" level=warning msg="recover container 84c27e996704fbfb5bc21c23e600d05380447488073e1a1007dbc48cbf4d380b, got a notfound error, start clean the container's resources"
12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.135912093+08:00" level=warning msg="There are old containers, don't to initialize network"
12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.150457189+08:00" level=info msg="handle event: 84c27e996704fbfb5bc21c23e600d05380447488073e1a1007dbc48cbf4d380b exit"
12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.177390154+08:00" level=warning msg="Failed to delete host side interface (vethfe38454)'s link" error="no such device"
12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.180422683+08:00" level=error msg="failed to create endpoint: failed to create endpoint 84c27e99 on network bridge: adding interface vethfe38454 to bridge p0 failed: could not find bridge p0: route ip+net: no such network interface"
12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.19611468+08:00" level=error msg="failed to handle event: 84c27e996704fbfb5bc21c23e600d05380447488073e1a1007dbc48cbf4d380b exit"
12月 26 17:30:04 host.localdomain pouchd[2505]: time="2018-12-26T17:30:04.265322188+08:00" level=info msg="start to listen to: unix:///var/run/pouchd.sock"
12月 26 17:30:04 host.localdomain systemd[1]: Started pouch.
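To check whether a host is hitting the same failure, the daemon log can be scanned for the "could not find bridge" error shown above. The snippet below is a minimal sketch that relies only on the wording of that log message; the function name missing_bridges is hypothetical.

```shell
#!/bin/sh
# Reads log text on stdin and prints the name of every bridge that is
# reported as missing, one per line, without duplicates.
# Assumption: the error text matches the pouchd message quoted in this issue.
missing_bridges() {
    sed -n 's/.*could not find bridge \([^:]*\):.*/\1/p' | sort -u
}
```

A possible invocation is journalctl -u pouch -b | missing_bridges, which on the affected host would be expected to print p0.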

Ⅵ. Environment:

allencloud commented 5 years ago

Thanks a lot for your feedback. Could you attach the error or failure message in the issue description? @elvizlai

elvizlai commented 5 years ago

@allencloud I have updated the issue and appended the log.

elvizlai commented 5 years ago

journalctl

12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." module=containerd type=io.containerd.grpc.v1
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." module=containerd type=io.containerd.grpc.v1
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." module=containerd type=io.containerd.grpc.v1
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08+08:00" level=info msg=serving... address="/run/containerd/debug.sock" module="containerd/debug"
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08+08:00" level=info msg=serving... address="/var/run/containerd.sock" module="containerd/grpc"
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08+08:00" level=info msg="containerd successfully booted in 0.012541s" module=containerd
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08.763391573+08:00" level=info msg="success to start containerd" containerd-pid=3333 module=ctrd-supervisord
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08.768286594+08:00" level=info msg="success to create 5 containerd clients, connect to: /var/run/containerd.sock"
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08.76905849+08:00" level=info msg="Snapshotter is set to be overlayfs"
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08.769276734+08:00" level=info msg="invoke pre-start hook in plugin"
12月 26 17:34:08 host.localdomain pouchd[3326]: time="2018-12-26T17:34:08.854821156+08:00" level=warning msg="could not create bridge network for id 462d39135a6c114e13119f5874995dfc1e6cd505fd6abaee4e597c510c67fc51 bridge name p
12月 26 17:34:09 host.localdomain pouchd[3326]: time="2018-12-26T17:34:09.144878279+08:00" level=error msg="getEndpointFromStore for eid 1f76dc0ce9f8b2dd2d7be0a102e29d0e332228a409aba0f94bceba8c8efdd8a1 failed while trying to bu
12月 26 17:34:09 host.localdomain pouchd[3326]: time="2018-12-26T17:34:09.144940644+08:00" level=info msg="Removing stale sandbox 8e6085e6c56397fc030250618e0790b149047d61620b177738b9d6a7fbd33eac (84c27e996704fbfb5bc21c23e600d05
12月 26 17:34:09 host.localdomain pouchd[3326]: time="2018-12-26T17:34:09.145171058+08:00" level=warning msg="Failed deleting endpoint 1f76dc0ce9f8b2dd2d7be0a102e29d0e332228a409aba0f94bceba8c8efdd8a1: failed to get endpoint fro
12月 26 17:34:09 host.localdomain pouchd[3326]: "
12月 26 17:34:09 host.localdomain kernel: IPv6: ADDRCONF(NETDEV_UP): p0: link is not ready
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.1835] manager: (p0): new Bridge device (/org/freedesktop/NetworkManager/Devices/5)
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2494] device (p0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2558] ifcfg-rh: add connection in-memory (6e4554af-2497-4a60-b54c-32841523857e,"p0")
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2579] device (p0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2590] device (p0): Activation: starting connection 'p0' (6e4554af-2497-4a60-b54c-32841523857e)
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2613] device (p0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2618] device (p0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2621] device (p0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2652] device (p0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2660] device (p0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2663] device (p0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
12月 26 17:34:09 host.localdomain NetworkManager[2536]: <info>  [1545816849.2732] device (p0): Activation: successful, device activated.
12月 26 17:34:09 host.localdomain dbus[2512]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
12月 26 17:34:09 host.localdomain systemd[1]: Starting Network Manager Script Dispatcher Service...
12月 26 17:34:09 host.localdomain pouchd[3326]: time="2018-12-26T17:34:09.301777137+08:00" level=info msg="start to listen to: unix:///var/run/pouchd.sock"
12月 26 17:34:09 host.localdomain polkitd[2539]: Unregistered Authentication Agent for unix-process:3320:25319 (system bus name :1.21, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale zh_CN.UTF-8) (disconnect
12月 26 17:34:09 host.localdomain systemd[1]: Started pouch.
12月 26 17:34:09 host.localdomain dbus[2512]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
12月 26 17:34:09 host.localdomain systemd[1]: Started Network Manager Script Dispatcher Service.
12月 26 17:34:09 host.localdomain nm-dispatcher[3433]: req:1 'up' [p0]: new request (3 scripts)
12月 26 17:34:09 host.localdomain nm-dispatcher[3433]: req:1 'up' [p0]: start running ordered scripts...

rudyfly commented 5 years ago

@elvizlai Can you provide all the network information, for example the output of ifconfig?

elvizlai commented 5 years ago

@rudyfly This is the ifconfig output after the first initialization (the public inet address is masked with XXX).

After a reboot, p0 and vetha49ec6b (created by pouch run) are gone.

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 67.XXX.XXX.XXX  netmask 255.255.240.0  broadcast 67.230.191.255
        inet6 fe80::a8aa:ff:fe12:9bdc  prefixlen 64  scopeid 0x20<link>
        ether aa:aa:00:12:9b:dc  txqueuelen 1000  (Ethernet)
        RX packets 97729  bytes 101161122 (96.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 110256  bytes 58737804 (56.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 64  bytes 5184 (5.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 64  bytes 5184 (5.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

p0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.5.1  netmask 255.255.255.0  broadcast 192.168.5.255
        inet6 fe80::42:c0ff:fea8:501  prefixlen 64  scopeid 0x20<link>
        ether 02:42:c0:a8:05:01  txqueuelen 1000  (Ethernet)
        RX packets 97060  bytes 55013917 (52.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 79219  bytes 55202494 (52.6 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vetha49ec6b: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::f4a2:ccff:fecf:d063  prefixlen 64  scopeid 0x20<link>
        ether f6:a2:cc:cf:d0:63  txqueuelen 0  (Ethernet)
        RX packets 97060  bytes 56372757 (53.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 79240  bytes 55203964 (52.6 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
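
One way to see exactly which interfaces disappear across a reboot is to save the ifconfig output before and after and compare the interface names. The sketch below assumes both snapshots were saved to files by the caller; the function name missing_ifaces and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Prints the interfaces that appear in the first ifconfig snapshot file
# but not in the second, in their original order.
# An interface name is taken to be the first word of any line that does
# not start with whitespace, with a trailing colon removed.
missing_ifaces() {
    awk '
        /^[^[:space:]]/ {
            name = $1
            sub(/:$/, "", name)
            if (FILENAME == ARGV[1]) order[++n] = name
            else seen[name] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in seen)) print order[i]
        }
    ' "$1" "$2"
}
```

For example, run ifconfig > before.txt before the reboot and ifconfig > after.txt afterwards, then call missing_ifaces before.txt after.txt.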

rudyfly commented 5 years ago

When a container is set to restart=always, it is started while the daemon recovers. It is then added to activeSandbox, which prevents the network from being initialized, so bridge p0 is never created. Without bridge p0, the container network cannot be set up, which causes the problem.

allencloud commented 5 years ago

Do we have a solution, @rudyfly? And can we include the fix in the next release of PouchContainer, @fuweid?

fengzixu commented 5 years ago

I ran into the same problem.

fengzixu commented 5 years ago

@rudyfly

pouchrobot commented 5 years ago

Thanks for your report, @elvizlai 😱 This is a priority/P1 issue, which is the highest priority. It seems to be severe enough. Ping @alibaba/pouch, PTAL.

huangjc7 commented 5 years ago

Problem description:

[root@csv-slave13 ~]# pouch run -d -p 8099:80 dockerhub.io/hjc-image-nginx:v1.0
Error: failed to run container f1d418: {"message":"failed to create endpoint f1d41862 on network bridge: adding interface veth99f8b71 to bridge p0 failed: could not find bridge p0: route ip+net: no such network interface"}

Steps performed:

pouch network create -n pouchnet -d bridge --gateway 192.168.1.1 --subnet 192.168.1.0/24

After testing was finished:

pouch network remove pouchnet

After that, running the command fails as shown in the problem description:

[root@csv-slave13 ~]# pouch run -d -p 8099:80 dockerhub.io/hjc-image-nginx:v1.0
Error: failed to run container f1d418: {"message":"failed to create endpoint f1d41862 on network bridge: adding interface veth99f8b71 to bridge p0 failed: could not find bridge p0: route ip+net: no such network interface"}