FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

static routes not removed from kernel #16666

Closed ardenisov closed 4 days ago

ardenisov commented 2 months ago

Description

Static routes configured through the vty shell (vtysh) are not removed from the kernel after an frr restart.

Version

9.1.1

How to reproduce

  1. configure static route

    vtysh
    conf t
    ip route 100.70.1.254/32 Null0
  2. check route in kernel

    ip r | grep 100.70.1.254
    blackhole 100.70.1.254 proto 196 metric 20
  3. stop frr

    sudo docker stop frr
  4. check route in kernel

    ip r | grep 100.70.1.254
    blackhole 100.70.1.254 proto 196 metric 20
  5. start frr

    sudo docker start frr
  6. check route in frr

    vtysh
    show ip route 100.70.1.254/32
    Routing entry for 100.70.1.254/32
    Known via "static", distance 1, metric 0, best
    Last update 00:07:13 ago
    * unreachable (blackhole)
  7. try to delete static route from frr

    vtysh
    conf t
    no ip route 100.70.1.254/32 Null0
    % Refusing to remove a non-existent route
    ip route 100.70.1.254/32 Null0
    ERROR: SET_CONFIG request failed, Error: Only inactive VRFs can be deleted
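
Steps 2 and 4 above can be scripted. This is a minimal sketch, assuming the routes installed by staticd carry protocol number 196 as in the output above. `stale_static_routes` is a hypothetical helper name; it reads `ip route` output on stdin, so it can be tried without touching the kernel.

```bash
#!/bin/bash
# Print the lines of `ip route` output that carry a given protocol value.
# The default of 196 is taken from the output in this report (assumption:
# it is the same on your system).
stale_static_routes() {
    local proto=${1:-196}
    awk -v p="$proto" '{
        # look for the token pair "proto <value>" anywhere on the line
        for (i = 1; i < NF; i++)
            if ($i == "proto" && $(i + 1) == p) { print; next }
    }'
}
```

Typical use would be `ip route | stale_static_routes` after stopping frr; any line printed is a route that was left behind. If your iproute2 prints a protocol name instead of the number, pass that name as the first argument.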

Expected behavior

Static routes should be deleted from the kernel when frr is stopped.

Actual behavior

Static routes are still in the kernel even after frr is stopped.

Additional context

error in logs

2024/08/27 14:04:36 STATIC: [MHYBZ-5A04C][EC 100663334] error processing configuration change: error [validation] event [validate] operation [destroy] xpath [/frr-vrf:lib/vrf[name='vrf-2001606']] message: Only inactive VRFs can be deleted
2024/08/27 14:04:36 STATIC: [KFEJ3-7JXVF] BE-CLIENT: mgmt_be_txn_cfg_prepare: ERROR: Failed to validate configs txn-id: 1 1 batches, err: 'Only inactive VRFs can be deleted'
2024/08/27 14:04:36 MGMTD: [G7XEF-QM9RV] mgmt_txn_notify_be_cfgdata_reply: ERROR: CFGDATA_CREATE_REQ sent to 'staticd' failed txn-id: 1 batch-id 1 err: Only inactive VRFs can be deleted
2024/08/27 14:04:36 MGMTD: [GGJTQ-VTT01] SET_CONFIG request for client 0xd failed, Error: 'Only inactive VRFs can be deleted'

kernel

5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

docker

Docker version 25.0.3, build 4debf41

frr_runnig.txt frr_startup.txt


riw777 commented 2 months ago

How are you starting Zebra? Can you give us the container script that starts FR/R?

ardenisov commented 2 months ago

> How are you starting Zebra? Can you give us the container script that starts FR/R?

#!/bin/bash

if [ -r "/lib/lsb/init-functions" ]; then
        . /lib/lsb/init-functions
else
        log_success_msg() {
                echo "$@"
        }
        log_warning_msg() {
                echo "$@" >&2
        }
        log_failure_msg() {
                echo "$@" >&2
        }
fi

source /usr/lib/frr/frrcommon.sh
/usr/lib/frr/watchfrr $(daemon_list)

frrcommon.txt

ps aux | grep frr
    1 root      0:01 /sbin/tini -- /usr/lib/frr/docker-start
    7 root      0:00 {docker-start} /bin/bash /usr/lib/frr/docker-start
   11 root      0:20 /usr/lib/frr/watchfrr zebra mgmtd bgpd staticd bfdd
  159 frr       0:05 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl
  165 frr       0:01 /usr/lib/frr/mgmtd -d -F traditional
  167 frr       0:22 /usr/lib/frr/bgpd -d -F traditional -A 127.0.0.1
  174 frr       0:01 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1
  177 frr      15:29 /usr/lib/frr/bfdd -d -F traditional -A 127.0.0.1

Darwin4053 commented 1 month ago

  1. configure static route

    vtysh
    conf t
    ip route 100.70.1.254/32 Null0
  2. check route in kernel

    ip r | grep 100.70.1.254
    blackhole 100.70.1.254 proto 196 metric 20
  3. stop frr

    sudo docker stop frr
  4. check route in kernel

    ip r | grep 100.70.1.254
    blackhole 100.70.1.254 proto 196 metric 20
  5. start frr

    sudo docker start frr
  6. check route in frr

    vtysh
    7f4ad6eb72fb# show ip route 100.70.1.254/32
    Routing entry for 100.70.1.254/32
    Known via "static", distance 1, metric 0, best
    Last update 00:00:33 ago

ardenisov commented 1 month ago

@riw777 @Darwin4053 Do you know how to debug route updates in the kernel while frr is stopped?

ardenisov commented 1 month ago

> 1. configure static route
>
>     vtysh
>     conf t
>     ip route 100.70.1.254/32 Null0
> 2. check route in kernel
>
>     ip r | grep 100.70.1.254
>     blackhole 100.70.1.254 proto 196 metric 20
> 3. stop frr
>
>     sudo systemctl stop frr
> 4. check route in kernel
>
>     ip r | grep 100.70.1.254
>
>    I didn't see any route here.
> 5. start frr
>
>     sudo systemctl start frr
> 6. check route in frr
>
>     vtysh
>     7f4ad6eb72fb# show ip route 100.70.1.254/32
>     Routing entry for 100.70.1.254/32
>     Known via "static", distance 1, metric 0, best
>     Last update 00:00:33 ago
>     * unreachable (blackhole), weight 1
> 7. try to delete static route from frr
>
>     7f4ad6eb72fb(config)# no ip route 100.70.1.254/32 Null0
>     7f4ad6eb72fb(config)#
>     7f4ad6eb72fb(config)# exit
>     7f4ad6eb72fb#
>     7f4ad6eb72fb# show ip route 100.70.1.254/32
>     % Network not in table
>     7f4ad6eb72fb# exit
>     frr@7f4ad6eb72fb:/$ ip r | grep 100.70.1.254
>     frr@7f4ad6eb72fb:
>
> I followed the above steps to reproduce. The static route is successfully deleted from the kernel.

What version of frr did you test? I have this problem with 9.1.1, but not with 8.5.

ardenisov commented 1 month ago

@askorichenko hello! Can you help me? Is the fix below applicable to routes in the default VRF table? https://github.com/FRRouting/frr/pull/15570/commits/69f07fab28b32846a95571eb7404ef870cc3784c I see in pull request https://github.com/FRRouting/frr/pull/15424 that you reproduced the bug in the default VRF table, but the commit above contains some VRF-related code. Also, could it be that your fix does not cover static routes with a Null0 (blackhole) next hop configured through vtysh?

Darwin4053 commented 1 month ago

Signal handling is inconsistent under docker: when SIGINT/SIGTERM is passed to staticd, the route is sometimes cleared and sometimes not.

ardenisov commented 1 month ago

@Darwin4053 staticd somehow receives SIGKILL instead of SIGINT/SIGTERM, even though /sbin/tini is used as the ENTRYPOINT in the docker image.

ppoll([{fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=10, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=6, events=POLLIN}], 6, NULL, [], 8 <unfinished ...>) = ?
+++ killed by SIGKILL +++

As I can see in the tini logs, it only reaps the watchfrr process correctly with SIGTERM, but all other processes in the container end up with SIGKILL.

[DEBUG tini (1)] Passing signal: 'Terminated'
[TRACE tini (1)] No child to reap
[DEBUG tini (1)] Received SIGCHLD
[DEBUG tini (1)] Reaped child with pid: '7'
[INFO  tini (1)] Main child exited with signal (with signal 'Terminated')
[TRACE tini (1)] No child to reap
[TRACE tini (1)] Exiting: child has exited

frr processes in docker for example

ps a
PID   USER     TIME  COMMAND
    1 root      0:00 /sbin/tini -vvv -- /usr/lib/frr/docker-start
    7 root      0:00 {docker-start} /bin/bash /usr/lib/frr/docker-start
   11 root      0:00 /usr/lib/frr/watchfrr zebra mgmtd bgpd staticd bfdd
   27 frr       0:01 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl
   33 frr       0:00 /usr/lib/frr/mgmtd -d -F traditional
   35 frr       0:00 /usr/lib/frr/bgpd -d -F traditional -A 127.0.0.1
   42 frr       0:00 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1
   45 frr       0:02 /usr/lib/frr/bfdd -d -F traditional -A 127.0.0.1

even though all frr daemons have tini (PID 1) as their parent

cat /proc/27/status | grep PPid
PPid:   1
cat /proc/33/status | grep PPid
PPid:   1
cat /proc/35/status | grep PPid
PPid:   1
cat /proc/42/status | grep PPid
PPid:   1
cat /proc/45/status | grep PPid
PPid:   1

another look at tini's children

pgrep -lP 1
7 /bin/bash
27 /usr/lib/frr/zebra
33 /usr/lib/frr/mgmtd
35 /usr/lib/frr/bgpd
42 /usr/lib/frr/staticd
45 /usr/lib/frr/bfdd

ardenisov commented 1 month ago

@Darwin4053 @riw777 Hello! I confirmed with the tini contributors that it should work with the -g option, which sends the signal to all children in its process group. But as I see in my container, every daemon has its own pgid.

ps -o pid,ppid,pgid,comm
PID   PPID  PGID  COMMAND
    1     0     1 tini
    7     1     7 docker-start
   11     7     7 watchfrr
   27     1    27 zebra
   33     1    33 mgmtd
   35     1    35 bgpd
   42     1    42 staticd
   45     1    45 bfdd
  117     0   117 bash
  135   117   135 ps

I also found in the watchfrr code that it sets a different pgid for every daemon: https://github.com/FRRouting/frr/blob/master/watchfrr/watchfrr.c#L321 How can I work around this watchfrr behaviour?
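
To see which processes a signal sent to one process group would miss, the `ps` table above can be filtered by PGID. This is only a sketch: `outside_group` is a hypothetical helper name, and it assumes the four-column `pid,ppid,pgid,comm` layout with one header line, as shown above.

```bash
#!/bin/bash
# Read `ps -o pid,ppid,pgid,comm` output on stdin and print "PID COMMAND"
# for every process whose PGID differs from the group given as $1.
outside_group() {
    awk -v g="$1" 'NR > 1 && $3 != g { print $1, $4 }'
}
```

For example, `ps -o pid,ppid,pgid,comm | outside_group 7` lists every process that is not in group 7, the group shared by docker-start and watchfrr in the table above. With that table, the list includes zebra, mgmtd, bgpd, staticd and bfdd.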

ardenisov commented 4 days ago

Hello! I have some updates. I removed tini as the entrypoint, because it doesn't help to stop the frr daemons cleanly. I also added some code to the docker-start file so that it traps the TERM signal, forwards it to watchfrr and flushes the static routes from the kernel.
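
The exact code added to docker-start is not shown in this thread. The following is a minimal sketch of the trap-and-forward pattern described above, not the author's actual change. The function names `run_supervised` and `flush_static_routes` are hypothetical, and the protocol number 196 is taken from the `ip r` output earlier in this issue.

```bash
#!/bin/bash
# Sketch only: forward SIGTERM to a supervised command, wait for it to exit,
# then remove leftover routes. Function names are hypothetical.

# Flush IPv4 routes tagged with protocol 196 (assumption: this is the value
# your kernel shows for staticd routes, as in this issue's output).
flush_static_routes() {
    ip -4 route flush proto 196
}

# Run "$@" in the background so that bash itself can handle signals.
run_supervised() {
    "$@" &
    local child=$!
    # Forward TERM to the child instead of dying immediately.
    trap "kill -TERM $child 2>/dev/null" TERM
    # `wait` returns early with a non-zero status when a trapped signal
    # arrives, so keep waiting until the child is really gone.
    while kill -0 "$child" 2>/dev/null; do
        wait "$child" || true
    done
    trap - TERM
    flush_static_routes
}
```

In a docker-start script like the one posted earlier in this thread, the last line would then become something like `run_supervised /usr/lib/frr/watchfrr $(daemon_list)`, with bash running as PID 1 so that it receives the SIGTERM sent by `docker stop`. Whether the individual daemons shut down in time after watchfrr gets the signal is not covered by this sketch.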

Darwin4053 commented 4 days ago

I have only tested 9.1.1.