FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.19k stars 1.23k forks source link

bgpd from version 8.4 to 9.1 (current stable) memory leak when using default-originate route map (set large-community / set community) #14828

Open IvayloJ opened 9 months ago

IvayloJ commented 9 months ago

bgpd# show version FRRouting 9.0.1 (gazgw) on Linux(6.1.0-12-amd64). Copyright 1996-2005 Kunihiro Ishiguro, et al. configured with: '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--localstatedir=/var/run/frr' '--sbindir=/usr/lib/frr' '--sysconfdir=/etc/frr' '--with-vtysh-pager=/usr/bin/pager' '--libdir=/usr/lib/x86_64-linux-gnu/frr' '--with-moduledir=/usr/lib/x86_64-linux-gnu/frr/modules' '--disable-dependency-tracking' '--enable-rpki' '--disable-scripting' '--enable-pim6d' '--with-libpam' '--enable-doc' '--enable-doc-html' '--enable-snmp' '--enable-fpm' '--disable-protobuf' '--disable-zeromq' '--enable-ospfapi' '--enable-bgp-vnc' '--enable-multipath=256' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' 'build_alias=x86_64-linux-gnu' 'PYTHON=python3'


system top reports:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15022 frr 20 0 6820200 6.4g 5640 S 0.3 20.4 1:31.66 bgpd
1668 frr 20 0 636332 186880 2040 S 0.0 0.6 59:08.28 zebra

bgpd -> show memory

System allocator statistics:
  Total heap allocated:  > 2GB
  Holding block headers: 10252 KiB
  Used small blocks:     0 bytes
  Used ordinary blocks:  > 2GB
  Free small blocks:     3712 bytes
  Free ordinary blocks:  5282 KiB
  Ordinary blocks:       1404
  Small blocks:          74
  Holding blocks:        3
(see system documentation for 'mallinfo' for meaning)
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        6     24         144        6       144
Buffer data                   :        1 variable      4120        8     32960
Host config                   :        8 variable       304        8       304
Completion item               :        0 variable         0        1        24
Command Tokens                :    12944     72      932672    12966    934272
Command Token Text            :     9273 variable    323864     9294    330096
Command Token Help            :     9273 variable    223544     9294    224048
Command Argument              :        2 variable        48       24       576
Command Argument Name         :     2248 variable     53984     2269     54488
Command Match Stack           :        0 variable         0      125      3000
Lexer token (temporary)       :        0 variable         0        2        64
Access List                   :        1     56          56        1        56
Access List Str               :        1     14          24        1        24
Access Filter                 :        2    112         240        2       240
RCU thread                    :        2    128         272        2       272
FRR POSIX Thread              :        4 variable       320        4       320
POSIX sync primitives         :        4 variable       192        4       192
Graph                         :       41      8         984       42      1008
Graph Node                    :    15224     32      611040    15226    611120
Hash                          :      703 variable     34376      703     34376
Hash Bucket                   :   355646     32    14231408   360167  14412216
Hash Index                    :      352 variable   5645048      353   7746280
Interface                     :       14    272        3920       14      3920
Connected                     :       25     48        1416       25      1416
Link List                     :      159     40        6408      163      6576
Link Node                     :      680     24       16512      807     19608
Temporary memory              :      217 variable     11656      229     12168
Bitfield memory               :        2 variable     10256        2     10256
Nexthop                       :      237    152       36200      237     36200
Northbound Node               :      258   1192      307536      258    307536
Northbound Configuration      :        2     24          48        2        48
Northbound Configuration Entry:      372   1032      384144      372    384144
Prefix List                   :       10     88         880       10       880
Prefix List Str               :       10 variable       240       10       240
Prefix List Entry             :       44    136        6048       44      6048
Prefix List Trie Table        :       28   4096      114912       28    114912
Prefix                        :       25     56        1480       25      1480
Privilege information         :        3 variable       136        3       136
Ring buffer                   :       62 variable   3048992       62   3048992
Route map                     :       22    120        2640       22      2640
Route map name                :      171 variable      4104      172      4128
Route map index               :       92    152       14464       92     14464
Route map rule                :      201     40        8088      201      8088
Route map rule str            :      192 variable      4944      192      4944
Route map compiled            :      267 variable      6408      267      6408
Route map dependency          :       42     24        1008       42      1008
Route map dependency data     :       75     16        1800       75      1800
Skip List                     :        2     56         112        2       112
Skip Node                     :        2    160         336        2       336
Skiplist Counters             :        2     68         144        2       144
Socket union                  :       44    112        5280       44      5280
Stream                        :       49 variable   2602472    33800   7476528
Stream FIFO                   :       62     64        4464       62      4464
Route table                   :      144     56        8128      144      8128
Route node                    :       56    120        6880       56      6880
Thread                        :       81    160       13640      110     18544
Thread master                 :       12 variable     50352       12     50352
Thread Poll Info              :        6   8192       49200        6     49200
Thread stats                  :       29     96        3032       29      3032
Typed-hash bucket             :       53 variable   6304328       53   6304328
Typed-heap array              :        1    576         584        1       584
Vector                        :    30532     24      733952    30537    734072
Vector index                  :    30532 variable    963376    30537    963880
VRF                           :        1    216         216        1       216
VRF bit-map                   :        3      8          72        3        72
VTY                           :        5 variable     61256        5     61256
VTY server                    :        2     32          96        2        96
VTY output buffer             :        0   1038           0        1      1048
VTY history                   :        9 variable       216        9       216
Work queue                    :        3    144         456        3       456
Work queue item               :        0     24           0        7       168
Work queue name string        :        3 variable        72        3        72
YANG module                   :        5     48         296        5       296
YANG data structure           :        0   1032           0        1      1048
Zclient                       :        2   3144        6288        2      6288
Redistribution instance IDs   :        6      2         144        6       144
log thread-local buffer       :        2  24608       49232        2     49232
--- qmem logging subsystem ---
Type                          : Current#   Size       Total     Max#  MaxBytes
log file target               :        1     88          88        1        88
log file name                 :        1     22          24        1        24
syslog target                 :        1     56          56        1        56
--- qmem bgpd ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Peer KeepAlive Timer          :       22     24         528       22       528
BGP Peer pthread Conditional  :        1     48          56        1        56
BGP Peer pthread Mutex        :        1     40          40        1        40
Mac Hash Entry                :        7     16         168        7       168
Mac Hash Entry Intf String    :       14 variable       336       15       360
BGP instance                  :       25 variable     10760       25     10760
BGP listen socket details     :        2    144         304        2       304
BGP peer                      :       53  20864     1106216       53   1106216
BGP peer hostname             :       70 variable      1920       71      1960
Peer group                    :        4     64         288        4       288
BGP Peer group hostname       :        4      9          96        4        96
Peer description              :       26 variable       624       26       624
BGP peer af                   :       26     80        2288       26      2288
BGP update group              :       22    104        2288       22      2288
BGP update subgroup           :       22    240        5456       22      5456
BGP packet                    :       22     56        1232       32      1792
BGP attribute                 :   273809    320    89809800   273825  89815048
BGP aspath                    :    78026     40     3121184    78068   3122864
BGP aspath seg                :    78057     24     1874040    78100   1875296
BGP aspath segment data       :    78057 variable   2014600    78100   2015104
BGP aspath str                :    78027 variable   5007672    78070   5010080
BGP table                     :       87     56        4872       87      4872
BGP node                      :   607055    192   121431832   607169 121454504
BGP route                     :   558869    136    76066488   558971  76080232
BGP ancillary route info      :        1    432         440        1       440
BGP connected                 :       16      4         384       16       384
BGP static                    :        7    152        1064        7      1064
BGP adv attr                  :        0     24           0    12594    302368
BGP adv                       :        0     64           0    56940   4099760
BGP synchronise               :       22     48        1232       22      1232
BGP adj in                    :   572533     48    34468296   572635  34478776
BGP adj out                   :   514729     96    53844360   514849  53857560
BGP multipath info            :      679     48       38024      679     38024
BGP AS list                   :        1     40          40        1        40
BGP AS filter                 :        2     48         112        2       112
BGP AS filter str             :        2 variable        48        2        48
community                     :      532     40       21312      534     21408
community val                 :      532 variable     18624      534     18672
community str                 :      520 variable   4264000      521   4272216
extcommunity                  :       53     40        2136       54      2192
extcommunity val              :       53 variable      1352       54      1392
extcommunity str              :       53 variable      7720       53      7720
community-list                :       35     56        1976       35      1976
community-list name           :       35 variable       840       35       840
community-list entry          :       35     56        2120       35      2120
community-list config         :       32 variable       768       32       768
community-list handler        :        1    120         136        1       136
BGP Process queue             :        0     32           0        7       280
BGP transit attr              :        5     24         120        5       120
BGP transit val               :        5     25         200        5       200
BGP nexthop                   :      237    184       43608      237     43608
BGP regexp                    :       34     64        2448       34      2448
BGP own address               :       16     64        1152       16      1152
BGP Filter Information        :      118 variable      2896      120      2944
Large Community               : 79424644     40   3177027200 79424645 3177027240
Large Community display string:      149 variable      9128      149      9128
Large Community value         : 79424644 variable 1906212544 79424645 1906212568
BGP EVPN MH Information       :        1     56          72        1        72
Software Version              :        2     16          48        2        48
BGP Martian Addr Intf String  :       16 variable       384       16       384
BGP PBR Context               :        1     32          40        1        40
BGP interface context         :       14      4         336       14       336
BGP EVPN instance information :        1     56          56        1        56
--- qmem rfapi ---
Type                          : Current#   Size       Total     Max#  MaxBytes
NVE Configuration             :        1   2984        2984        1      2984
RFAPI Generic                 :        1    296         296        1       296
RFAPI Import Table            :        1    208         216        1       216

~ 10 minutes later

system top reports: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15022 frr 20 0 11.2g 11.0g 5640 S 0.3 35.4 2:29.07 bgpd 1668 frr 20 0 636332 186880 2040 S 0.0 0.6 59:09.59 zebra

show memory

System allocator statistics:
  Total heap allocated:  > 2GB
  Holding block headers: 10252 KiB
  Used small blocks:     0 bytes
  Used ordinary blocks:  > 2GB
  Free small blocks:     4544 bytes
  Free ordinary blocks:  5370 KiB
  Ordinary blocks:       1401
  Small blocks:          81
  Holding blocks:        3
(see system documentation for 'mallinfo' for meaning)
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        6     24         160        6       160
Buffer data                   :        1 variable      4120        8     32960
Host config                   :        8 variable       304        8       304
Completion item               :        0 variable         0        1        24
Command Tokens                :    12944     72      932672    12966    934272
Command Token Text            :     9273 variable    323864     9294    330096
Command Token Help            :     9273 variable    223544     9294    224048
Command Argument              :        2 variable        48       24       576
Command Argument Name         :     2248 variable     53984     2269     54488
Command Match Stack           :        0 variable         0      125      3000
Lexer token (temporary)       :        0 variable         0        2        64
Access List                   :        1     56          56        1        56
Access List Str               :        1     14          24        1        24
Access Filter                 :        2    112         240        2       240
RCU thread                    :        2    128         272        2       272
FRR POSIX Thread              :        4 variable       320        4       320
POSIX sync primitives         :        4 variable       192        4       192
Graph                         :       41      8         984       42      1008
Graph Node                    :    15224     32      611040    15226    611120
Hash                          :      703 variable     34376      703     34376
Hash Bucket                   :   355834     32    14239008   360167  14412216
Hash Index                    :      352 variable   5645048      353   7746280
Interface                     :       14    272        3920       14      3920
Connected                     :       25     48        1416       25      1416
Link List                     :      159     40        6408      163      6576
Link Node                     :      680     24       16512      807     19608
Temporary memory              :      217 variable     11656      229     12168
Bitfield memory               :        2 variable     10256        2     10256
Nexthop                       :      237    152       36200      237     36200
Northbound Node               :      258   1192      307536      258    307536
Northbound Configuration      :        2     24          48        2        48
Northbound Configuration Entry:      372   1032      384144      372    384144
Prefix List                   :       10     88         880       10       880
Prefix List Str               :       10 variable       240       10       240
Prefix List Entry             :       44    136        6048       44      6048
Prefix List Trie Table        :       28   4096      114912       28    114912
Prefix                        :       25     56        1480       25      1480
Privilege information         :        3 variable       136        3       136
Ring buffer                   :       62 variable   3048992       62   3048992
Route map                     :       22    120        2640       22      2640
Route map name                :      171 variable      4104      172      4128
Route map index               :       92    152       14464       92     14464
Route map rule                :      201     40        8088      201      8088
Route map rule str            :      192 variable      4944      192      4944
Route map compiled            :      267 variable      6408      267      6408
Route map dependency          :       42     24        1008       42      1008
Route map dependency data     :       75     16        1800       75      1800
Skip List                     :        2     56         112        2       112
Skip Node                     :        2    160         336        2       336
Skiplist Counters             :        2     68         144        2       144
Socket union                  :       44    112        5280       44      5280
Stream                        :       49 variable   2602472    33800   7476528
Stream FIFO                   :       62     64        4464       62      4464
Route table                   :      144     56        8128      144      8128
Route node                    :       56    120        6880       56      6880
Thread                        :       79    160       13288      110     18544
Thread master                 :       12 variable     50352       12     50352
Thread Poll Info              :        6   8192       49200        6     49200
Thread stats                  :       29     96        3032       29      3032
Typed-hash bucket             :       53 variable   6304328       53   6304328
Typed-heap array              :        1    576         584        1       584
Vector                        :    30532     24      733952    30537    734072
Vector index                  :    30532 variable    963376    30537    963880
VRF                           :        1    216         216        1       216
VRF bit-map                   :        3      8          72        3        72
VTY                           :        5 variable     61256        5     61256
VTY server                    :        2     32          96        2        96
VTY output buffer             :        0   1038           0        1      1048
VTY history                   :        1 variable        24       10       240
Work queue                    :        3    144         456        3       456
Work queue item               :        0     24           0        7       168
Work queue name string        :        3 variable        72        3        72
YANG module                   :        5     48         296        5       296
YANG data structure           :        0   1032           0        1      1048
Zclient                       :        2   3144        6288        2      6288
Redistribution instance IDs   :        6      2         144        6       144
log thread-local buffer       :        2  24608       49232        2     49232
--- qmem logging subsystem ---
Type                          : Current#   Size       Total     Max#  MaxBytes
log file target               :        1     88          88        1        88
log file name                 :        1     22          24        1        24
syslog target                 :        1     56          56        1        56
--- qmem bgpd ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Peer KeepAlive Timer          :       22     24         528       22       528
BGP Peer pthread Conditional  :        1     48          56        1        56
BGP Peer pthread Mutex        :        1     40          40        1        40
Mac Hash Entry                :        7     16         168        7       168
Mac Hash Entry Intf String    :       14 variable       336       15       360
BGP instance                  :       25 variable     10760       25     10760
BGP listen socket details     :        2    144         304        2       304
BGP peer                      :       53  20864     1106216       53   1106216
BGP peer hostname             :       70 variable      1920       71      1960
Peer group                    :        4     64         288        4       288
BGP Peer group hostname       :        4      9          96        4        96
Peer description              :       26 variable       624       26       624
BGP peer af                   :       26     80        2288       26      2288
BGP update group              :       22    104        2288       22      2288
BGP update subgroup           :       22    240        5456       22      5456
BGP packet                    :       22     56        1232       32      1792
BGP attribute                 :   273963    320    89860312   274031  89882616
BGP aspath                    :    78060     40     3122544    78090   3123744
BGP aspath seg                :    78091     24     1874856    78121   1875576
BGP aspath segment data       :    78091 variable   2015480    78121   2016552
BGP aspath str                :    78061 variable   5009240    78091   5012424
BGP table                     :       87     56        4872       87      4872
BGP node                      :   607105    192   121441880   607185 121457896
BGP route                     :   558907    136    76071688   558986  76082432
BGP ancillary route info      :        1    432         440        1       440
BGP connected                 :       16      4         384       16       384
BGP static                    :        7    152        1064        7      1064
BGP adv attr                  :        0     24           0    12594    302368
BGP adv                       :        0     64           0    56940   4099760
BGP synchronise               :       22     48        1232       22      1232
BGP adj in                    :   572571     48    34469784   572650  34478776
BGP adj out                   :   514723     96    53843848   514921  53864328
BGP multipath info            :      679     48       38024      679     38024
BGP AS list                   :        1     40          40        1        40
BGP AS filter                 :        2     48         112        2       112
BGP AS filter str             :        2 variable        48        2        48
community                     :      532     40       21312      534     21408
community val                 :      532 variable     18624      534     18672
community str                 :      520 variable   4264000      521   4272216
extcommunity                  :       53     40        2136       54      2192
extcommunity val              :       53 variable      1352       54      1392
extcommunity str              :       53 variable      7720       53      7720
community-list                :       35     56        1976       35      1976
community-list name           :       35 variable       840       35       840
community-list entry          :       35     56        2120       35      2120
community-list config         :       32 variable       768       32       768
community-list handler        :        1    120         136        1       136
BGP Process queue             :        0     32           0        7       280
BGP transit attr              :        5     24         120        5       120
BGP transit val               :        5     25         200        5       200
BGP nexthop                   :      237    184       43608      237     43608
BGP regexp                    :       34     64        2448       34      2448
BGP own address               :       16     64        1152       16      1152
BGP Filter Information        :      118 variable      2896      120      2944
Large Community                         : 143792317     40   5751753192 143792318 5751753232
Large Community display string:      149 variable      9128      149      9128
Large Community value               : 143792317 variable 3451044872 143792318 3451044896
BGP EVPN MH Information       :        1     56          72        1        72
Software Version              :        2     16          48        2        48
BGP Martian Addr Intf String  :       16 variable       384       16       384
BGP PBR Context               :        1     32          40        1        40
BGP interface context         :       14      4         336       14       336
BGP EVPN instance information :        1     56          56        1        56
--- qmem rfapi ---
Type                          : Current#   Size       Total     Max#  MaxBytes
NVE Configuration             :        1   2984        2984        1      2984
RFAPI Generic                 :        1    296         296        1       296
RFAPI Import Table            :        1    208         216        1       216

IPv4 Unicast Summary (VRF default): BGP router identifier XXX.XXX.XXX.XXX, local AS number XXXXXX vrf-id 0 BGP table version 128817 RIB entries 235456, using 43 MiB of memory Peers 13, using 265 KiB of memory Peer groups 4, using 256 bytes of memory

IPv6 Unicast Summary (VRF default): BGP router identifier XXX.XXX.XXX.XXX, local AS number XXXXXX vrf-id 0 BGP table version 254868 RIB entries 371565, using 68 MiB of memory Peers 13, using 265 KiB of memory Peer groups 4, using 256 bytes of memory

The OS is linux debian bookworm (all is upgrade to current stable). FRR is installed from package ( repository: https://deb.frrouting.org/frr)

The bgpd system memory usage keep growing until it not consume all the available memory (more than 25G after a hour or two) and kernel's oom-killer just kill the process. From what I see largest memory usage is from large community structures, in the config bgpd do set and match on every peer and peer-group (in the route maps).

I tried to debug with valgrind but it is hard, because that router is in production environment.

donaldsharp commented 9 months ago

can we see the routemaps surrounding large community usage?

IvayloJ commented 9 months ago

Sure, I send you the complete bgpd.conf on your email, because it could be related and to the other config options I have no idea yet. Just spot bgpd crashing periodically couple of hours ago and so far it is all the debug info I have...

IvayloJ commented 9 months ago

More info:

  1. The same memory leak exists in the stocked debian frr version: frr/stable,stable-security,now 8.4.4-1.1~deb12u1 amd64. With very similar config (but not same) on version 8.1 seems to work normal.
  2. Removing all 'additive' sets (keep sets but not on additive basis - clearing tags and set new lc/c tags) on large-community and on community in the route maps, drastically decreasing memory usage, also drastically decrease the rate of increasing consumed memory over the time. But I still observing increase ram usage (before bgpd was killed by the oom-killer for 1.5-2 hours, now will be approximately after 60-80 hours ). More puzzling is that 'additive' sets on normal community tags affects memory usage on large-community structures (by the 'show memory' command).
  3. I'm almost sure the problem do not come from match, it come from set. Run same match logic with same route maps on 9.0.1 with more peers and more prefixes, and there are no problems, the memory usage is 400k and not increasing to 30G :)
IvayloJ commented 9 months ago

More debug info:

The memory leak come between versions 8.3.1 and 8.4.0 !

Here are some tests Same config file, same peers same ipv4/ipv6 prefixes (approximately +/- couple from half a million prefixes ).

# show memory 

version 8.3.0

community                     :      500     40       20000      502     20080
community val                 :      500 variable     17808      502     17968
community str                 :      480   8192     3936000      481   3944200
Large Community               :     1892     40       75696     1893     75736                    <--- no change after 10 minutes work
Large Community display string:      158   8192     1295600      158   1295600
Large Community value         :     1892 variable     46240     1893     46312           <--- no change after 10 minutes work

8.3.1

community                     :      500     40       20016      501     20056
community val                 :      500 variable     17808      501     17944
community str                 :      480   8192     3936000      480   3936000
community-list                :       35     56        1960       35      1960
community-list name           :       35 variable       840       35       840
community-list entry          :       35     56        2072       35      2072
community-list config         :       32 variable       768       32       768
community-list handler        :        1    120         136        1       136
Large Community               :     1891     40       75640     1892     75680                    <--- no change after 10 minutes work
Large Community display string:      157   8192     1287400      158   1295600
Large Community value         :     1891 variable     46232     1892     46256           <--- no change after 10 minutes work

8.4.0

community                     :      500     40       20000      501     20056
community val                 :      500 variable     17776      501     17912
community str                 :      482   8192     3952400      482   3952400
community-list                :       35     56        1960       35      1960
community-list name           :       35 variable       840       35       840
community-list entry          :       35     56        2152       35      2152
community-list config         :       32 variable       768       32       768
community-list handler        :        1    120         136        1       136
Large Community               :   116733     40     4680856   116734   4680896          <-- and keep growing with the time
Large Community display string:      158   8192     1295600      158   1295600
Large Community value         :   116733 variable   2810616   116734   2810640 <-- and keep growing with the time

All statistics I am showing are collected after all neighbors was connected and all prefixes accepted/announced !

I dont know yet which part of the code cause memory leak when just set large-community, but after half a day compile and change version I can confirm the memory leak exists in all frr versions starting from 8.4.0. In first look it seems no changes in bgp_lcommunity.c and bgp_lcommunity.h. If anyone have some idea or memory what could cause this will greatly appreciate to share (will save me a lot of time to debug and reinvent the hot water :).

IvayloJ commented 7 months ago

After a lot of lurking and testing found the commit which makes the problem:
( https://github.com/FRRouting/frr/commit/8cb56fb2738605c3babe0fc191339fbb52da8655 ).

In my config frr sets large community on the route map associated to default-originate . Maybe that cause memory leak and constantly very fast increasing LC structures , maybe that patch mess and another route maps I dont know yet (not very easy to change config logic in production environment just for testing).

Meanwhile maybe someone more familiar (I will have to learn and get know all the code around first) can take a look over that commit and may have idea what is wrong ?

UPDATE: Finally I managed to simulate the leak in test environment. Here is a simple config:

!
! Zebra configuration saved from vty
!   2024/02/04 10:25:24
!
! loaded from 8.2.2
frr version 9.0.1
frr defaults traditional
!
hostname AS1
password tt
enable password tt
log file /var/log/frr/bgpd.log
!
!
!
router bgp 65501
 bgp router-id 10.185.1.31
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 no bgp graceful-restart notification
 bgp graceful-restart-disable
 no bgp network import-check
 neighbor 10.185.1.33 remote-as 65503
 neighbor 10.185.1.33 description R1
 neighbor 10.185.1.33 solo
 !
 address-family ipv4 unicast
  network 10.1.2.0/24
  network 10.1.3.0/24
  network 10.1.4.0/24
  neighbor 10.185.1.33 activate
  neighbor 10.185.1.33 default-originate route-map DEFR
  neighbor 10.185.1.33 soft-reconfiguration inbound
  neighbor 10.185.1.33 route-map T4-IN in
  neighbor 10.185.1.33 route-map T4-OUT out
 exit-address-family
 !
 address-family ipv6 unicast
 exit-address-family
!
exit
!
access-list telnet-filter seq 5 permit 127.0.0.1/32
access-list telnet-filter seq 10 deny any
!
bgp large-community-list standard defroute seq 1 permit 65503:300:300
!
!
route-map T4-IN permit 10
 set large-community 65503:500:330 additive
exit
!
route-map T4-OUT permit 5
 match large-community defroute
 set large-community 65501:300:330
exit
!
route-map T4-OUT permit 10
 set large-community 65501:400:330
exit
!
route-map DEFR permit 10
 set large-community 65501:300:300
exit
!
!
line vty
 access-class telnet-filter
exit
!

After established connection with 10.185.1.33:

AS1# show memory 
...
Large Community               :       28     40        1136       28      1136
Large Community display string:        5 variable       904        5       904
Large Community value         :       28 variable       912       28       912
...
AS1# clear bgp ipv4 10.185.1.33 

wait to connection with the peer to be establish 

AS1# show memory                
...
Large Community               :       35     40        1416       35      1416
Large Community display string:        5 variable       904        5       904
Large Community value         :       35 variable      1080       35      1264
...

As you can see structures for LC are increased, and something is not freed. This leak apply to all frr versions after commit 8cb56fb2738605c3babe0fc191339fbb52da8655 . With snapshot frr-455b860f9db5e5be186a7d1606283b3a02989c64.zip (right before 8cb56 commit) I dont see increasing of the memory because Large Communities.

IvayloJ commented 6 months ago

More debug:

The memory leak comes when use clause "set" in default-originate route map policy and is not present only when setting large-community. Setting community string in the default-originate RM also trigger huge memory leak (community structures start increasing). The worst is that once you put "set" clause in the route-map associated to default-originate the leak is triggered. After that nor delete the route-map itself nor removing route-map from the neighbor section stop increasing the memory in the bgpd process, the only you can do is to restart all frr daemons. Also I believe the things get worse if default route is learned and from other neighbor (even it is filtered on the neighbor input with prefix-list). I experienced and very strange behaviours when changing set clauses in the RM associated with default-originate. All can be simulated with very simple config (such I posted above) in a controlled environment.

I manually remove commit 8cb56fb2738605c3babe0fc191339fbb52da8655 from current stable 9.1 and for more than 3 hours dont observe massive memory increase of the bgpd process. Of course changes on the fly in default-originate RM do not work as expected, but frr is usable and stable (for now). Maybe it is good to revert that commit and update all bin packages from version 8.3.2 up to the current stable. I believe many routers in the internet which uses frr announce default-originate and not few of them uses route-map with "set" clauses. I will keep looking and knowing the code, but it is very complicated and unclear. I dont think there is a simple quick fix, maybe there is a need of total rewrite of the default-originate RM algorithm. Or someone really who knows the work logic in details to take a look.

ton31337 commented 6 months ago

I'll test this and let you know.

IvayloJ commented 6 months ago

Thank you very much Donatas ! If you have idea for a fix or make a patch, I can test it against couple versions in different environments (even in production where bgpd is very heavy loaded and complicated configs).

To saiarcot896: There are a lot of fixes for different leaks from 8.5.1 to 9.1. In all my tests, with versions from 8.4 to 9.1 I observe different memory usage and behaviour and 9.1 seems to consume least memory. Can you compile frr from source in your test environment ? If you can repeat your tests with frr 9.1 and if you observe same leak try with http://i.bglans.net/frr-stable-9.1-without_8cb56fb.tar.gz (it is standart 9.1 downloaded a week ago, and removed commit 8cb56fb2738605c3babe0fc191339fbb52da8655, that I test for 6 days and dont observe leaks in production - couple of milion ipv4+ipv6 internet routes, couple ipv4+ipv6 default routes received/announced/generated, with 15-20 peers), that will help us a lot. Also is there a default route (0.0.0.0/0 or ::/0) among these 6400 routes you are testing with ?

IvayloJ commented 6 months ago

I think maybe have an idea what cause the leak.... In bgpd/bgp_updgrp_adv.c in the function: subgroup_default_originate

                                new_ret = route_map_apply_ext(
                                        peer->default_rmap[afi][safi].map,
                                        bgp_dest_get_prefix(dest), pi, &tmp_pi,
                                        &new_pref);

This call route_map_apply_ext from lib/routemap.c ...

                                        /*
                                         * set cmds return RMAP_OKAY or
                                         * RMAP_ERROR. We do not care if
                                         * set succeeded or not. So, ignore
                                         * return code.
                                         */
                                        (void)(*set->cmd->func_apply)(
                                                set->value, prefix, set_object);

which call route_set_lcommunity -> lcommunity_dup -> lcommunity_new -> calloc new memory because in subgroup_default_originate we treat attr like blank new structure which need to be first initialized.

Seems subgroup_default_originate functions is also called when there are withdraw/announce of the default route learned and from a peer. The check for applying of the default-originate route map ( if (peer->default_rmap[afi][safi].name)... ) is done (only by "if there is a name of the route-map" ?!? not by any real change or RM name change ?) every time in the if scope and not all pointers are freed ( call bgp_attr_flush() ) in these tmp structures, also there is mess how they are used. I dont understand what is the idea behind the "pref" pointer some sort of preference or ?

At all It is posible to be done some properly free, but this is ugly and truly I think its not right. The logic is wrong, and maybe some functionality will not work this way. By me and from user's perspective the logic should be this:

There are 2 ways to send default route to a peer:

  1. Learn it from other peer and re-announce. There default route should be treated like every other prefix.
  2. Own generate and announce (neighbor XXX default-originate). default route should be treated as a high weight and highest priority (like a local network), do what default-originate route-map do. Put it in the peer flow, and after that threat it like any other prefix (from this stage to the end for 1 and 2 apply same logic)

Finally call subgroup_default_originate() which set attrs and need flags (from output route map if there are such), related with the default route which to be announce/withdraw to the peer.

Please correct me if I am wrong somewhere in the logic or missing something

ton31337 commented 6 months ago

@IvayloJ did you check this patch https://github.com/FRRouting/frr/pull/15466?

IvayloJ commented 6 months ago

Yes, already integrated Donald's patch in my testing 9.1 but it will not resolve this leak. I know what cause it, and testing since yesterday ugly (most likely and not very proper) but fast fix. The problem is in subgroup_default_originate(). When call route_map_apply_ext() it allocate new memory for data inside the stucture tmp_pi.attr (tmp_pi.attr = tmp_attr), which we lost because never free. Here is my test change in bgp_updgrp_adv.c -> subgroup_default_originate()

                                new_ret = route_map_apply_ext(
                                        peer->default_rmap[afi][safi].map,
                                        bgp_dest_get_prefix(dest), pi, &tmp_pi,
                                        &new_pref);

                                if (new_ret == RMAP_PERMITMATCH) {
                                        if (new_pref < pref) {
                                                pref = new_pref;
                                                bgp_attr_flush(new_attr);
                                                new_attr = bgp_attr_intern(
                                                        tmp_pi.attr);
//                                              bgp_attr_flush(&tmp_attr);
                                        }
                                        subgroup_announce_reset_nhop(
                                                (peer_cap_enhe(peer, afi, safi)
                                                         ? AF_INET6
                                                         : AF_INET),
                                                new_attr);
                                        ret = new_ret;
                                }
//                              else
//                                      bgp_attr_flush(&tmp_attr);
                                bgp_attr_flush(&tmp_attr);
                        }

So far can confirm few things:

  1. It will stop this leak (https://github.com/FRRouting/frr/issues/14828).
  2. Tiny memory leak will still be present (when changing sets in default-originate route map, the memory containing old set values will be lost and not freed).
  3. Another tiny memory leak still there if delete or change the name of route-map associated with default-originate, some memory will not be released.
  4. Self generated default routes (ipv4/ipv6) are out of the output route-map flow.
  5. Apply/change of default-originate route-map is working.

And now the long part :) During my tests and debug I put simple zlog_err("move here"); on top of the subgroup_default_originate(). Seems it is very "popular" function, it executes on every announce/withdraw/update message we receive and announce no matter if it is default route (self generated and received) or some other prefix. Which explain why I observed huge memory leak with many and noisy peers. That also explain https://github.com/FRRouting/frr/issues/15459 and seems Donald's commit will resolve it. From what see in the comments I think subgroup_default_originate() have been written initially to handle only default-originate very basic usage. With the time rest of the code logic was changed , subgroup_default_originate() do not follow these changes and its usage is wrong now. It can be completely scraped and its function to be handled by default flow functions and default-originate + default-originate route-map functions to be handled in early stages in the code. Or it can stay but default-originate + the route map to it, to be removed from there, and handled in separate function called from updgrp_policy_update_walkcb() && update_group_default_originate_route_map_walkcb(). Also when we set attr for the self generated 0.0.0.0/0 and ::/0 we should use bgp_attr_default_set() the associated route map to default-originate (if there are such) should only change this default attrs. Also this way will gain little speed and optimization, because the time spend in subgroup_default_originate() will be less.

In the current situation there are maybe and other potential problems:

                        for (pi = bgp_dest_get_bgp_path_info(dest); pi;
                             pi = pi->next) {
                                struct attr tmp_attr = attr;
                                tmp_pi.attr = &tmp_attr;

So far as my knowledge (very basic) are over the compilers it depending on versions, bugs compile optimization options and e.t.c. on each iteration in the loop you may or may not get new memory for the "tmp_attr" , this memory may be freed on the scope exit or on the function exit (again differ from compiler to compiler). Which means potential stack crash (depending how much stack memory the thread/process can use) Also it is not very clear if the all memory of tmp_attr will be zeroed or will be filled with some random data in it (the compiler may or may not generate assembler code which will copy the all attr in tmp_attr, may copy only populated data in the attr). So I think best will be to get out of the loop scope declaration, zeroing ( memset(&tmp_attr,0,sizeof(struct attr));), assign the pointer tmp_pi.attr = &tmp_attr , and in the loop do memcpy(&tmp_attr, &attr, sizeof(struct attr)); on each iteration. Yes longer for write but much safer for use :). Have to check and "attr" family functions in attr.c , maybe to use them is best approach for work with struct attr. Also maybe is good to have signalling when there are change in the default-originate route map, because we dont have such.

IvayloJ commented 5 months ago

@ton31337 After all my tests and debugs , maybe have a solution for this issue. It is not perfect, but at all is working and is better than to have this huge leak. When you have a time can you create a patch and backport it all versions from 8.4 above ? Now (9.1 current stable) in bgpd/bgp_updgrp_adv.c (from line 889 to line 934)

    if (peer->default_rmap[afi][safi].name) {
        struct bgp_path_info tmp_pi = {0};

        tmp_pi.peer = bgp->peer_self;

        SET_FLAG(bgp->peer_self->rmap_type, PEER_RMAP_TYPE_DEFAULT);

        /* Iterate over the RIB to see if we can announce
         * the default route. We announce the default
         * route only if route-map has a match.
         */
        for (dest = bgp_table_top(bgp->rib[afi][safi_rib]); dest;
             dest = bgp_route_next(dest)) {
            if (!bgp_dest_has_bgp_path_info_data(dest))
                continue;

            for (pi = bgp_dest_get_bgp_path_info(dest); pi;
                 pi = pi->next) {
                struct attr tmp_attr = attr;

                tmp_pi.attr = &tmp_attr;

                new_ret = route_map_apply_ext(
                    peer->default_rmap[afi][safi].map,
                    bgp_dest_get_prefix(dest), pi, &tmp_pi,
                    &new_pref);

                if (new_ret == RMAP_PERMITMATCH) {
                    if (new_pref < pref) {
                        pref = new_pref;
                        bgp_attr_flush(new_attr);
                        new_attr = bgp_attr_intern(
                            tmp_pi.attr);
                        bgp_attr_flush(tmp_pi.attr);
                    }
                    subgroup_announce_reset_nhop(
                        (peer_cap_enhe(peer, afi, safi)
                             ? AF_INET6
                             : AF_INET),
                        new_attr);
                    ret = new_ret;
                } else
                    bgp_attr_flush(&tmp_attr);
            }
        }
        bgp->peer_self->rmap_type = 0;

Change with:

        /*  During my tests and debug, spot a situation where we have .name but
         *  not and .map (almost every time on the start of the process).
         *  Maybe init is in other thread which starts slower. Don know how to
         *  handle such situation, but in all cases it is safer to enter.
         */
        if (peer->default_rmap[afi][safi].name && peer->default_rmap[afi][safi].map) {
                struct bgp_path_info tmp_pi;
                struct attr tmp_attr;
                struct bgp_dest *pm_dest = NULL;
                struct bgp_path_info *pm_pi = NULL;
                memset(&tmp_pi, 0, sizeof(struct bgp_path_info));
                memset(&tmp_attr, 0, sizeof(struct attr));
                tmp_pi.peer = bgp->peer_self;
                tmp_pi.attr = &tmp_attr;

                SET_FLAG(bgp->peer_self->rmap_type, PEER_RMAP_TYPE_DEFAULT);

                /* Let's try to get lowest rule number in the route-map
                 * Put it in IFs statements to protect from illegal reads.
                 */
                int lrm_rule = 0;
                if( peer->default_rmap[afi][safi].map->head )
                        if( peer->default_rmap[afi][safi].map->head->pref )
                                lrm_rule = peer->default_rmap[afi][safi].map->head->pref;

                /* This semi-conditional logic is sh.. - only on prefix list.
                 * Much more logical/flexible/usable from user point
                 * is to handle default route announce in the peer[s] output
                 * route-map, in the default-originate route-map only to
                 * permit set clauses for self generated default route.
                 * But what is that is. Lets iterate over the RIB (can be
                 * millions times) to check if we will have permit match
                 * to announce default route.
                 */
                for (dest = bgp_table_top(bgp->rib[afi][safi_rib]); dest;
                     dest = bgp_route_next(dest)) {
                        if (!bgp_dest_has_bgp_path_info_data(dest))
                                continue;

                        for (pi = bgp_dest_get_bgp_path_info(dest); pi;
                             pi = pi->next) {

                                /* route_map_apply_ext() allocates new memory
                                 * in tmp_pi.attr ! Better to write new func
                                 * with the same logic (much faster as well)
                                 * to do only check, and when we know what to
                                 * do then to call route_map_apply_ext()
                                */
                                new_ret = route_map_apply_ext(
                                        peer->default_rmap[afi][safi].map,
                                        bgp_dest_get_prefix(dest), pi, &tmp_pi,
                                        &new_pref);
                                if (new_ret == RMAP_PERMITMATCH) {
                                        /* pref is the rule number in the RM
                                         * On which we have match
                                         */
                                        if (new_pref < pref) {
                                                pref    = new_pref;
                                                pm_dest = dest;
                                                pm_pi   = pi;
                                        }
                                        ret = new_ret;
                                }
                                /* _MUST_ be flush on each itteration ! */
                                bgp_attr_flush(&tmp_attr);
                        }
                        /* We know which is lowest rule number (!0) in the RM
                         * and we get match on it. No need to spin more here.
                         */
                        if(ret == RMAP_PERMITMATCH && lrm_rule && lrm_rule == pref)
                                break;
                }

                /*  Final apply RM, Moving this from the loops above for speed.
                 *  pm_pi and pm_dest are pointers, must check them if they
                 *  still are here (in can they are released by another thread)
                 */
                if(ret == RMAP_PERMITMATCH && pm_pi && pm_dest ) {
                        memcpy(&tmp_attr, &attr, sizeof(struct attr));

                        new_ret = route_map_apply_ext(
                                peer->default_rmap[afi][safi].map,
                                bgp_dest_get_prefix(pm_dest), pm_pi, &tmp_pi,
                                &new_pref);

                        bgp_attr_flush(new_attr);
                        new_attr = bgp_attr_intern(&tmp_attr);
                        bgp_attr_flush(&tmp_attr);

                        subgroup_announce_reset_nhop(
                                (peer_cap_enhe(peer, afi, safi)
                                         ? AF_INET6
                                         : AF_INET),
                                new_attr);
                } else
                        ret = RMAP_DENYMATCH;

                bgp->peer_self->rmap_type = 0;

14828.txt

This patch also will have significant speed impact in cases where there are no match clause in the default-originate RM. If you have idea how to check what is the condition of the first rule (permit/deny) will be great. So if for example we have in the config.

route-map XXX deny 1
exit

where obviously is pointless to spin these loops, ( in case of full internet bgp tables can take ~ 2 millions * RM-rules-number iterations).