FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

netlink-dp error: No buffer space available (on system startup) #11091

Closed rnurgaliyev closed 1 year ago

rnurgaliyev commented 2 years ago

Describe the bug

We run a small EVPN with ~1200 type-2 routes. The problematic node peers with two route reflectors via iBGP:

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
1.2.3.4     4      65217      1421       164        0    0    0 00:19:45         1186       20 N/A
4.3.2.1     4      65217      1413       164        0    0    0 00:19:45         1181       20 N/A

When I do a cold start of FRR, zebra reports a lot of the following messages:

Apr 25 16:02:32 router zebra[2960]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: No buffer space available, type=RTM_NEWNEIGH(28), seq=1116, pid=3199244503
Apr 25 16:02:32 router zebra[2960]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: No buffer space available, type=RTM_NEWNEIGH(28), seq=1117, pid=3199244503
Apr 25 16:02:32 router zebra[2960]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: No buffer space available, type=RTM_NEWNEIGH(28), seq=1118, pid=3199244503
Apr 25 16:02:32 router zebra[2960]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: No buffer space available, type=RTM_NEWNEIGH(28), seq=1119, pid=3199244503
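To see which netlink message types are failing and how often, the log lines above can be tallied with a small script. This is only a sketch: the regular expression is derived from the four sample lines in this report, and the names `LOG_RE` and `summarize` are made up for illustration, not part of FRR.

```python
import re
from collections import Counter

# Pattern inferred from the sample zebra log lines in this issue (an assumption,
# not an official log format specification).
LOG_RE = re.compile(
    r"netlink-dp \(NS (?P<ns>\d+)\) error: (?P<err>[^,]+), "
    r"type=(?P<type>\w+)\((?P<num>\d+)\), seq=(?P<seq>\d+), pid=(?P<pid>\d+)"
)

SAMPLE = """\
Apr 25 16:02:32 router zebra[2960]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: No buffer space available, type=RTM_NEWNEIGH(28), seq=1116, pid=3199244503
Apr 25 16:02:32 router zebra[2960]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: No buffer space available, type=RTM_NEWNEIGH(28), seq=1117, pid=3199244503
"""


def summarize(lines):
    """Count netlink-dp errors per (message type, error text) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            counts[(match["type"], match["err"])] += 1
    return counts


if __name__ == "__main__":
    for (msg_type, err), count in summarize(SAMPLE.splitlines()).items():
        print(f"{msg_type}: {err} x{count}")
```

Feeding it the full journal output for zebra (instead of `SAMPLE`) would show whether only RTM_NEWNEIGH requests are affected or other message types as well.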

As expected given these errors, most of the MAC addresses are not added to the kernel forwarding table.

I initially suspected net.core.rmem_default, net.core.rmem_max, and zebra's --nl-bufsize argument. As a test, I set the first two to 200MB and the third to 16MB. It did not have any effect.
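To double-check that the sysctl values actually took effect after boot, something like the following could be used. It is a rough sketch: `read_rmem_limits` and `effective_rcvbuf` are hypothetical helper names, and the second function assumes a plain SO_RCVBUF request, which Linux caps at rmem_max and then doubles for bookkeeping overhead (see socket(7)). Whether zebra's --nl-bufsize goes through exactly that path is not confirmed here.

```python
from pathlib import Path


def read_rmem_limits(proc_root="/proc/sys/net/core"):
    """Return (rmem_default, rmem_max) in bytes as reported under proc_root."""
    root = Path(proc_root)
    return tuple(
        int((root / name).read_text().split()[0])
        for name in ("rmem_default", "rmem_max")
    )


def effective_rcvbuf(requested, rmem_max):
    """Approximate buffer size granted for a plain SO_RCVBUF request.

    Assumption: the request is capped at rmem_max and the result is doubled,
    as described in socket(7). A privileged SO_RCVBUFFORCE request would not
    be capped this way.
    """
    return 2 * min(requested, rmem_max)


if __name__ == "__main__":
    rmem_default, rmem_max = read_rmem_limits()
    wanted = 16 * 1024 * 1024  # the 16MB value tried with --nl-bufsize
    print("rmem_default:", rmem_default)
    print("rmem_max:    ", rmem_max)
    print("SO_RCVBUF estimate:", effective_rcvbuf(wanted, rmem_max))
```

If the numbers look right and the errors persist, that is at least consistent with the receive buffer of zebra's socket not being the limiting factor.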

What I don't understand is this: if I reset the BGP peers with clear bgp * or simply restart FRR completely (all daemons: zebra, bgpd, etc.), everything is fine. I don't see a lower rate of netlink messages; everything is more or less the same, but no errors are logged and all MAC addresses are in the kernel. It only happens during a "cold" start, when the system has just booted up.

Errors are visible in zebra data plane statistics:

# sh zebra dplane detailed
Zebra dataplane:
Route updates:            981
Route update errors:      0
Other errors       :      0
Route update queue limit: 200
Route update queue depth: 0
Route update queue max:   202
Dplane update yields:     12
LSP updates:              0
LSP update errors:        0
PW updates:               0
PW update errors:         0
Intf addr updates:        0
Intf addr errors:         0
EVPN MAC updates:         691
EVPN MAC errors:          292
EVPN neigh updates:       592
EVPN neigh errors:        283
Rule updates:             0
Rule errors:              0
Bridge port updates:      0
Bridge port errors:       0
IPtable updates:             0
IPtable errors:              0
IPset updates:             0
IPset errors:              0
IPset entry updates:             0
IPset entry errors:              0
Neighbor Table updates:       0
Neighbor Table errors:        0
GRE set updates:       0
GRE set errors:        0

After zebra restart:

# sh zebra dplane detailed
Zebra dataplane:
Route updates:            1138
Route update errors:      0
Other errors       :      0
Route update queue limit: 200
Route update queue depth: 0
Route update queue max:   202
Dplane update yields:     12
LSP updates:              0
LSP update errors:        0
PW updates:               0
PW update errors:         0
Intf addr updates:        0
Intf addr errors:         0
EVPN MAC updates:         684
EVPN MAC errors:          0
EVPN neigh updates:       594
EVPN neigh errors:        3
Rule updates:             0
Rule errors:              0
Bridge port updates:      0
Bridge port errors:       0
IPtable updates:             0
IPtable errors:              0
IPset updates:             0
IPset errors:              0
IPset entry updates:             0
IPset entry errors:              0
Neighbor Table updates:       0
Neighbor Table errors:        0
GRE set updates:       0
GRE set errors:        0
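The two outputs are easier to compare as error ratios. Below is a minimal sketch that parses the `Name: value` lines of `sh zebra dplane detailed` into a dictionary; `parse_dplane` and `error_ratio` are illustrative names, and whether the "updates" counters already include the failed updates is not something this report confirms, so the ratio is only a rough indicator.

```python
COLD_START = """\
Zebra dataplane:
Route updates:            981
Route update errors:      0
Other errors       :      0
EVPN MAC updates:         691
EVPN MAC errors:          292
EVPN neigh updates:       592
EVPN neigh errors:        283
"""

AFTER_RESTART = """\
Zebra dataplane:
Route updates:            1138
Route update errors:      0
Other errors       :      0
EVPN MAC updates:         684
EVPN MAC errors:          0
EVPN neigh updates:       594
EVPN neigh errors:        3
"""


def parse_dplane(text):
    """Map each 'Name: <integer>' line to an int; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def error_ratio(stats, kind):
    """Errors divided by updates for a counter family such as 'EVPN MAC'."""
    updates = stats.get(f"{kind} updates", 0)
    errors = stats.get(f"{kind} errors", 0)
    return errors / updates if updates else 0.0


if __name__ == "__main__":
    for label, text in (("cold start", COLD_START), ("after restart", AFTER_RESTART)):
        stats = parse_dplane(text)
        for kind in ("EVPN MAC", "EVPN neigh"):
            print(f"{label:14s} {kind:11s} {error_ratio(stats, kind):.1%}")
```

The excerpts embedded here are shortened copies of the outputs above; the full command output can be pasted in unchanged.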

Can someone please give me a hint or point me at the part of the code which I could try to debug?

Versions

donaldsharp commented 2 years ago

This is a tough one, as these error messages are coming from the kernel. I am not aware of what we should even be checking on the kernel side to increase the buffers.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

frrbot[bot] commented 1 year ago

This issue will be automatically closed in the specified period unless there is further activity.