ExaBGP process get stuck with rapid API updates

newrushbolt commented 1 year ago

Describe the bug

Main ExaBGP process is getting stuck if api process has a lot of flaps between "announce" and "withdraw".
It takes about 5 minutes when flap interval is 500ms.
Once stuck main process is not moving anywhere from one exact syscall:

strace -f -p '870194'
strace: Process 870194 attached
write(5, "neighbor 10.89.1.3 send update s"..., 129

Last log messages looks like this:

processes       route added to neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route added to neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self

To Reproduce

I uploaded a complete test environment to https://github.com/newrushbolt/exabgp_stuck_example, please have a look.

But the key steps are:

Start an ExaBGP with healthcheck-style api process that returns withdraw x.x.x.x or announce x.x.x.x.
Make sure that BGP-session using this api process is up and running
Make this api process "flap" between "announce" and "withdraw" once a second
Wait from 5 to 15 minutes

Expected behavior

ExaBGP daemon runs no matter how many times healthcheck process flaps between "announce" and "withdraw".

Environment (please complete the following information):

OS: Centos Stream 8 with kernel==4.18.0-408.el8.x86_64
Version: 4.2.21 or main(tested on commit 133143563300c9a15714c58a4fbf05d3b0f76c0f)
Python: 3.6, 3.8, 3.9

Additional context

All the additional information (config, envs, other packages versions) could be found in test repo https://github.com/newrushbolt/exabgp_stuck_example.

thomas-mangin commented 1 year ago

Thank you for the lab, I will look into it.

newrushbolt commented 1 year ago

Thanks for you response. You are absolutely correct: exabgp.api.ack was enabled because my env-file was misplaced and ignored.

The freeze itself happens because when API process don't read acks, PIPE buffer(64 kb by default) getting full.
That's untrivial to debug, I couldn't find a way to get used buffer size from outside of processes. You can verify that full buffer is freeze reason this way: 1) Get PID's

# ps afxu
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1 16.0  0.6 261600 54364 ?        Ssl  01:24  86:13 /usr/bin/python3.9 python3.9 /usr/local/bin/exabgp --debug /etc/exabgp/exabgp.conf
root          11  0.1  0.6 249228 51172 ?        Sl   01:24   0:34  \_ /usr/bin/python3.9 /usr/bin/python3.9 /usr/local/bin/exabgp
root          14  0.2  0.3 197832 31068 ?        Sl   01:24   1:19  \_ /bin/python3 /bin/python3 /check.py

2) Find child process STDIN fd ID, stdin is always fd=0:

# tree /proc/14/fd
/proc/14/fd
├── 0 -> pipe:[6017184]
├── 1 -> pipe:[6017185]
└── 2 -> pipe:[6019460]

3) Make sure main process is connected to this fd:

# tree /proc/1/fd | grep '6017184'
├── 7 -> pipe:[6017184]

4) Manually read stdin:

 timeout 2 cat /proc/14/fd/0 > child_stdin.txt

If reading instantly unfreezes the main process, than it's PIPE buffer to blame, and child process probably not reading it's STDIN.

@thomas-mangin I'd like to further dig in this problem, and probably will return later with some kind of pull-request, that will allow to write an error to log, or even auto-restart child process if it is not reading the buffer.

thomas-mangin commented 1 year ago

confirmed as API ack. so closing.

newrushbolt commented 1 year ago

Some time later I figured out that the problem was not only because of ack, but also because of api.neighbor-changes generated incoming messages to API-process, even when api.send = {} and ack=false.
This is kinda confusing, and took a lot of time to debug, so maybe you would consider moving neighbor-changes into send section.

Anyway, great thanks to you and all the contributors for making this product ❤️

thomas-mangin commented 1 year ago

Thank you @newrushbolt - opened an issue to deal with this abnormality.

Exa-Networks / exabgp

ExaBGP process get stuck with rapid API updates #1141