Closed newrushbolt closed 1 year ago
Thank you for the lab, I will look into it.
Thanks for you response.
You are absolutely correct: exabgp.api.ack
was enabled because my env-file was misplaced and ignored.
The freeze itself happens because when API process don't read acks, PIPE buffer(64 kb by default) getting full.
That's untrivial to debug, I couldn't find a way to get used buffer size from outside of processes.
You can verify that full buffer is freeze reason this way:
1) Get PID's
# ps afxu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 16.0 0.6 261600 54364 ? Ssl 01:24 86:13 /usr/bin/python3.9 python3.9 /usr/local/bin/exabgp --debug /etc/exabgp/exabgp.conf
root 11 0.1 0.6 249228 51172 ? Sl 01:24 0:34 \_ /usr/bin/python3.9 /usr/bin/python3.9 /usr/local/bin/exabgp
root 14 0.2 0.3 197832 31068 ? Sl 01:24 1:19 \_ /bin/python3 /bin/python3 /check.py
2) Find child process STDIN fd ID, stdin is always fd=0:
# tree /proc/14/fd
/proc/14/fd
├── 0 -> pipe:[6017184]
├── 1 -> pipe:[6017185]
└── 2 -> pipe:[6019460]
3) Make sure main process is connected to this fd:
# tree /proc/1/fd | grep '6017184'
├── 7 -> pipe:[6017184]
4) Manually read stdin:
timeout 2 cat /proc/14/fd/0 > child_stdin.txt
If reading instantly unfreezes the main process, than it's PIPE buffer to blame, and child process probably not reading it's STDIN.
@thomas-mangin I'd like to further dig in this problem, and probably will return later with some kind of pull-request, that will allow to write an error to log, or even auto-restart child process if it is not reading the buffer.
confirmed as API ack. so closing.
Some time later I figured out that the problem was not only because of ack
, but also because of api.neighbor-changes
generated incoming messages to API-process, even when api.send = {}
and ack=false
.
This is kinda confusing, and took a lot of time to debug, so maybe you would consider moving neighbor-changes
into send
section.
Anyway, great thanks to you and all the contributors for making this product ❤️
Thank you @newrushbolt - opened an issue to deal with this abnormality.
Describe the bug
Main ExaBGP process is getting stuck if api process has a lot of flaps between "announce" and "withdraw".
It takes about 5 minutes when flap interval is 500ms.
Once stuck main process is not moving anywhere from one exact syscall:
Last log messages looks like this:
To Reproduce
I uploaded a complete test environment to https://github.com/newrushbolt/exabgp_stuck_example, please have a look.
But the key steps are:
withdraw x.x.x.x
orannounce x.x.x.x
.Expected behavior
ExaBGP daemon runs no matter how many times healthcheck process flaps between "announce" and "withdraw".
Environment (please complete the following information):
Additional context
All the additional information (config, envs, other packages versions) could be found in test repo https://github.com/newrushbolt/exabgp_stuck_example.