SNAS / openbmp

OpenBMP Server Collector
www.openbmp.org
Eclipse Public License 1.0

OpenBMP dies when new router comes online #75

Closed tsetem2003 closed 2 years ago

tsetem2003 commented 5 years ago

Hi all,

Running into a really odd issue. I have a number of BMP collectors, each managing 50 to 150 routers. After they've been running for some period of time, if a new router connects to the collector, the collector "dies".

So the dying part is very odd. Looking through the logs, all I'll typically see is an error similar to

2018-11-09T03:48:57.129328 | ERROR | event_cb | Kafka error: Local: Bad message format

I've turned on debugging, and it seems like librdkafka is still going ok and trying to manage empty queues. The BMP log volume dies down significantly, but it does see the occasional router message (just much less).

I've tried adjusting Kafka settings on both ends, as well as increasing/decreasing resources on the OpenBMP side (buffers, simultaneous routers, etc). The big problem is that when it hits this issue, the process doesn't exit, so the service has to be restarted manually. And when it does restart, all of the routers (of which there are many) republish their histories, which means various monitoring tasks then get backlogged with millions of old messages.

I have other monitoring in place so I know when it goes down, but it's very often been in the middle of the night. I don't have a local watchdog that detects a downed collector and just restarts it.
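For what it's worth, a local watchdog could be as simple as a cron-driven check for new occurrences of the Kafka error quoted above. The following is only a minimal sketch: the log path, state file, and restart command are assumptions, not OpenBMP defaults, and a restart would still trigger the router history replay described earlier.

```shell
#!/usr/bin/env bash
# Watchdog sketch. All paths and the restart command are hypothetical.

# Error text taken from the collector log line quoted in this issue.
KAFKA_ERROR='Kafka error: Local: Bad message format'

# Print how many times the error appears in the given log file (0 if unreadable).
kafka_error_count() {
    local log=$1
    if [ ! -r "$log" ]; then
        echo 0
        return 0
    fi
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -F -c -- "$KAFKA_ERROR" "$log" || true
}

# Return 1 if the error count grew since the previous run, 0 otherwise.
# The previous count is kept in a small state file.
check_once() {
    local log=$1 state=$2 prev=0 now
    now=$(kafka_error_count "$log")
    if [ -r "$state" ]; then
        prev=$(cat "$state")
    fi
    printf '%s\n' "$now" > "$state"
    if [ "$now" -gt "$prev" ]; then
        return 1
    fi
    return 0
}

# Example cron usage (assumed log path and service name):
#   check_once /var/log/openbmpd.log /var/tmp/openbmp_watchdog.count \
#       || systemctl restart openbmpd
```

Because the count is compared against the previous run, a rotated log simply resets the baseline instead of triggering a restart.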

I'm wondering if anyone has seen this before and if it's just a bad configuration to look at, or if a solution requires digging through code.

Basic stats

Thanks

--Rick

TimEvens commented 5 years ago

Hi @tsetem2003,

When you say it dies, do you mean the process crashes or that it hangs? If the process is crashing, would it be possible to get me a core file? We should be able to isolate the issue very quickly with a core file.

NOTE:

Ubuntu by default uses apport, which is not available in docker. If you are using Ubuntu as your base host system for docker, then you will need to do the following to enable the docker container to produce a core file.

sysctl -w kernel.core_pattern="/tmp/core.%e.%p.%h.%t"

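The above should take effect immediately. Now when/if the process cores, it'll produce a core file in the container /tmp/ dir. You can copy that out by using docker cp openbmp_collector:/tmp/<core file name> ./ (docker cp expects CONTAINER:PATH and does not expand wildcards, so give the full file name).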
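To know which file name to look for, here is a small sketch that mimics how the kernel expands the pattern set above. Per core(5), %e is the executable name, %p the PID, %h the hostname, and %t the dump time in seconds since the epoch. The function name and the sample values are purely illustrative.

```shell
#!/usr/bin/env bash
# Illustrative only: expands the four specifiers used in the core_pattern above.
# Arguments: pattern, executable name, PID, hostname, epoch timestamp.
expand_core_pattern() {
    local out=$1 exe=$2 pid=$3 host=$4 ts=$5
    # The backslash keeps '%' literal inside the substitution pattern.
    out=${out//\%e/"$exe"}
    out=${out//\%p/"$pid"}
    out=${out//\%h/"$host"}
    out=${out//\%t/"$ts"}
    printf '%s\n' "$out"
}

# Example with hypothetical values:
#   expand_core_pattern '/tmp/core.%e.%p.%h.%t' openbmpd 1234 collector1 1541735337
```

With those hypothetical values the function prints /tmp/core.openbmpd.1234.collector1.1541735337, which is the kind of path to pass to docker cp.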

If the problem is a hang, then that's different and I believe that may be caused by the collector spinning off too many connections to Kafka.

tsetem2003 commented 5 years ago

So in this case it's a hang. If it crashed, it'd be almost easier as it's hooked up to a restart module.

This is on a VM, but it's running a customized version of Ubuntu. I can see about grabbing a core file to sort through.

Is there a way to tune the number of connections to Kafka? Unfortunately I'm a Kafka newb as well. I've tried to follow best practices, but I'm not sure if what I've done made it better or worse.

Thanks

--Rick

tsetem2003 commented 2 years ago

Marked as closed. I'm no longer using OpenBMP and moved to pmacct for my BMP needs.