SNAS / openbmp

OpenBMP Server Collector
www.openbmp.org
Eclipse Public License 1.0

performance figures and requirements #27

Closed nskalis closed 6 years ago

nskalis commented 7 years ago

Hi,

Thank you very much for open-sourcing such a useful project. I am thinking of deploying it, but I would like to ask for your advice first.

Could you please comment on the requirements and figures in case I want to add

I am concerned about how MySQL will behave, and that it will create a lot of IO contention.

Of course, if we do not try we do not know. But I would really appreciate your feedback and some performance figures if any.

Many Thanks. Nikos

TimEvens commented 7 years ago

We've done a lot of testing with MySQL and internet (v4 and v6) peering. We've tuned MySQL to achieve the below performance, but we have decided to move away from using MySQL or any other DB for state maintaining BGP data. We will still use MySQL, Mongo, Cassandra, ... for storing static data such as geo-coding. See below details regarding current scale using MySQL. At the end see the details about the future.

There are many variables to consider when looking at performance of UPSERT's into MySQL, and any other transactional DB. FIFO order MUST be maintained by peer, which is a major factor in performance.

Scaling Methods

OpenBMP's design includes two key methods to support consumer scale (e.g. MySQL).

1) Partitions

Produced messages leverage Kafka partitions by partitioning by peer_hash_id. Peer messages will always go to the same partition in Kafka, which maintains FIFO ordering/consumption. Consumers, such as the MySQL consumer, scale using partitions to distribute load.
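The per-peer ordering guarantee above follows from keyed partitioning: a given key always maps to the same partition. A minimal sketch of the idea, assuming a simple hash-modulo scheme; `partition_for` is a hypothetical helper, and MD5 is used here only as a stable stand-in (Kafka's own default partitioner uses a different hash function):

```python
import hashlib

NUM_PARTITIONS = 6  # the AIO partition count mentioned later in this thread


def partition_for(peer_hash_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a peer hash ID to a partition index (illustrative only)."""
    # A stable hash of the key, reduced modulo the partition count, sends
    # every message for one peer to the same partition, so one consumer
    # sees that peer's messages in FIFO order.
    digest = hashlib.md5(peer_hash_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


if __name__ == "__main__":
    # Hypothetical peer hash IDs, not real OpenBMP values.
    for peer in ("peer-a", "peer-b", "peer-a"):
        print(peer, "->", partition_for(peer))
```

Different peers may share a partition; what matters is that one peer never spans two.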

2) Topics

Sharding (distribution, not load balancing as load balancing is more related to partitions) requires mapping of X's to Y. Sharding allows you to map specific messages (e.g. peers, routers, ...) to consumer sets (e.g. MySQL write server instance). We achieve sharding in OpenBMP by using advanced topic mapping which allows you to define router groups and peer groups. These groups are dynamic and customizable, see the openbmpd.conf file for details on that.

BGP Messages equal many NLRI's (e.g. UPSERTS)

A BGP message packs multiple prefixes (NLRI's). An average internet peer will have roughly 12 prefixes per UPDATE, which is roughly 50,000 - 60,000 BGP messages for 750,000 prefixes.
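As a quick check of that arithmetic, assuming exactly 12 prefixes per UPDATE (the real figure is only a rough average), with `bgp_messages_for` being a hypothetical helper:

```python
import math


def bgp_messages_for(prefixes: int, prefixes_per_update: int = 12) -> int:
    """Estimate how many BGP UPDATE messages carry the given prefix count."""
    return math.ceil(prefixes / prefixes_per_update)


if __name__ == "__main__":
    print(bgp_messages_for(750_000))  # 62500
```

That lands just above the 50,000 - 60,000 range, which is consistent with 12 being an approximation.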

Collector producing performance

The collector produces to Kafka at a sustained rate of >= 10,000 BGP messages per second per vCPU. Memory defaults to 16MB per router connection. Each message produced to Kafka will contain all the packed NLRI's in the BGP message. openbmpd therefore produces at the BGP message rate, not so much at the NLRI rate. The collector normally performs faster than the router can send messages. Disk IO for the collector is non-existent.

Router transmission rates and RIB DUMPS

A router has a maximum transmission rate of BGP messages per second. We see on average with full internet peering that the maximum BGP message rate, regardless of how many BGP peers, from a router on RIB DUMP is 10,000 - 12,000 messages per second. This is sustained only for the total number of BGP messages that must be dumped.

After RIB DUMP

RIB DUMP's are a huge hit in terms of performance, but BMP == BGP == Stateful. A BMP session in a stable environment does not incur many RIB DUMPS. Incremental updates for Internet transit peering are actually very few. We see on average only 15 updates per second per internet peer.

MySQL Performance

A single instance of the MySQL consumer can process approx. 4,800 - 12,000 NLRI's (400 - 1000 bgp messages) per second. In the AIO container we run 3 instances of the MySQL consumer with 6 Kafka partitions. We see on average roughly 20,000 NLRI's per second in the AIO container. You can scale the MySQL consumer as needed.

Your Example

If you have 40 full internet peers (~750K prefixes) then the total message count is roughly 2,400,000. For the AIO container to process this based on 3 consumers, it will take approx 22 minutes. This is only for the rib dump and does not cause congestion on the collector or routers because Kafka is the storage/buffer for this consumption. This is only using the default AIO with a small VM footprint of 8 vCPU's, 20GB RAM, and 150GB disk. If you have more vCPU and RAM, then you can scale the AIO vertically by adding more partitions and consumers.
After RIB dump you should expect to see roughly 200 - 600 sustained messages per second. This sustained rate is not much and can easily be handled by a single instance of the AIO.
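The estimate can be reproduced from the figures given earlier in this comment. This is a back-of-the-envelope sketch with hypothetical helper names, assuming 60,000 messages per peer, 12 NLRI's per message, 20,000 NLRI's per second for the AIO, and 15 updates per second per peer:

```python
def rib_dump_minutes(peers: int, messages_per_peer: int,
                     nlri_per_message: int, nlri_per_second: int) -> float:
    """Minutes for the consumers to drain a full RIB dump from Kafka."""
    total_nlri = peers * messages_per_peer * nlri_per_message
    return total_nlri / nlri_per_second / 60


def steady_state_messages_per_second(peers: int, updates_per_peer: int = 15) -> int:
    """Upper-end incremental update rate once the RIB dump is done."""
    return peers * updates_per_peer


if __name__ == "__main__":
    print(40 * 60_000)                                # 2400000 messages
    print(rib_dump_minutes(40, 60_000, 12, 20_000))   # 24.0
    print(steady_state_messages_per_second(40))       # 600
```

With these rounded inputs the dump takes about 24 minutes, in the same ballpark as the approx 22 minutes quoted, and the steady-state rate matches the top of the 200 - 600 range.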

Disk IO

Disk IO is huge with both Kafka and MySQL. Kafka can perform fine on 7200 RPM HDD's, but MySQL really needs to run on a SAN or SSD.

WARNING

The AIO container has both Kafka and MySQL. Kafka and MySQL together drive up the IOPS requirement. If disk is slow, such as with 7200 RPM HDD or virtualized disks, the AIO will not perform. It is suggested that the AIO run on SSD's or fast SAN.

Future

MySQL, Cassandra, MongoDB, etc. all have a maximum UPSERT rate of approx 20K per second. The only way to scale those DB's is to scale horizontally using clusters and sharding. Cassandra and Mongo have built-in auto sharding, but supporting a large cluster drives up the complexity and operational requirements.

BGP data is stateful and requires constant updates to records (e.g. UPSERT) which does not work with time series DB's and has average performance with general purpose DB's such as MySQL, Postgres, Cassandra, MongoDB, etc..

To address the need for high UPSERT performance and to address other challenges inherent in using a message bus (i.e. Kafka) for stateful data, we will be releasing a new app called BMP-Manager. We'll post full documentation around this soon.

BMP-Manager performance

BMP-Manager reduces the VM requirements: neither SSD nor a lot of memory is required any longer. On a single VM with 8 vCPU, 8 GB RAM, and a 200GB HDD (7200/10000 RPM) or SSD disk, a single bmp-manager instance can handle up to ~250 million NLRI's, which is about 357 full IPv4+IPv6 transit peers. The scale is in the number of NLRI's, so the number of peers can be in the thousands depending on the NLRI's received per peer. The average sustained NLRI UPSERT rate is >65,000 per second (with peaks of up to 150,000 per second). The uncompressed data size for a full IPv4/IPv6 peer (700k NLRI's) is ~410MB.
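The peer count follows directly from the two figures above. A small sketch of that division, with `full_peers_supported` as a hypothetical helper name:

```python
def full_peers_supported(max_nlri: int, nlri_per_full_peer: int) -> int:
    """Whole number of full peers that fit within an NLRI budget."""
    return max_nlri // nlri_per_full_peer


if __name__ == "__main__":
    print(full_peers_supported(250_000_000, 700_000))  # 357
```

Peers that send fewer NLRI's consume proportionally less of the budget, which is why the peer count can reach the thousands.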

nskalis commented 7 years ago

thanks a lot @TimEvens for the detailed and prompt answer. BMP-manager looks very promising, and makes me also happy that we will not have to administrate any mysql, etc.

looking forward to the new release then.

may i ask if BMP-manager is based on an open-source project, and if yes, which one (in order to get prepared for it)? and roughly when is the next release (with BMP-manager) expected (2017 Q1 maybe)?

TimEvens commented 7 years ago

@nskalis, we'll have an initial version ready late Dec or early Jan 2017. Currently it's targeted to be open source, so we'll upload it under github.com/OpenBMP/bmp-maanger once we have the initial version ready.

TimEvens commented 6 years ago

closing for now.

monimail commented 6 years ago

Hello,

Any news on this issue ?

Thanks.