avito-tech / bioyino

High performance and high-precision multithreaded StatsD server
The Unlicense

Feature request: Sharding metrics #71

Open penguinlav opened 1 year ago

penguinlav commented 1 year ago

Hi! Our bioyino cluster consumes a lot of resources on each host, and the cluster nodes will probably run out of resources soon. Are there any plans to shard the collected metrics?

Albibek commented 1 year ago

Hi. Are you using the agent-based approach so far? You could probably shard your metrics per agent first.

UPD: I thought I had published documentation for the agent-based approach, but it turns out I haven't. We do mention it in this article, though.

The idea is to have separate non-clustered bioyino instances listening on UDP that pre-aggregate metrics and send them to a cluster. In this case some of the cluster's load is offloaded to the agents. It can also help with manual sharding if you have, e.g., groups of servers that are easy to separate from each other metric-wise: you could point these groups to different agents and point the agents to different clusters. If your case doesn't fit what I've described above, it would be great if you could give more details or a usage example, and I could propose a solution for you.
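
To make the topology concrete, here is a minimal, self-contained Rust sketch of the agent idea (not bioyino's actual implementation; the upstream address, the 2-second flush period, and the counter-only parsing are assumptions for illustration). The agent listens on UDP, combines counters locally, and periodically forwards one snapshot line per metric to a cluster node:

```rust
// Illustrative sketch of the agent idea, NOT bioyino's real code.
use std::collections::HashMap;
use std::io::Write;
use std::net::{TcpStream, UdpSocket};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Shared map of counter name -> combined value.
    let counters: Arc<Mutex<HashMap<String, f64>>> = Arc::new(Mutex::new(HashMap::new()));

    // Flusher thread: every 2 seconds send the aggregated snapshot upstream.
    let flush_map = Arc::clone(&counters);
    thread::spawn(move || loop {
        thread::sleep(Duration::from_secs(2));
        let snapshot: Vec<(String, f64)> = {
            let mut map = flush_map.lock().unwrap();
            map.drain().collect()
        };
        if snapshot.is_empty() {
            continue;
        }
        // "cluster-node-1:2003" is a placeholder address of a cluster member.
        if let Ok(mut conn) = TcpStream::connect("cluster-node-1:2003") {
            for (name, value) in snapshot {
                // One pre-combined line per metric instead of every raw packet.
                let _ = writeln!(conn, "{}:{}|c", name, value);
            }
        }
    });

    // UDP listener: parse simple "name:value|c" counters and combine them in place.
    let socket = UdpSocket::bind("0.0.0.0:8125")?;
    let mut buf = [0u8; 65535];
    loop {
        let (len, _src) = socket.recv_from(&mut buf)?;
        let text = String::from_utf8_lossy(&buf[..len]);
        for line in text.lines() {
            if let Some((name, rest)) = line.split_once(':') {
                if let Some((value, "c")) = rest.split_once('|') {
                    if let Ok(v) = value.parse::<f64>() {
                        *counters.lock().unwrap().entry(name.to_string()).or_insert(0.0) += v;
                    }
                }
            }
        }
    }
}
```

The point is that each cluster node receives a handful of pre-combined lines per flush instead of every original UDP packet from every client.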

I was also thinking about "pure" sharding based on hash rings or something similar, but could not find a good case for it. Usually what we actually wanted was some kind of routing of incoming metrics based on a prefix or on the source rather than on a hash. All these many ways of distributing metrics among different destinations seemed to me worth an entire separate product, something similar to carbon-c-relay. Adding all the possible routing algorithms to bioyino would bloat the codebase too much.
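
For illustration only (none of this exists in bioyino), here is a short Rust contrast between the two styles mentioned above: explicit prefix rules, which are easy to reason about but need manual upkeep, versus plain hashing, which spreads load automatically but remaps a large share of metrics whenever the node set changes (the reason hash rings exist in the first place). The rules and destination addresses are made up:

```rust
// Illustrative only: two ways of routing a metric name to a destination.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route by prefix: explicit rules, easy to reason about, but needs manual upkeep.
fn route_by_prefix<'a>(metric: &str, rules: &[(&str, &'a str)], default: &'a str) -> &'a str {
    rules
        .iter()
        .find(|(prefix, _)| metric.starts_with(*prefix))
        .map(|(_, dest)| *dest)
        .unwrap_or(default)
}

/// Route by hash: even spread with no rules, but changing the node list
/// remaps a large share of metrics.
fn route_by_hash<'a>(metric: &str, nodes: &[&'a str]) -> &'a str {
    let mut hasher = DefaultHasher::new();
    metric.hash(&mut hasher);
    nodes[(hasher.finish() % nodes.len() as u64) as usize]
}

fn main() {
    let rules = [("web.", "cluster-a:2003"), ("db.", "cluster-b:2003")];
    let nodes = ["node-1:2003", "node-2:2003", "node-3:2003"];
    for m in ["web.nginx.requests", "db.postgres.tps", "batch.jobs.failed"] {
        println!(
            "{m}: prefix -> {}, hash -> {}",
            route_by_prefix(m, &rules, "cluster-default:2003"),
            route_by_hash(m, &nodes)
        );
    }
}
```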

penguinlav commented 1 year ago

Yes, we already use this approach with local aggregation nodes, and we are close to the resource limit on the master nodes.

We can increase the period between sending snapshots (currently 2 s). As far as I can see, this will help reduce CPU consumption for snapshot merging by reducing the number of snapshots per aggregation interval (when metrics are sent to carbon).

But what if there were a sharding feature? That would give practically unlimited ability to scale the number of metrics collected by bioyino :)

Albibek commented 1 year ago

To be clear, the snapshot interval is regulated by snapshot-interval, not by interval. But you are right, it may reduce some load thanks to lower TCP connection overhead and a smaller total number of TCP connections.
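
For reference, a hypothetical config excerpt showing the two options side by side; only the option names come from this thread, while their placement in the file, the units, and the values are assumptions for illustration:

```toml
# Aggregation period: how often aggregated results are sent to carbon
# (value and units are placeholders).
interval = 30000

# How often snapshots are sent to the cluster; raising this trades metric
# freshness for fewer snapshot merges and fewer TCP connections
# (value and units are placeholders).
snapshot-interval = 2000
```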

Sharding is not so easy when it comes to failing nodes. Let's say you have a 3-node cluster and 10 agents. How would you distribute data among them, and how would you expect to configure it? What behavior do you expect when one or two cluster nodes fail?
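
For what it's worth, one common answer to the "who gets what, and what happens on failure" question is rendezvous (highest-random-weight) hashing. The sketch below is purely illustrative and not something bioyino implements; the agent and node names are made up. Each agent independently picks the live node with the highest score for its own key, so when a node fails only the agents that were mapped to that node move, and they spread roughly evenly over the survivors:

```rust
// Illustrative rendezvous hashing sketch, not a bioyino feature.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Score an (agent, node) pair; the agent goes to its highest-scoring live node.
fn score(agent: &str, node: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (agent, node).hash(&mut h);
    h.finish()
}

/// Pick the highest-scoring node among those still alive.
fn pick_node<'a>(agent: &str, nodes: &[(&'a str, bool)]) -> Option<&'a str> {
    nodes
        .iter()
        .filter(|(_, alive)| *alive)
        .max_by_key(|(node, _)| score(agent, node))
        .map(|(node, _)| *node)
}

fn main() {
    let agents: Vec<String> = (1..=10).map(|i| format!("agent-{i}")).collect();

    // 3-node cluster, then the same cluster with node-2 marked as failed.
    let healthy = [("node-1", true), ("node-2", true), ("node-3", true)];
    let degraded = [("node-1", true), ("node-2", false), ("node-3", true)];

    for agent in &agents {
        println!(
            "{agent}: healthy -> {:?}, node-2 down -> {:?}",
            pick_node(agent, &healthy),
            pick_node(agent, &degraded)
        );
    }
}
```

The open questions above still apply, of course: who maintains the liveness view, and whether the remaining nodes can absorb the extra load when one or two of them fail.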