elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Throttling Beats for system stability #17775

Open mostlyjason opened 4 years ago

mostlyjason commented 4 years ago

Describe the enhancement: Several users have filed issues requesting the ability to throttle Beats, usually in order to improve system stability and reduce the impact on other applications. Since Beats are a monitoring application, they should not interrupt critical business applications. We'd like to evaluate all these requests and determine the best plan for implementation.

Describe a specific use case for the enhancement or feature: There are several types of resources that users are concerned about:

There are several ways to mitigate these issues.

I'm listing them here together because a limit on one may indirectly impose a limit on the others. Thus it may be possible to solve many (but perhaps not all) of these problems with a single solution. There are different ways to implement each of these limits, and pros and cons to each one. We should evaluate each to determine the best solution that will help the most customers.

Why system tools fall short

Historically we have preferred to rely on system tools for rate limiting/QoS because they give operators more control. However, these tools are not accessible to all users, they may be difficult for operators to set up or configure, and operators may need to implement a variety of solutions across heterogeneous systems. Providing even a simple limit is better than nothing for users who want an out-of-the-box solution.

Currently, our docs give an example of how to configure limits using tc and iptables. See: https://www.elastic.co/guide/en/beats/filebeat/current/bandwidth-throttling.html#bandwidth-throttling. Also, something that works today is to limit the Beat to a single CPU core via the max_procs setting.
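For reference, the CPU-side limit is a one-line change in the Beat's configuration (shown here for filebeat.yml; the same top-level setting exists for the other Beats):

```yaml
# filebeat.yml — restrict the Beat to a single CPU core.
# max_procs caps the number of CPUs that may execute simultaneously
# (it maps to Go's GOMAXPROCS).
max_procs: 1
```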

With systemd one can configure ExecStartPre and ExecStopPost scripts in the unit files. This allows users to install/remove rules as part of the service startup. Unfortunately systemd has removed the NetClass setting, requiring users to fall back to the tc tool. On Linux one can also make use of net_prio + cgroups (e.g. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/sec-prioritizing_network_traffic), but AFAIK this is not easy to integrate with systemd, and it looks like cgroups v2 has not really settled on the supported cgroup controllers.
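As a sketch of that systemd approach, a drop-in unit could install a token bucket filter (tbf) qdisc before Filebeat starts and remove it after it stops. The drop-in path, device name eth0, and the 1mbit rate are placeholders to adapt:

```ini
# /etc/systemd/system/filebeat.service.d/throttle.conf (illustrative path)
[Service]
# Shape egress on eth0 while the service runs; adjust device and rate.
ExecStartPre=/sbin/tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms
ExecStopPost=/sbin/tc qdisc del dev eth0 root
```

Note that this shapes all traffic on the device, not just the Beat's, which is one reason per-service mechanisms like net_prio + cgroups are attractive despite the integration difficulties mentioned above.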

Being able to configure the bandwidth outside the Beat also allows users to more easily adapt the rate based on time or internal policies. The F5 docs (like those of other network vendors), for example, describe how to do traffic shaping: https://techdocs.f5.com/en-us/bigip-15-1-0/big-ip-access-policy-manager-network-access/shaping-traffic-on-the-network-access-client.html

Deciding on an approach

There are multiple ways to set limits, and there are pros and cons to each approach. In general, bandwidth limitation can be static or dynamic, or even a mix of the two driven by a predefined schedule. One customer is asking for rate limiting in the application (Beats). This is easy to configure from the user's point of view, but what we can support is limited: do we limit based on the number of events or on bytes? Currently, for both Beats and Logstash, the unit of work is the event. Until an event hits the outputs we cannot tell its actual byte size. Applying limiting in the output would be possible to some extent, but is also limited. For some outputs we have no control over the network clients and their setup; without any control over the sockets, we cannot limit by byte usage at all.

Different outputs may need different limits. The Kafka client also does the batching itself, without us being able to assert much control at all. Even with rate limiting applied before the Kafka client, our limits cannot be accurate, because the batching leads to spikes/bursts, especially if the remote service was unavailable for some time. For the other outputs we create our own connections and can measure bandwidth usage, but even then the rate limiting would not be able to take network protocol overhead into account.

Giving network packets a dynamic priority has the advantage of a dynamic bound: give other applications a higher priority, but if the bandwidth is available right now to ingest more data, then do so. To be accurate, though, this can only be decided by the OS or the network, not by the Beat. Not having enough bandwidth available at all leads to data loss or, in the case of Filebeat, to Filebeat not closing file descriptors (because not all events have been published yet).

A long time ago we created a proposal for event scheduling in PR https://github.com/elastic/beats/pull/7082. We dropped the proposal/PR because it would not have solved all possible requirements. Due to batching/buffering, bandwidth limitation must be applied explicitly in the outputs or at the network layer. This means we would have to implement support for limiting bandwidth per output, where the client library in use allows us to do so. We should either reconsider this approach or identify a better one to solve this issue.

elasticmachine commented 4 years ago

Pinging @elastic/integrations (Team:Integrations)

zez3 commented 4 years ago

Has anyone seen this token bucket Go implementation? https://github.com/ozonru/filebeat-throttle-plugin

jsoriano commented 4 years ago

Another use case: in Cloud Foundry you may want to set an event rate limit per organization, or in Kubernetes per namespace.

jsoriano commented 3 years ago

Beats 7.11 will include a rate_limit processor: https://github.com/elastic/beats/pull/22883
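Minimal configuration for that processor (the limit value here is just an example):

```yaml
processors:
  - rate_limit:
      limit: "10000/m"   # at most 10000 events per minute; excess events are dropped
```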

zez3 commented 3 years ago

> Events that exceed the rate limit are dropped

Good addition, but throttling would be even better

ycombinator commented 3 years ago

@zez3 as I replied to your comment in https://github.com/elastic/beats/pull/22883#issuecomment-745630735:

Thanks for the feedback, @zez3. For now, we are starting with a rudimentary implementation that'll drop events that exceed the rate limit. In the future we may add other strategies like the ones you suggest, either as options to this processor or as separate processors.

PhaedrusTheGreek commented 3 years ago

Another component of this discussion might be event size, as it falls under the category of ingestion controls for pipeline stability. For example, in one instance an organization succeeded by restricting event sizes to 500 KB at Kafka.
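An event-size cap of that kind is simple to express as a filter ahead of the output. A self-contained sketch (names are illustrative, not Beats APIs):

```go
package main

import "fmt"

// dropOversized filters out events larger than maxBytes, mirroring the
// idea of capping event size (e.g. 500 KB at Kafka) to protect the
// pipeline from pathological single events.
func dropOversized(events [][]byte, maxBytes int) (kept [][]byte, dropped int) {
	for _, e := range events {
		if len(e) > maxBytes {
			dropped++
			continue
		}
		kept = append(kept, e)
	}
	return kept, dropped
}

func main() {
	events := [][]byte{
		make([]byte, 100),
		make([]byte, 500*1024+1), // just over the 500 KB cap
		make([]byte, 1024),
	}
	kept, dropped := dropOversized(events, 500*1024)
	fmt.Println(len(kept), dropped) // 2 1
}
```

Whether to drop, truncate, or reject such events back to the source is the same policy question raised for the rate_limit processor above.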

zez3 commented 3 years ago

The future is now. We will most probably move to the Elastic Stack this year and we kind of need this throttling implemented. How can I, as a current or potential future customer, influence the development of this feature?

cakarlen commented 4 months ago

Any updates to this issue? I would add that the ability to throttle certain Fleet integrations would be handy, as some integrations are more resource-intensive than others. Maybe this would apply to a Fleet agent policy more so than to an individual Fleet integration?

zez3 commented 4 months ago

> Any updates to this issue? I would add that the ability to throttle certain Fleet integrations would be handy, as some integrations are more resource-intensive than others. Maybe this would apply to a Fleet agent policy more so than to an individual Fleet integration?

@cakarlen Please see the work done in https://github.com/elastic/beats/issues/35615

zez3 commented 4 months ago

Also follow https://github.com/elastic/elastic-agent-shipper/issues/16

zez3 commented 4 months ago

I would close this issue now that the shipper is almost functional

@jsoriano