Shopify / statsd-instrument

A StatsD client for Ruby apps. Provides metaprogramming methods to inject StatsD instrumentation into your code.
http://shopify.github.io/statsd-instrument
MIT License
574 stars 97 forks source link

Wake up the dispatch thread once the buffer is past a threshold #309

Closed casperisfine closed 2 years ago

casperisfine commented 2 years ago

Fix: https://github.com/Shopify/statsd-instrument/issues/297

The current sleep loop can lead to large memory usage if lots of events are queued rapidly. In such scenario the buffer can grow very large until the threads wake up and flush it.

To avoid this, after enqueueing a message, if the buffer reached a threshold we explicitly wake up the dispatch thread to trigger a flush.

casperisfine commented 2 years ago

cc @vahe

casperisfine commented 2 years ago

What would be ideal would be to test this branch on the app that was suffering from increased memory usage.

casperisfine commented 2 years ago

We tested this in production today with @vahe. Unfortunately it's not quite enough, the app still generates events much faster than they can be sent.

We saw the @buffer go over 1M entries. The background thread does consume events, but it appear that the problem is that if you generate too much events, then the Ruby scheduler won't let that thread take all the CPU time (because it tries to be fair).

So clearly some metrics need to be sampled, however I think we need to have a better error condition when this happens. Right now it's "not a problem" on that app, because basically sending metrics is blocking, so if you emit too many metrics you slow down the whole app rather than to grow memory.

So I think this PR is an improvement, but we also need to but some kind of max size on the buffer and either block, error out or drop events when the buffer is full.

I'll be looking at this tomorrow.

casperisfine commented 2 years ago

Ok, so I improved this PR futher:

If an application emit so much metrics that the dispatcher thread can't keep up, the buffer may grow unbounded.

This is particularly aggravated by the the thread scheduler that tries not to give the dispatcher thread a disproportionate slice of CPU.

To counter act that, this commit introduce a max_queue_size. Once reached emitter thread will wait for the buffer to be emptied.

This means that applications emitted too many events will be slowed down like they would with the regular UDP Sink. However the only alternative is to drop events.

Additionally, after flushing the buffer, the dispatcher thread now immediately check wether the threshold was reached again. This should improve throughput, as threads may have filled the buffer while the thread was doing IOs.

We should however explore other ways to increase the UDP throughput.