elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

modelindexer doesn't ever respond with 503 to APM agents #6953

Open marclop opened 2 years ago

marclop commented 2 years ago

Description

While working on the adaptive sampling project during OnWeek, I observed that the APM Server never returned a 503 to the APM agents, even when the transaction rate was quite high (300,000 transactions per minute, or 5,000 transactions per second).

The current modelindexer architecture differs from the previously used libbeat queues, and won't cause the APM Server's HTTP server to respond to the APM agents with a 503, since there is no "full queue" concept.

When all the bulk indexers are busy flushing, the indexer may have no available bulk indexer to retrieve from the available buffered channel: https://github.com/elastic/apm-server/blob/bf75ef5e0fbebc6a65f8095231175ff95eb26cd7/model/modelindexer/indexer.go#L232-L236

This means that the APM Server will continue accepting events from the APM agents, and the current goroutine will most likely be put to sleep while it waits to receive from the channel. This isn't necessarily bad, but it may increase memory consumption, since there is no bound on how many batches may be waiting for a bulk indexer to be released back after it has been flushed.

EDIT: The APM Server will return a 503 when the context is cancelled because it can't process the ingest requests in time, but we may want to apply some back pressure before that happens.
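For illustration, here is a minimal Go sketch of the pattern described above. The names (`indexer`, `available`, `bulkIndexer`, `acquire`) are illustrative only and not the actual modelindexer code: processing blocks on a buffered channel until a bulk indexer is free, and only a cancelled request context turns into an error that the server can map to a 503.

```go
// Illustrative sketch only; not the actual modelindexer implementation.
package sketch

import "context"

// bulkIndexer stands in for a buffer of events that gets flushed to
// Elasticsearch in a single bulk request.
type bulkIndexer struct{}

// indexer holds its idle bulk indexers in a buffered channel; when all of
// them are busy flushing, the channel is empty.
type indexer struct {
	available chan *bulkIndexer
}

// acquire blocks until a bulk indexer is free, or until the caller's context
// (ultimately the agent's HTTP request context) is cancelled, in which case
// the error propagates up and the server responds with a 503.
func (i *indexer) acquire(ctx context.Context) (*bulkIndexer, error) {
	select {
	case bi := <-i.available:
		return bi, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```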

Possible solutions

Since we're going to be using the new modelindexer from 8.0.0 onwards, it may be good to explore ways to limit the time we wait for a bulk indexer to become available, and to measure and quantify the impact that this has.
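One possible shape for this, reusing the illustrative types from the sketch above (the `errFull` sentinel, the timeout parameter, and the 503 mapping are hypothetical, not existing apm-server configuration), is to cap the wait with a timer and return an error that the HTTP layer can translate into a 503. The snippet assumes the `context`, `errors`, and `time` imports.

```go
// errFull is a hypothetical sentinel the HTTP layer could map to a 503.
var errFull = errors.New("modelindexer: all bulk indexers are busy")

// acquireWithTimeout is like acquire, but gives up after maxWait so the
// server can apply back pressure before the full HTTP request timeout.
func (i *indexer) acquireWithTimeout(ctx context.Context, maxWait time.Duration) (*bulkIndexer, error) {
	timer := time.NewTimer(maxWait)
	defer timer.Stop()
	select {
	case bi := <-i.available:
		return bi, nil
	case <-timer.C:
		return nil, errFull
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```

Measuring how often the timeout fires at different ingest rates would be one way to quantify how much back pressure this introduces.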

stuartnelson3 commented 2 years ago

My initial thought is to add a timeout and then have some sort of disk storage, i.e. wait an initial N seconds for i.available, and if no bulk indexer becomes available in that time, fall back to writing the events to disk.

Depending on the container configuration, events written to disk may count against the container's memory anyway, but I suppose it would be up to the user to configure that properly.
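Roughly the shape I have in mind, as a sketch (everything here is hypothetical, including the spill-file handling and the wait duration; it is not real apm-server code):

```go
// Hypothetical sketch of the timeout-then-spill idea.
package sketch

import (
	"context"
	"os"
	"time"
)

// bulkIndexer stands in for a buffer of events flushed to Elasticsearch.
type bulkIndexer struct{}

func (b *bulkIndexer) add(doc []byte) error { return nil }

type indexer struct {
	available chan *bulkIndexer
	spill     *os.File      // append-only overflow file, to be replayed later
	waitFor   time.Duration // the "initial N seconds" above
}

// process waits up to waitFor for a free bulk indexer; if none frees up,
// it appends the encoded event to the spill file instead of blocking.
func (i *indexer) process(ctx context.Context, doc []byte) error {
	timer := time.NewTimer(i.waitFor)
	defer timer.Stop()
	select {
	case bi := <-i.available:
		err := bi.add(doc)
		i.available <- bi // hand the bulk indexer back
		return err
	case <-timer.C:
		// In a container this disk usage may still be accounted as memory
		// (e.g. on tmpfs), as noted above.
		_, err := i.spill.Write(doc)
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

The spill file would presumably be replayed once bulk indexers free up again; that part is left out of the sketch.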

axw commented 2 years ago

@stuartnelson3 +1 having overflow to disk would be great.

The current modelindexer architecture differs from the previously used libbeat queues, and won't cause the APM Server's HTTP server to respond to the APM agents with a 503, since there is no "full queue" concept.

Maybe semantics, but I think it does. We just don't respond with "queue is full" after an arbitrary one second timeout, but instead we now wait for the HTTP request timeout.

When all the bulk indexers are busy flushing, the indexer may have no available bulk indexer to retrieve from the available buffered channel:

...

This means that the APM Server will continue accepting events from the APM agents, and the current goroutine will most likely be put to sleep while it waits to receive from the channel. This isn't necessarily bad, but it may increase memory consumption, since there is no bound on how many batches may be waiting for a bulk indexer to be released back after it has been flushed.

This processing of events happens synchronously, in the same goroutine that receives the events from agents. If there are no buffers available, the server will block until one is available or the HTTP request times out; this will block the receipt of more events from that agent/connection.

So is the issue here that you could have many concurrent agents/connections for a longer duration? Is this an issue in practice?

marclop commented 2 years ago

Maybe semantics, but I think it does. We just don't respond with "queue is full" after an arbitrary one second timeout, but instead we now wait for the HTTP request timeout.

That's very true; I initially overlooked the importance of HTTP timeouts when opening this issue. I don't think we need any immediate action here, or that this is likely to cause issues.

This processing of events happens synchronously, in the same goroutine that receives the events from agents. If there are no buffers available, the server will block until one is available or the HTTP request times out; this will block the receipt of more events from that agent/connection.

So is the issue here that you could have many concurrent agents/connections for a longer duration? Is this an issue in practice?

Thinking about it some more, it seems unlikely that this change will become an issue in a production scenario, where the number of APM agents sending data to an APM Server isn't likely to cause a problem unless the number of agents is astronomically high.

stuartnelson3 commented 2 years ago

Thinking about it some more, it seems unlikely that this change will become an issue in a production scenario, where the number of APM agents sending data to an APM Server isn't likely to cause a problem unless the number of agents is astronomically high.

I'm guessing a single machine's number of available file descriptors for incoming connections is going to exceed its ability to service those requests. But that's just an assumption; I'm not sure whether customers prefer to scale horizontally or vertically.

simitt commented 2 years ago

We should test this with the new benchmarking framework that @marclop is working on, once available.

simitt commented 2 years ago

Related issue: https://github.com/elastic/apm-server/issues/7504