b-deam opened 1 year ago
Here's an example of a benchmark where the flushing gets further and further behind:

[figure: doc count for a ~20h-long `logging-indexing-querying` execution vs. all other benchmarks run in the previous two weeks (some executing for 7-10+ hours each)]
Similar to https://github.com/elastic/rally/issues/1723, benchmarks with many concurrent/parallel tasks can generate more samples than our single-threaded metrics store flush method can handle. This causes excessive memory usage on the load driver and delays metrics delivery to the remote store.
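To make the latency-bound behaviour concrete, here's a back-of-the-envelope model (all numbers are illustrative, not measured): a single synchronous client sends one bulk request at a time, so its drain rate is capped by per-request latency, and any sample rate above that cap accumulates in the buffer every flush interval.

```python
CHUNK_SIZE = 5_000          # docs per bulk request
FLUSH_INTERVAL_S = 30       # POST_PROCESS_INTERVAL_SECONDS

def max_drain_rate(latency_s: float) -> float:
    """Docs/second one synchronous client can index when each request takes latency_s."""
    return CHUNK_SIZE / latency_s

def backlog_growth(sample_rate: float, latency_s: float) -> float:
    """Docs added to the in-memory buffer per flush interval that the client cannot drain."""
    return max(0.0, (sample_rate - max_drain_rate(latency_s)) * FLUSH_INTERVAL_S)

# Same-region store: 20 ms round trips allow up to 250k docs/s, so no backlog.
print(backlog_growth(sample_rate=10_000, latency_s=0.02))   # 0.0
# Cross-region store: 500 ms round trips cap the client at 10k docs/s,
# so the buffer grows by 60k docs every 30s interval.
print(backlog_growth(sample_rate=12_000, latency_s=0.5))    # 60000.0
```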
The `Driver` actor already wakes up every `POST_PROCESS_INTERVAL_SECONDS` (30 seconds) to flush the collected samples to the metrics store:

https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/driver/driver.py#L295-L305
https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/driver/driver.py#L950-L955
https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/driver/driver.py#L1062-L1068

Once `self._client.bulk_index()` completes, we clear the in-memory buffer:

https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/metrics.py#L924-L942

We use the `elasticsearch.helpers.bulk()` method to iterate over all the documents in the in-memory buffer, sending them in chunks of 5,000 docs at a time:

https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/metrics.py#L89-L93

This approach works fine for most benchmarks, but for challenges like `logging-indexing-querying` of the `elastic/logs` track we generate a lot of documents, meaning that this in-memory buffer is often full of tens to hundreds of thousands of documents that are indexed by a single client controlled by the `Driver` actor. This problem is exacerbated in environments where the load driver is far away from the metrics store (i.e. cross-regional), or if the metrics store itself is overloaded, because our single-client throughput is bound by the latency of each request.

There's a `parallel_bulk` helper that uses a `ThreadPool` from `multiprocessing.pool`, but the Thespian docs specifically caution against creating threads inside actors.

We could consider a few things here, but all will require extensive testing:
- Decrease `POST_PROCESS_INTERVAL_SECONDS` from 30 to allow more frequent flushes, intending to prevent the buffer from growing too large
- Increase the `chunk_size` used by the helper method to send more docs at once (i.e. 10,000), but I don't think this will make a large difference because we're still using a single client
- Trigger a flush based on `len(self.coordinator.raw_samples)` (or similar) in addition to the timer check: https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/driver/driver.py#L301
- Use the `AsyncElasticsearch` client instead of the sync client for concurrently flushing metrics, allowing us to use coroutines (and not threads or processes)
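A minimal sketch of the last option, with the bulk call stubbed out so it runs standalone: in Rally the stub would be an `elasticsearch.helpers.async_bulk()` call against an `AsyncElasticsearch` client (real elasticsearch-py APIs), while the function and variable names here are hypothetical.

```python
import asyncio

CHUNK_SIZE = 5_000

async def bulk_index(chunk: list[dict]) -> int:
    # Stand-in for async_bulk() against an AsyncElasticsearch client;
    # the sleep models one request's round-trip latency.
    await asyncio.sleep(0.01)
    return len(chunk)

async def flush(buffer: list[dict]) -> int:
    chunks = [buffer[i:i + CHUNK_SIZE] for i in range(0, len(buffer), CHUNK_SIZE)]
    # One coroutine per chunk: the requests overlap, so wall-clock time
    # approaches a single round trip instead of len(chunks) sequential ones.
    results = await asyncio.gather(*(bulk_index(c) for c in chunks))
    return sum(results)

docs = [{"n": i} for i in range(12_500)]    # 3 chunks: 5000 + 5000 + 2500
print(asyncio.run(flush(docs)))             # 12500
```

Since the flushes would still run inside the `Driver` actor's event loop rather than on extra threads or processes, this would sidestep the Thespian threading concern.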