b-deam opened 1 year ago
Here's an example of a benchmark where the flushing gets further and further behind:

[figure: doc count for a ~20h-long `logging-indexing-querying` execution vs. all other benchmarks run in the previous two weeks (some executing for 7-10+ hours each)]
Similar to https://github.com/elastic/rally/issues/1723, benchmarks with many concurrent/parallel tasks can generate more samples than our single-threaded metrics store flush method can handle. This causes excessive memory usage on the load driver and delays metrics delivery to the remote store.
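To make the latency-bound behaviour concrete, here's a back-of-the-envelope model (all numbers are illustrative, not measured): a single synchronous client sends one bulk request at a time, so its drain rate is capped by per-request latency, and any sample rate above that cap accumulates in the buffer every flush interval.

```python
CHUNK_SIZE = 5_000          # docs per bulk request
FLUSH_INTERVAL_S = 30       # POST_PROCESS_INTERVAL_SECONDS

def max_drain_rate(latency_s: float) -> float:
    """Docs/second one synchronous client can index when each request takes latency_s."""
    return CHUNK_SIZE / latency_s

def backlog_growth(sample_rate: float, latency_s: float) -> float:
    """Docs added to the in-memory buffer per flush interval that the client cannot drain."""
    return max(0.0, (sample_rate - max_drain_rate(latency_s)) * FLUSH_INTERVAL_S)

# Same-region store: 20 ms round trips allow up to 250k docs/s, so no backlog.
print(backlog_growth(sample_rate=10_000, latency_s=0.02))   # 0.0
# Cross-region store: 500 ms round trips cap the client at 10k docs/s,
# so the buffer grows by 60k docs every 30s interval.
print(backlog_growth(sample_rate=12_000, latency_s=0.5))    # 60000.0
```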
The `Driver` actor already wakes up every `POST_PROCESS_INTERVAL_SECONDS` (30 seconds) to flush the collected samples to the metrics store:

https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/driver/driver.py#L295-L305
https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/driver/driver.py#L950-L955
https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/driver/driver.py#L1062-L1068

Once `self._client.bulk_index()` completes, we clear the in-memory buffer:

https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/metrics.py#L924-L942

We use the `elasticsearch.helpers.bulk()` method to iterate over all the documents in the in-memory buffer, sending them in chunks of 5,000 docs at a time:

https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/metrics.py#L89-L93

This approach works fine for most benchmarks, but for challenges like `logging-indexing-querying` of the `elastic/logs` track we generate a lot of documents, meaning that this in-memory buffer is often full of tens to hundreds of thousands of documents that are indexed by a single client controlled by the `Driver` actor. This problem is exacerbated in environments where the load driver is far away from the metrics store (i.e. cross-regional), or if the metrics store itself is overloaded, because our single-client throughput is bound by the latency of each request.

There's a `parallel_bulk` helper that uses a `ThreadPool` from `multiprocessing.pool`, but the Thespian docs specifically caution against creating threads inside actors.

We could consider a few things here, but all will require extensive testing:
- Decrease `POST_PROCESS_INTERVAL_SECONDS` from 30 to allow more frequent flushes, intending to prevent the buffer from growing too large
- Increase the `chunk_size` used by the helper method to send more docs at once (i.e. 10,000), but I don't think this will make a large difference because we're still using a single client
- Trigger a flush based on `len(self.coordinator.raw_samples)` (or similar) in addition to the timer check: https://github.com/elastic/rally/blob/2470328e4167e28b6e3f47ffcf0e8700845522e9/esrally/driver/driver.py#L301
- Use the `AsyncElasticsearch` client instead of the sync client for concurrently flushing metrics, allowing us to use coroutines (and not threads or processes)
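A minimal sketch of the last option, with the bulk call stubbed out so it runs standalone: in Rally the stub would be an `elasticsearch.helpers.async_bulk()` call against an `AsyncElasticsearch` client (real elasticsearch-py APIs), while the function and variable names here are hypothetical.

```python
import asyncio

CHUNK_SIZE = 5_000

async def bulk_index(chunk: list[dict]) -> int:
    # Stand-in for async_bulk() against an AsyncElasticsearch client;
    # the sleep models one request's round-trip latency.
    await asyncio.sleep(0.01)
    return len(chunk)

async def flush(buffer: list[dict]) -> int:
    chunks = [buffer[i:i + CHUNK_SIZE] for i in range(0, len(buffer), CHUNK_SIZE)]
    # One coroutine per chunk: the requests overlap, so wall-clock time
    # approaches a single round trip instead of len(chunks) sequential ones.
    results = await asyncio.gather(*(bulk_index(c) for c in chunks))
    return sum(results)

docs = [{"n": i} for i in range(12_500)]    # 3 chunks: 5000 + 5000 + 2500
print(asyncio.run(flush(docs)))             # 12500
```

Since the flushes would still run inside the `Driver` actor's event loop rather than on extra threads or processes, this would sidestep the Thespian threading concern.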