PQ default 64mb page could hold fewer elements than a configured large batch size

colinsurprenant commented 4 years ago

The Problem

By default PQ uses queue.page_capacity: 64mb which, in our experience, should not be changed and proved to be a good balance in terms of performance and vm/mmap IO pressure.

That said, a 64mb page limits the number of events a single PQ data page can hold and workers reading from PQ will try to maximise batch sizes upto the configured pipeline.batch.size BUT a batch will never contain events across pages. For example, if a page holds 1000 events and the batch.size is set to 5000, only 1000 events will be returned from the PQ read operation and the batch will have a size of 1000.

This should not generally be a concern when using the default small pipeline.batch.size: 125; in most contexts, there will be orders of magnitude more events in a single page.

But for configurations using a very large batch.size, that we typically see when users might want to use a large(r) bulk size when indexing using the elasticsearch output plugin (which uses the configured batch.size as the indexing bulk size), we could end in a situation where the batch.size is similar or event bigger than to number of events a single PQ page can hold. This can lead to 2 potential problems:

if a page hold slightly more events than the configured batch.size then a first PQ read operation will return a full batch but the next read operation will return small batch with only the remaining unread events in that page.
If a page holds less events than the configured batch.size then each read operation will return less events and if changes are made to increase the batch.size, it will not have any effect and not translate in larger bulk size for exemple when using the elasticsearch output.

How to Diagnose

Currently one way to see ho many events are into each pages is by running bin/pqcheck when logstash is not running and look at the elementCount=XXX on each page, unless data shape changes a lot, this number should be similar across pages.

Ideally a single page should hold a good multiple of batch.size events, probably a minimum of around 5X.

Example Sizes

Some examples:

using the generator input plugin which generated very small events, a single 64mb page holds about 200 000 events.
using a smallish json document with ~20 fields a single 64mb page holds about 10 000 events.

Suggestion

I think we should find a way to report the PQ read operations in relation to the batch.size when pages holds a very small multiple of batch.size event or less than batch.size. I am not sure that a systematic WARN log is a good idea since it risks flooding the logs. Maybe read sizes could be reported in monitoring?

andsel commented 4 years ago

We could introduce 2 new metrics:

count the number of pulls we do from pages. For example if a page has 5000 events and the batch.size is 1000 this counter for this page is 5
count the number of almost empty pulls. Almost empty means we pull data from page and the number of events we pull is <= batch.size If the ratio between the first and the second goes down to less then 2 or 3 we should discover such a problem.

The optimal solution is too pull as much data as the filter section requires, loading/mapping all the pages that we need to accomplish this; however this could lead to pressure the paging part and memory, loading a lot of data from disk.

colinsurprenant commented 4 years ago

Good suggestions @andsel. Supporting multi-pages reads will not be simple to do and I am not sure about the real benefit of supporting this versus tuning the batch size and the page size according to the data shape (given better insights). In fact, I have not seen very large batch sizes to offer performance improvement.

karenzone commented 4 years ago

If this will take a while to fix properly, we could add info in the Troubleshooting section or Best Practices section of the docs. Let me know if that makes sense.

yaauie commented 4 years ago

For almost-empty, we should differentiate between and almost-empty batch resulting from a time-out waiting for enough events to fill a batch (low volume) and an almost-empty batch resulting from the tail end of a page.

andsel commented 4 years ago

Good point @yaauie , in the case we want to warned, it happens that the final pointer of the data are always at the 64Mb edge limit, while for slow writer the final pointer of the retrieved data, is seldom near the end of the page. So our "almost-empty" counter definition could be: pull data from the queue retrieving less data then batch.size and the pull consumes the page.

jsvd commented 4 years ago

We can count, per pipeline:

batches.total: - The number of batches consumed from the queue
batches.full: - The number of batches that have full capacity
batches.timed_out: - Number of batches that were produced due to the batch.delay cut off

From batches.total - batches.timed_out - batches.full we can know how many were cut off due to a page boundary.

This is actually a solid starting point if we ever decide to do adaptive batch sizes, either by event count or event size. With the memory constraints in mind, batch sizes can usually be increased up until batches.timed_out starts increasing.

colinsurprenant commented 4 years ago

@yaauie yes agreed, good point, I did not have the almost empty use-case in mind.

@jsvd I like that! Seems like these metrics should provide all the visibility needed to understand the dynamic of batch sizes vs page size etc. Adaptive batch size might be too much of a step IMO because there are other constrains to an optimal batch size than just its PQ related behaviour but we could definitely either log some hints or derive some performance hints in the metrics UI for example.

elastic / logstash