I have been looking at many ingestion flame charts recently, in the context of TSDB and merging changes. They highlighted a few things we could do to speed up indexing a bit, so I thought it would be good to start collecting a list of work items that would help speed up ingestion via _bulk. Feel free to add other things you know of.
[ ] Do segment-based replication rather than document-based replication (see the stateless blog)
[ ] Asynchronously write data to disk While indexing is mostly CPU-bound, flame graphs indicate that indexing threads sometimes sit idle waiting for data to be written to disk. We could make our translog and directory's IndexOutput asynchronous so that threads of the indexing thread pool would not be wasted waiting for I/O.
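A minimal stdlib-only sketch of the hand-off (the AsyncWriter class and its single I/O thread are illustrative, not Elasticsearch's actual translog code): indexing threads enqueue the write and move on, and only block when a durability point such as acking a _bulk request requires it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: indexing threads hand off writes and keep indexing; a dedicated I/O
// thread performs the actual disk write, returning a future that is awaited
// only when durability is required.
class AsyncWriter implements AutoCloseable {
    private final FileChannel channel;
    private final ExecutorService ioThread = Executors.newSingleThreadExecutor();

    AsyncWriter(Path path) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    /** Enqueue a write; the calling (indexing) thread is not blocked on disk I/O. */
    CompletableFuture<Void> write(byte[] data) {
        return CompletableFuture.runAsync(() -> {
            try {
                ByteBuffer buf = ByteBuffer.wrap(data);
                while (buf.hasRemaining()) {
                    channel.write(buf);
                }
            } catch (IOException e) {
                throw new java.io.UncheckedIOException(e);
            }
        }, ioThread);
    }

    /** Durability point: wait for the given pending writes, then fsync. */
    void sync(CompletableFuture<?>... pending) throws IOException {
        CompletableFuture.allOf(pending).join();
        channel.force(false);
    }

    @Override
    public void close() throws IOException {
        ioThread.shutdown();
        channel.close();
    }
}
```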
[ ] Selectively disable duplicate field name detection Our JSON parsers are configured to fail on duplicate field names. This is good practice, but it's also not completely free and sometimes not necessary, e.g. when TSDB retrieves routing fields on the coordinating node: field uniqueness will get checked when parsing the document on the shard anyway, so maybe we could skip duplicate checks on the coordinating node?
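To illustrate where the cost comes from, here is a stdlib-only sketch of the per-object bookkeeping a strict parser does (the FieldNameValidator class is hypothetical; in Jackson the real knob is the STRICT_DUPLICATE_DETECTION parser feature): every field name costs a hash-set probe and insert, which a coordinating node could skip when the shard will re-check anyway.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the duplicate-field check a strict JSON parser performs: a
// per-object set of seen names, probed for every field. Disabling the check
// (checkDuplicates = false) removes one hash insert per field, which adds up
// when small documents get parsed twice (coordinating node + shard).
class FieldNameValidator {
    private final boolean checkDuplicates;
    private final Set<String> seen = new HashSet<>();

    FieldNameValidator(boolean checkDuplicates) {
        this.checkDuplicates = checkDuplicates;
    }

    /** Called once per field name while streaming through an object. */
    void onFieldName(String name) {
        if (checkDuplicates && !seen.add(name)) {
            throw new IllegalArgumentException("Duplicate field: " + name);
        }
    }

    /** Called when the enclosing object ends. */
    void onObjectEnd() {
        seen.clear();
    }
}
```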
[x] Use the new LongField/DoubleField/KeywordField Lucene benchmarks suggested that using a single field for both indexing and doc values performs better than using one field for indexing and a separate field for doc values. #93165 #93579
[ ] Reuse Lucene fields A best practice when optimizing for ingestion rate consists of reusing the Lucene fields across documents.
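A sketch of both of the previous two items together, assuming Lucene 9.x (LongField constructors have changed across 9.x minors, so check the exact signature for your version; the class and field names here are illustrative): one LongField provides both the index and the doc values, and the Document/Field objects are allocated once and mutated per document instead of being re-created on the hot indexing path.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.IndexWriter;

// Sketch: a single LongField replaces a LongPoint + NumericDocValuesField
// pair, and the same Document/Field instances are reused for every document.
class ReusedFieldsSketch {
    static void indexTimestamps(IndexWriter writer, long[] timestamps) throws IOException {
        Document doc = new Document();
        LongField timestamp = new LongField("@timestamp", 0L); // indexed + doc values in one field
        doc.add(timestamp);
        for (long value : timestamps) {
            timestamp.setLongValue(value); // reuse: mutate in place, don't reallocate
            writer.addDocument(doc);
        }
    }
}
```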
[ ] Skip indexing an _id IDs are one of the most expensive fields to index and merge, so their overhead is not negligible for small documents or cheap mappings (e.g. mostly runtime fields). Could we skip indexing IDs for append-only data?
[ ] Skip writing a _recovery_source While disabling _source saves the disk usage of the _source field, we still pay its indexing cost, since a _recovery_source gets indexed instead and is only eventually removed through background merges. Could we avoid writing a _recovery_source entirely in some cases, e.g. when the source is synthetic?
[x] Reduce merging pressure Our merging defaults are quite aggressive. Merging less would in turn help ingest data faster. One way to do it would consist of increasing the merge factor. #94134
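At the index level, the merge factor maps to the merge policy's segments-per-tier knob. A hypothetical settings request, assuming Elasticsearch's index.merge.policy.segments_per_tier setting (the index name and the value 20 are illustrative, not a recommendation; higher values tolerate more segments per tier, i.e. less merging):

```
PUT /my-index/_settings
{
  "index.merge.policy.segments_per_tier": 20
}
```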
[x] Reduce the likelihood of refreshing due to hitting the maximum size of the translog The main cost of refreshing an index is that it may create small segments, which in turn increases the merging overhead. #93524
[ ] Flush bigger segments when hitting the indexing buffer memory threshold When we hit the limit on the amount of memory we're allowed to spend on the indexing buffer, we currently flush all pending segments at once, while we could flush just the larger ones, which would in turn also help reduce merging overhead. #34553
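The selection logic could look like this stdlib-only sketch (FlushPlanner and its inputs are hypothetical, not the actual IndexWriter flush policy): keep flushing the largest pending segments until the buffer is back under budget, so that what reaches disk is a few large segments rather than a burst of tiny ones.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of the selective-flush idea: instead of flushing every pending
// in-memory segment when the indexing buffer fills up, flush only the largest
// ones until usage drops below the threshold. Large flushed segments need
// fewer rounds of merging than many small ones.
class FlushPlanner {
    /** Returns the names of the segments to flush, largest first. */
    static List<String> segmentsToFlush(Map<String, Long> pendingBytes, long budgetBytes) {
        long used = pendingBytes.values().stream().mapToLong(Long::longValue).sum();
        List<Map.Entry<String, Long>> bySize = new ArrayList<>(pendingBytes.entrySet());
        bySize.sort(Comparator.comparingLong((Map.Entry<String, Long> e) -> e.getValue()).reversed());
        List<String> toFlush = new ArrayList<>();
        for (Map.Entry<String, Long> e : bySize) {
            if (used <= budgetBytes) break; // back under the threshold: stop flushing
            toFlush.add(e.getKey());
            used -= e.getValue();
        }
        return toFlush;
    }
}
```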