apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Increase the default number of merge threads. #13294

Closed jpountz closed 2 months ago

jpountz commented 2 months ago

You need as many merge threads as necessary to make sure that merges can keep up with indexing. But this number depends on the data that you are indexing: if you are only indexing stored fields, merges can copy compressed data directly and merges are only a small fraction of the total indexing+flushing+merging cost. But if you primarily index KNN vectors, merging N docs may require about as much work as flushing N docs. If you add the fact that documents typically go through multiple rounds of merging, the merging cost can end up being more than half of the total indexing+flushing+merging cost.
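To make the arithmetic above concrete, here is a rough sketch. The cost ratios and the number of merge rounds are made-up illustrative values, not measurements, and `MergeCostShare` is a hypothetical helper rather than anything in Lucene. It treats the cost of indexing and flushing a document once as 1 and computes the share of total work that goes to merging.

```java
/** Hypothetical back-of-the-envelope model; not part of Lucene. */
final class MergeCostShare {

    /**
     * @param relativeMergeCost assumed cost of merging a doc once, relative to
     *                          indexing+flushing it once (which counts as 1.0)
     * @param mergeRounds       assumed number of merges each doc goes through
     * @return fraction of total indexing+flushing+merging work spent merging
     */
    static double mergeShare(double relativeMergeCost, int mergeRounds) {
        double mergeWork = relativeMergeCost * mergeRounds;
        return mergeWork / (1.0 + mergeWork);
    }

    public static void main(String[] args) {
        // Assumed: stored-fields merges are cheap because compressed data is copied directly.
        System.out.printf("stored fields only: %.2f%n", mergeShare(0.05, 3));
        // Assumed: merging KNN vectors costs about as much as flushing them.
        System.out.printf("mostly KNN vectors: %.2f%n", mergeShare(1.0, 3));
    }
}
```

Under these assumptions, three rounds of merging take roughly 13% of the total work for stored fields and 75% for vectors, which is the "more than half" case.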

This change proposes to update the default number of merge threads assuming an intermediate scenario where merges perform about half of the total indexing+flushing+merging work, i.e. it gives half the threads of the system to merges.
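A minimal sketch of what "half the threads of the system" could look like. `MergeThreadDefaults` and `defaultMergeThreads` are hypothetical names used for illustration, and the floor of one thread is an assumption. This is not the code from the change itself.

```java
/** Hypothetical helper illustrating the proposed default; not Lucene's actual code. */
final class MergeThreadDefaults {

    /** Gives half of the available cores to merges, with an assumed floor of one thread. */
    static int defaultMergeThreads(int availableCores) {
        if (availableCores < 1) {
            throw new IllegalArgumentException("availableCores must be >= 1, got " + availableCores);
        }
        return Math.max(1, availableCores / 2);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(cores + " cores -> " + defaultMergeThreads(cores) + " merge threads");
    }
}
```

Today a custom value can be set explicitly with `ConcurrentMergeScheduler.setMaxMergesAndThreads(maxMergeCount, maxThreadCount)`, which is the step the nightly benchmarks would no longer need.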

One goal of this change is to no longer have to configure a custom number of merge threads on nightly benchmarks, which run on a highly concurrent machine.

mikemccand commented 2 months ago

> One goal of this change is to no longer have to configure a custom number of merge threads on nightly benchmarks, which run on a highly concurrent machine.

+1, that'd be great to revert to Lucene's defaults for the nightly benchy. Ideally the nightly benchy would be nearly 100% Lucene's defaults ...

mikemccand commented 2 months ago

Maybe in the future (after this change) we could think about a more adaptive approach that'd spin up additional merge threads if the merge cost/time is highish (many vectors, few stored fields).

jpountz commented 2 months ago

> Maybe in the future (after this change) we could think about a more adaptive approach that'd spin up additional merge threads if the merge cost/time is highish (many vectors, few stored fields).

I was wondering about the same. Maybe we could resurrect the auto-throttling mechanism, but at the CPU level rather than I/O. E.g. we could track the number of queued merges over some recent period of time, and dynamically increase the number of merge threads if there are queued merges consistently, or decrease it if the queue is always empty.
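The queue-based idea could be sketched roughly as follows. Everything here is hypothetical: the class name `AdaptiveMergeThreads`, the fixed-size sample window, and the one-thread-at-a-time steps are assumptions for illustration, not an existing Lucene mechanism.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical controller that adapts the merge thread count to recent queue depth. */
final class AdaptiveMergeThreads {
    private final int minThreads;
    private final int maxThreads;
    private final int windowSize;
    private final Deque<Integer> recentQueueSizes = new ArrayDeque<>();
    private int threads;

    AdaptiveMergeThreads(int minThreads, int maxThreads, int windowSize) {
        if (minThreads < 1 || maxThreads < minThreads || windowSize < 1) {
            throw new IllegalArgumentException("invalid bounds or window size");
        }
        this.minThreads = minThreads;
        this.maxThreads = maxThreads;
        this.windowSize = windowSize;
        this.threads = minThreads;
    }

    /** Records one sample of the number of queued merges and returns the updated thread count. */
    int onSample(int queuedMerges) {
        recentQueueSizes.addLast(queuedMerges);
        if (recentQueueSizes.size() > windowSize) {
            recentQueueSizes.removeFirst();
        }
        if (recentQueueSizes.size() < windowSize) {
            return threads; // not enough history yet
        }
        boolean alwaysQueued = recentQueueSizes.stream().allMatch(q -> q > 0);
        boolean alwaysEmpty = recentQueueSizes.stream().allMatch(q -> q == 0);
        if (alwaysQueued && threads < maxThreads) {
            threads++;
            recentQueueSizes.clear(); // start a fresh window after each adjustment
        } else if (alwaysEmpty && threads > minThreads) {
            threads--;
            recentQueueSizes.clear();
        }
        return threads;
    }
}
```

Clearing the window after each adjustment is one simple way to avoid changing the thread count on every sample; a real implementation would need to decide how often to sample and how to bound the maximum.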

rmuir commented 2 months ago

I feel like dynamic thread pools never work well in Java apps. I always have to set a simple static fixed thread pool everywhere, for anyone's Tomcat or Jetty or anything else that I find, to avoid memory problems.
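For reference, a static pool of the kind described here can be created with the JDK's `Executors.newFixedThreadPool`. The sketch below uses a hypothetical `FixedPoolSketch` class, not code from Lucene, Tomcat, or Jetty. It submits many tasks and counts how many distinct worker threads ran them, which never exceeds the configured size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo of a fixed-size pool: the worker count is bounded up front. */
final class FixedPoolSketch {

    /** Runs taskCount trivial tasks and returns how many distinct worker threads executed them. */
    static int distinctWorkers(int poolSize, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> workers = ConcurrentHashMap.newKeySet();
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                // Each task only records the name of the thread that ran it.
                tasks.add(() -> workers.add(Thread.currentThread().getName()));
            }
            pool.invokeAll(tasks); // blocks until every task has completed
            return workers.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("distinct workers: " + distinctWorkers(2, 50));
    }
}
```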