apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Improve memory footprint of SortingCodecReader [LUCENE-9539] #10579

Open asfimport opened 3 years ago

asfimport commented 3 years ago

SortingCodecReader is very memory-heavy because it needs to re-sort and load large parts of the index into memory. We can try to make it more efficient by using more compact internal data structures and by removing the caches it uses, provided we define its usage as a merge-only reader wrapper. Ultimately, we need to find a way to allow the reader or some other structure to minimize its heap memory. One way is to slice existing readers and merge them in multiple steps. There will be multiple steps towards a more usable version of this class.


Migrated from LUCENE-9539 by Simon Willnauer (@s1monw), updated May 22 2021 Pull requests: https://github.com/apache/lucene-solr/pull/1908, https://github.com/apache/lucene-solr/pull/1909

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit c82b99464dae6379380f214f250592a450bbe23b in lucene-solr's branch refs/heads/master from Simon Willnauer https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c82b994

LUCENE-9539: Use more compact datastructures for sorting doc-values (#1908)

This change cuts over from object based data-structures to primitive / compressed data-structures.
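
To illustrate the general idea behind this commit message (not the actual Lucene code), here is a minimal sketch of re-sorting per-document numeric values into a primitive `long[]` rather than a boxed `Long[]` or `List<Long>`. The class `CompactDocValuesSketch`, the method `reorder`, and the `newToOld` doc map are hypothetical names chosen for this example; the plain array is only a stand-in for whatever compact structure the change really uses.

```java
import java.util.Arrays;

final class CompactDocValuesSketch {

    // Hypothetical helper: newToOld[newDocId] is the doc ID in the unsorted segment.
    // Values are copied into a primitive array, so no per-value object is allocated.
    static long[] reorder(long[] valuesByOldDoc, int[] newToOld) {
        long[] sorted = new long[newToOld.length];
        for (int newDoc = 0; newDoc < newToOld.length; newDoc++) {
            sorted[newDoc] = valuesByOldDoc[newToOld[newDoc]];
        }
        return sorted;
    }

    public static void main(String[] args) {
        long[] values = {30L, 10L, 20L}; // indexed by old doc ID
        int[] newToOld = {1, 2, 0};      // assumed sort order for this example
        System.out.println(Arrays.toString(reorder(values, newToOld)));
    }
}
```

A primitive array stores eight bytes per value, whereas a boxed representation adds an object header and a reference for each entry, which is roughly where the savings in such a cut-over would come from.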

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 705faa3b2c02f350c928babc731685e1bfaf1027 in lucene-solr's branch refs/heads/branch_8x from Simon Willnauer https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=705faa3

LUCENE-9539: Use more compact datastructures for sorting doc-values (#1908)

This change cuts over from object based data-structures to primitive / compressed data-structures.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 17c285d61743da0c06735e06235b20bd5aac4e14 in lucene-solr's branch refs/heads/master from Simon Willnauer https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=17c285d

LUCENE-9539: Remove caches from SortingCodecReader (#1909)

SortingCodecReader keeps all docvalues in memory that are loaded from this reader. Yet, this reader should only be used for merging which happens sequentially. This makes caching docvalues unnecessary.
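
The reasoning above can be sketched as follows, under the assumption that a merge visits each field exactly once and in order. This is not the SortingCodecReader implementation; `SequentialMergeSketch`, `mergeOnce`, and the `loader` function are hypothetical names used only to show why a cache has nothing to serve in that access pattern.

```java
import java.util.List;
import java.util.function.Function;

final class SequentialMergeSketch {

    // Consumes each field exactly once, in order, the way a merge would.
    // Returns one sum per field; no field's values are retained after its turn.
    static long[] mergeOnce(List<String> fields, Function<String, long[]> loader) {
        long[] sums = new long[fields.size()];
        for (int i = 0; i < fields.size(); i++) {
            long[] values = loader.apply(fields.get(i)); // loaded on demand
            for (long v : values) {
                sums[i] += v;
            }
            // 'values' is unreachable after this iteration; nothing caches it
        }
        return sums;
    }

    public static void main(String[] args) {
        long[] sums = mergeOnce(
            List.of("price", "rating"),
            field -> field.equals("price") ? new long[] {5, 7} : new long[] {1, 2, 3});
        System.out.println(sums[0] + " " + sums[1]);
    }
}
```

Because every field is read once and then dropped, at most one field's values need to be on the heap at a time, and a cache keyed by field would only hold data that is never requested again.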

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 427e11c7f644a05be93bb801ca394b90dccf8df6 in lucene-solr's branch refs/heads/branch_8x from Simon Willnauer https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=427e11c

LUCENE-9539: Remove caches from SortingCodecReader (#1909)

SortingCodecReader keeps all docvalues in memory that are loaded from this reader. Yet, this reader should only be used for merging which happens sequentially. This makes caching docvalues unnecessary.

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 on splitting. I had another issue when trying to use this class: it was super slow (as in many times slower than fully reindexing) due to stored fields. Splitting could help there as well, since we could rewrite stored fields without any compression, like we do at flush time, and the splitting would help keep the amount of temporary disk space we use for uncompressed stored fields under control.