apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.61k stars 1.02k forks source link

Remove Accountable interface on KnnVectorsReader #13241

Closed jpountz closed 5 months ago

jpountz commented 6 months ago

Description

In Lucene 9.0, we removed the Accountable interface on our index readers and file formats, because heap usage had become so low that this metric had become less relevant, and also much less accurate as there were no longer a couple things that would account for a majority of the RAM usage, but rather many small objects that don't account for much memory on their own and that were too hard to estimate properly.

We still kept Accountable on KnnVectorsReader because it used to load lots of things in heap at the time. Since vectors have become much more reasonable since then, I would suggest now removing Accountable on KnnVectorsReader like we did for other file formats in 9.0.

benwtrent commented 6 months ago

I am not sure about this. We eagerly load the nodesByLevel on heap for every field. This means we effectively load the upper-levels of graph on heap. While upper levels scale down in size logarithmically, they can still be thousands of nodes.

See: https://github.com/apache/lucene/blob/d62aa201c81eca5ca7b8dce33d57bf0ef24ce132/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L378-L390

Pulkitg64 commented 6 months ago

That's a good call out @benwtrent. But I was wondering, how much size is considered as big size for tracking purpose? For example, let's say there are 1 million nodes in upper levels and if we are creating array of 1 million length, total memory will be 4 MB per million nodes (4 bytes * 1 MM). So is it fine or not fine to not track this size?

jpountz commented 6 months ago

It's true that these things can still use amounts of heap that are not completely negligible, but I wonder if it's a good reason to preserve this interface. Doc values too can use non-negligible amounts of heap if you have many fields that are multi-valued and/or use dictionary compression.

From a quick look, none of Elasticsearch, Solr, OpenSearch and luceneserver take advantage of this interface on KnnVectorsReader. (I searched for calls to CodecReader#getVectorReader, which are necessary to get an object that you can call #ramBytesUsed() on.)

jpountz commented 5 months ago

@benwtrent Do you have a strong opinion on keeping the Accountable interface? I remember keeping it in 9.0 because vectors were new and wanting to be careful, but I'm keen on cleaning things up now and aligning vectors with other file formats that don't implement this interface.

benwtrent commented 5 months ago

@jpountz I do not have a strong opinion :). Its cool to remove it.