INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Search cache should be aware of references between Searches. #203

Open jan-niestadt opened 3 years ago

jan-niestadt commented 3 years ago

Right now, in some (hopefully rare) scenarios, the same search could end up in memory twice, wasting memory and CPU. This is because the logic that decides to remove a search (SearchCacheEntry) from the cache is disconnected from the references between Searches.

For example, a SearchHitsWindow relies on a SearchHits. But in theory, the SearchHits could be removed from the cache while the SearchHitsWindow still holds a reference to its hits. If a new search then requires the same hits, a duplicate SearchHits will be created, because the previous one is no longer in the cache.

Ideally, as long as the results of a Search haven't been garbage collected, it should be possible to access them through the cache. Maybe look into weak references to achieve this, or alternatively keep track of references between Search objects manually (the problem here is that usually it's not the Search object itself that is referenced, but the resulting Hits object).
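
A minimal sketch of the weak-reference idea, just to make it concrete. ResultsCache and its method names are hypothetical, not BlsCache's actual API; the point is only that an evicted result stays retrievable as long as something else keeps it alive:

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: keep soft references to results we've evicted, so a result that is
// still reachable (e.g. because a window search holds on to it) can be reused
// instead of recomputed. Names are hypothetical, not BlackLab's actual API.
class ResultsCache<K, V> {
    private final Map<K, V> owned = new ConcurrentHashMap<>();                  // entries the cache still "owns"
    private final Map<K, SoftReference<V>> evicted = new ConcurrentHashMap<>(); // evicted, but possibly still alive

    public V get(K key) {
        V value = owned.get(key);
        if (value != null)
            return value;
        SoftReference<V> ref = evicted.get(key);
        if (ref != null) {
            value = ref.get();        // null if the GC already collected the result
            if (value == null)
                evicted.remove(key);  // clean up the dead reference
        }
        return value;
    }

    public void put(K key, V value) {
        owned.put(key, value);
        evicted.remove(key);
    }

    /** Evict from the "owned" map, but remember the result via a soft reference. */
    public void evict(K key) {
        V value = owned.remove(key);
        if (value != null)
            evicted.put(key, new SoftReference<>(value));
    }
}
```

(SoftReference rather than WeakReference here because soft references are only cleared when memory gets tight, which is roughly the behaviour a cache wants; either would give the "reuse it if it's still alive" effect.)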

This is probably not a huge issue, more of a 'smell' that the current implementation isn't ideal. For example, knowing more about references could make certain scheduling / cache cleanup decisions easier as well.

jan-niestadt commented 3 years ago

E.g. removing a SearchHits from the cache to free up memory is all well and good, but if the SearchWindow that's holding on to it is still in the cache, there's no point. Of course, BlsCache will notice that there's still not enough memory and keep removing searches until there is, so this should fix itself.
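
To illustrate that point: an eviction pass could be made reference-aware, skipping entries that another cached entry still holds on to, since evicting them frees nothing. This is only a sketch; Entry and dependsOn() are made-up names, not part of BlsCache:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of reference-aware eviction: before dropping an entry to free memory,
// check whether another cached entry still depends on it; if so, evicting it
// frees nothing (the dependent still holds the results) and only risks a
// duplicate search later. Entry and dependsOn() are hypothetical names.
class RefAwareEvictor {
    interface Entry<E> {
        Set<E> dependsOn();    // cache keys whose results this entry holds on to
    }

    /** Return the keys we may evict without a live dependent still pinning them. */
    static <E> List<E> evictable(Map<E, ? extends Entry<E>> cache) {
        Set<E> pinned = new HashSet<>();
        for (Entry<E> e : cache.values())
            pinned.addAll(e.dependsOn());   // anything referenced by a cached entry stays
        List<E> result = new ArrayList<>();
        for (E key : cache.keySet())
            if (!pinned.contains(key))
                result.add(key);
        return result;
    }
}
```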

jan-niestadt commented 2 years ago

Another related issue popped up when we tried to use different logic for aborting a results search vs. a count. We wanted to abort an "abandoned" count (one the frontend is no longer polling for) after just 30s, while searches could be allowed to run longer (a large grouping in particular can take over a minute, and users want to wait for it).

The problem is that aborting the count causes an exception once the page of results is available, because the (still running) count is included in the response. It's needed there because it determines how many pages of results there are.

The relationship between the search and the count should prevent the count from being aborted while the search still needs it. For now, we just use the same time limit for both types of searches.
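
For what it's worth, a rough sketch of an abort policy that knows about the search→count relationship. Everything here (CacheEntry, dependents(), the timeout values) is hypothetical and only illustrates exempting a count from the shorter "abandoned" timeout while a results search still includes it in its response:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Set;

// Sketch: a count gets a shorter "abandoned" timeout than a regular results search,
// but is never aborted while a results search that reports it is still live.
// CacheEntry, dependents() and the timeouts are hypothetical, not BlsCache's API.
class AbortPolicy {
    static final Duration COUNT_ABANDONED_AFTER = Duration.ofSeconds(30);
    static final Duration SEARCH_ABANDONED_AFTER = Duration.ofMinutes(5);

    interface CacheEntry {
        boolean isCount();
        Instant lastAccessed();          // last time a client polled this entry
        Set<CacheEntry> dependents();    // entries whose responses include this one
    }

    static boolean shouldAbort(CacheEntry entry, Instant now) {
        Duration idle = Duration.between(entry.lastAccessed(), now);
        if (!entry.isCount())
            return idle.compareTo(SEARCH_ABANDONED_AFTER) > 0;
        // Don't abort a count while a results search that includes it is still live.
        boolean hasLiveDependent = entry.dependents().stream()
                .anyMatch(d -> Duration.between(d.lastAccessed(), now)
                        .compareTo(SEARCH_ABANDONED_AFTER) <= 0);
        return !hasLiveDependent && idle.compareTo(COUNT_ABANDONED_AFTER) > 0;
    }
}
```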

jan-niestadt commented 1 year ago

See also #183 and #266 (and #278 and #279, which are about the alternative cache implementation).