Motivation
https://github.com/quickwit-oss/tantivy-py/issues/359#issuecomment-2428087169 reveals that one of our servers can only handle three opened indexes at once. The reason why remains unclear, but this PR at least reacts to the issue by caching opened indexes in the search.py module's scope. Now we can invoke await SearchIndex(index_name="normal-index").index as many times as desired within one Python process.
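For orientation, here is a minimal sketch of what module-scoped caching amounts to; the helper name, signature, and cache key below are hypothetical stand-ins, not the actual search.py code:

```python
from collections.abc import Callable

import tantivy

# Module (global) scope: this dict lives for the whole Python process, so
# repeated lookups for the same index name reuse one already-opened
# tantivy.Index instead of opening yet another handle.
_OPENED_INDEX_CACHE: dict[str, tantivy.Index] = {}


def get_or_open_index(index_name: str, opener: Callable[[], tantivy.Index]) -> tantivy.Index:
    """Return the cached opened index for index_name, opening it on first use."""
    if index_name not in _OPENED_INDEX_CACHE:
        _OPENED_INDEX_CACHE[index_name] = opener()
    return _OPENED_INDEX_CACHE[index_name]
```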
Implementation Details
Why a global scope? We want to accommodate caching the Index across:
Across deepcopy of a PaperQAEnvironment, whose tools contain a PaperSearch tool instance.
So we can't cache the Index in PaperSearch, or else we'd be making Index copies (and the side effects of that are unknown); see the sketch after this list
Across a Trajectory, where we (1) build the index and (2) use the index in 0+ paper searches
Across a TaskDataset, where we (1) build the index and (2) run 0+ envs for one trajectory each
Across many TaskDatasets, for larger experiments
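To make the deepcopy point concrete, here is a self-contained toy example (the class names are hypothetical, and FakeIndex stands in for an opened tantivy Index) showing why an instance attribute gets duplicated while a module-level cache stays shared:

```python
import copy


class FakeIndex:
    """Stand-in for an opened tantivy Index."""


_INDEX_CACHE: dict[str, FakeIndex] = {}  # module scope, shared by all copies


class ToolWithInstanceCache:
    """Caching on the tool instance: deepcopy duplicates the index."""

    def __init__(self) -> None:
        self.index = FakeIndex()


class ToolWithModuleCache:
    """Caching in module scope: all (copies of) tools share one index."""

    def get_index(self, name: str) -> FakeIndex:
        return _INDEX_CACHE.setdefault(name, FakeIndex())


tool = ToolWithInstanceCache()
assert copy.deepcopy(tool).index is not tool.index  # a second "opened" index

a = ToolWithModuleCache()
b = copy.deepcopy(a)
assert a.get_index("normal-index") is b.get_index("normal-index")  # shared
```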
This can only be accomplished using global scope, whose lifetime matches that of the entire Python process. This unfortunately requires callers to invoke the newly added reap_opened_index_cache at runtime if intermediate cleaning of the cache is desired.
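Continuing the earlier sketch, reaping can be as simple as emptying the module-scoped dict; the real reap_opened_index_cache may do more and its exact signature isn't shown here, so treat this as illustrative only:

```python
def reap_opened_index_cache() -> None:
    # Illustrative only: drop every cached opened index so the process stops
    # holding index handles. Callers can invoke this between experiments.
    _OPENED_INDEX_CACHE.clear()
```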
The caching added here can be disabled by setting the newly added environment variable PQA_INDEX_DONT_CACHE_INDEXES to 1 or true (case-insensitive).
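One plausible way for search.py to read this opt-out flag (the actual parsing may differ) is:

```python
import os


def index_caching_disabled() -> bool:
    # PQA_INDEX_DONT_CACHE_INDEXES set to "1" or "true" (any casing) opts out.
    return os.environ.get("PQA_INDEX_DONT_CACHE_INDEXES", "").strip().lower() in {"1", "true"}
```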
Risks
Race conditions if invoking reap_opened_index_cache while also using PaperSearch
Our test suite intentionally doesn't use reap_opened_index_cache to avoid this, but the trade-off is that our testing slowly accrues opened indexes across test cases
Since caching is enabled by default, clients who aren't aware of it may run into unexpected statefulness