Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

Caching opened `tantivy.Index`es in the package #627

Closed jamesbraza closed 1 month ago

jamesbraza commented 1 month ago

Motivation

https://github.com/quickwit-oss/tantivy-py/issues/359#issuecomment-2428087169 reveals that one of our servers can only handle three opened indexes at once. The reason remains unclear, but this PR at least reacts to the issue by caching opened indexes in the `search.py` module's scope. Now we can invoke `await SearchIndex(index_name="normal-index").index` as many times as desired within one Python process.

Implementation Details

Why a global scope? We want to accommodate caching the Index across:

This can only be accomplished using global scope, whose lifetime matches the entire Python process. Unfortunately, this requires callers to invoke the newly added `reap_opened_index_cache` at runtime if the cache needs to be cleared partway through the process.
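To make the trade-off concrete, here is a minimal sketch of a module-scope cache. It is not the code from this PR: `_OPENED_INDEXES`, `get_cached_index`, and the `opener` callable are hypothetical stand-ins, and the body of `reap_opened_index_cache` is an assumption about what such a function might do.

```python
import asyncio
from typing import Any, Callable

# Hypothetical module-level cache. Because it lives at module scope, its
# lifetime matches the Python process, not any one object or event loop.
_OPENED_INDEXES: dict[str, Any] = {}


async def get_cached_index(index_name: str, opener: Callable[[str], Any]) -> Any:
    """Return the cached index for index_name, opening it only on first use.

    `opener` stands in for whatever actually opens the index on disk.
    """
    # No await sits between the membership check and the assignment, so two
    # coroutines on the same event loop cannot both open the same index.
    if index_name not in _OPENED_INDEXES:
        _OPENED_INDEXES[index_name] = opener(index_name)
    return _OPENED_INDEXES[index_name]


def reap_opened_index_cache() -> None:
    """Drop every cached index (assumed behavior, named after the PR's function)."""
    _OPENED_INDEXES.clear()
```

The cost of this design is visible in the sketch: nothing ever evicts an entry unless the caller explicitly reaps the cache.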

The caching added here can be disabled by setting the newly added environment variable `PQA_INDEX_DONT_CACHE_INDEXES` to `1` or `true` (case insensitive).
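One way such a flag could be read is sketched below. Only the variable name and the accepted values come from the description above. The helper `caching_disabled` and its `environ` parameter are hypothetical, and treating every other value (or an unset variable) as "caching enabled" is an assumption.

```python
import os
from collections.abc import Mapping
from typing import Optional

ENV_VAR = "PQA_INDEX_DONT_CACHE_INDEXES"


def caching_disabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the env var is set to 1 or true, ignoring case.

    `environ` defaults to os.environ; passing a plain dict makes the check
    easy to exercise without mutating the real environment.
    """
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "").lower() in {"1", "true"}
```

For example, `caching_disabled({"PQA_INDEX_DONT_CACHE_INDEXES": "True"})` evaluates to `True`, while an empty mapping yields `False`.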

Risks