climatepolicyradar / policy-search


Bug 72 - issue with query response time #78

Closed kdutia closed 2 years ago

kdutia commented 2 years ago

The slow queries were caused by the aggregation step, which tried to roll up passages into each of the ~1.7M docs in the corpus. policy_id is a high-cardinality field (it has many unique values), so this operation is slow.

I've added a sampler to limit the number of docs per shard that are considered for aggregation.

In my tests this reduced average query time from ~8 seconds to <4 seconds. We can tune the sampler's per-shard limit further once the instance is deployed, if needed.
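For illustration, here is a minimal sketch of what a sampler-wrapped aggregation could look like, assuming an Elasticsearch/OpenSearch-style query DSL. Only policy_id comes from this PR: the function name, the "text" field, and all default values are hypothetical, and the actual query in the codebase may be structured differently.

```python
import json


def build_sampled_query(query_text, shard_size=500, max_docs=10, passages_per_doc=3):
    """Hypothetical helper: build a search body whose aggregation is wrapped in a sampler.

    The sampler restricts the nested aggregations to the top-scoring
    `shard_size` hits on each shard, so the terms aggregation on the
    high-cardinality policy_id field only sees a bounded number of passages.
    """
    return {
        "size": 0,  # only the aggregation results are needed, not raw hits
        "query": {"match": {"text": query_text}},  # "text" is an assumed field name
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "top_docs": {
                        # group the sampled passages by their parent document
                        "terms": {"field": "policy_id", "size": max_docs},
                        "aggs": {
                            # keep the best few passages for each document
                            "top_passages": {"top_hits": {"size": passages_per_doc}}
                        },
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_sampled_query("renewable energy"), indent=2))
```

Lowering shard_size trades result completeness for speed, which is the parameter that would be tuned after deployment under this sketch's assumptions.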

However, the sampler means that the passages per page (and thus the pages per document) are limited to the top few per document. As a result, when a filter, e.g. on geography, is applied after performing a search, the number of passages/pages returned for each document can increase, because the sampler is then acting on fewer documents. To minimise the negative UX impact of this for now, I've:

Updates

closes #72

kdutia commented 2 years ago

@chrisaballard updates made, thanks for waiting! See new section of PR description.