The issue was caused by the aggregation step trying to roll up passages to each of the ~1.7M docs in the corpus. `policy_id` is a high-cardinality field (it has many unique values), so this operation is slow.
I've added a sampler to limit the number of docs per shard that are considered for aggregation.
In my tests this reduced average query time from ~8 seconds to under 4 seconds. We can tweak this parameter further once the instance is deployed if needed.
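For reference, a sketch of what the sampler-wrapped aggregation body looks like, expressed as a Python dict. The index, query text, and `shard_size` value here are illustrative assumptions, not the deployed configuration; only the `policy_id` field comes from this change.

```python
# Hypothetical sketch of the aggregation with a per-shard sampler.
# "shard_size" caps how many top-scoring docs per shard feed the
# child terms aggregation, instead of rolling up all ~1.7M docs.
query_body = {
    "query": {"match": {"text": "climate adaptation"}},  # illustrative query
    "aggs": {
        "sampled": {
            # Only the top shard_size docs per shard are considered below
            "sampler": {"shard_size": 100},
            "aggs": {
                "by_document": {
                    "terms": {"field": "policy_id", "size": 10}
                }
            },
        }
    },
}
```

The trade-off is exactly the one described above: the child `terms` aggregation never sees documents the sampler drops, so its counts are approximate.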
However, the sampler means that passages (and thus pages per document) are filtered to the top few per document. As a result, when a filter, e.g. on geography, is applied after performing a search, the number of passages/pages returned for each document can increase, because the sampler is then acting on fewer documents. To minimise the negative UX impact of this for now, I've:
- reduced the maximum number of pages shown per document from 20 to 10
- raised #77 to update the UI to indicate this change
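To illustrate the behaviour described above: when the geography filter is included in the query, the per-shard sampler draws its top docs from the filtered set only, so more passages per matching document can survive sampling. The field names and values here are assumptions for illustration.

```python
# Hypothetical filtered search: the sampler now samples from a smaller
# candidate set, so per-document passage counts can grow relative to
# the unfiltered search.
filtered_body = {
    "query": {
        "bool": {
            "must": [{"match": {"text": "climate adaptation"}}],
            # Illustrative filter; the real geography field/value may differ
            "filter": [{"term": {"geography": "KEN"}}],
        }
    },
    "aggs": {
        "sampled": {
            "sampler": {"shard_size": 100},
            "aggs": {"by_document": {"terms": {"field": "policy_id"}}},
        }
    },
}
```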
## Updates
In line with guidance, I've also set a fixed `preference` string, which can lead to better use of the OpenSearch cache.
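A minimal sketch of the idea: a stable `preference` value routes identical requests to the same shard copies, so cached shard-level results can be reused. The helper and session-id scheme below are assumptions, not the actual implementation.

```python
# Hypothetical helper: build request params with a stable preference
# string so repeated identical searches hit the same shard copies.
def build_search_params(session_id: str) -> dict:
    # Any stable string works; a per-user/session id is one common choice.
    return {"preference": f"user-{session_id}"}

params = build_search_params("1234")
# params["preference"] is "user-1234"
```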
I've also added `size: 0` to the aggregation query, as we are only interested in the aggregation results.
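Concretely, `size: 0` tells OpenSearch not to return any hits, skipping the fetch phase and trimming the response to just the aggregations. A minimal sketch, with an illustrative aggregation:

```python
# With "size": 0, no search hits are fetched or returned; the response
# contains only the aggregation results.
agg_only_body = {
    "size": 0,
    "query": {"match_all": {}},  # illustrative; the real query is scored
    "aggs": {
        "by_policy": {"terms": {"field": "policy_id", "size": 10}}
    },
}
```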
closes #72