albertisfu closed this issue 2 months ago.
Here are the assets used for:

- Profiles: profile_all_child_all_hl_plain_999_limit.json, profile_all_child_hl_except_plain.json, profile_all_child_hl_only_plain.json, profile_all_child_no_hl.json, profile_all_child_no_hl_no_exclude_plain.json, profile_no_child.json, profile_normal_all_child_all_hl.json
- Text query: track_all_child_all_hl_plain_999_limit.json, track_all_child_hl_only_plain.json, track_all_child_no_hl.json, track_no_child.json, track_normal_all_child_all_hl.json, track_all_child_hl_except_plain.json
- Number of child documents (inner_hits) displayed: 1_child_all_hl.json, 2_child_all_hl.json, 3_child_all_hl.json, 4_child_all_hl.json, 5_child_all_hl.json, 6_child_all_hl.json
- Type of query: track_exact_text_query.json, track_filter_case_name_text_query.json, track_filter_case_name.json, track_match_all.json, track_normal_text_query.json
This looks great, thanks for researching this. Elastic sure makes this easier than anything I ever saw with Solr.
I'm not sure what the best next steps would be between this analysis, the opinions DB being almost ready (and easier for testing), and the need to try highlighters with different indexes. What do you suggest?
Yeah, if we are going to try the highlighters with different indexes, I can begin by assessing other highlighters and options using the RECAP index I used in this research and measuring their performance too. This will give us a good starting point for testing the highlighters in production, at least in terms of performance. When you're back, it would be a good idea to execute at least one benchmark in production. That will give us a reference for how the cluster performs with the current highlighters; then, when we switch highlighters, we can evaluate the performance again.
Following up on this issue, I've tested other highlighters on my RECAP testing indexes.
As a reference, I used the HL combinations described in this post that Eduardo shared with me.
These are my findings.
The testing index contained 2,329 cases and 17,672 docket entries taken from the dev database.
Additionally, I added a couple of large documents taken from this case, where some of the `plain_text` content runs to more than 2,000 pages. This lets us assess HL on those documents and determine which HL works well on `plain_text` with more than 1,000,000 characters, which is the current limit we have when using the plain HL.
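A note on that limit: it appears to correspond to Elasticsearch's `index.highlight.max_analyzed_offset` index setting (default 1,000,000), which caps how many characters the plain and unified (no offsets) highlighters will analyze per field. A minimal sketch of the settings body that controls it (the index name and the value are illustrative):

```python
# Sketch: build the settings body for index.highlight.max_analyzed_offset,
# the per-index cap on characters analyzed by plain/unified highlighting.
# Applying it would be e.g. PUT /recap/_settings with this body (the index
# name "recap" is illustrative).
def highlight_offset_settings(limit: int) -> dict:
    return {"index": {"highlight": {"max_analyzed_offset": limit}}}

settings_body = highlight_offset_settings(999_999)
```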
The HL fields are:
"short_description"
"short_description.exact"
"description"
"description.exact"
"document_type"
"document_type.exact"
"document_number"
"attachment_number"
"plain_text"
"plain_text.exact"
Up to 6 child documents were rendered per case in the results.
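For reference, the highlight clause covering those fields would look roughly like this (a sketch; the highlighter type and any fragment options are parameters of the experiment, not fixed values):

```python
# Sketch: build the "highlight" section of a search body for the HL fields
# listed above, applying one highlighter type to all of them.
HL_FIELDS = [
    "short_description", "short_description.exact",
    "description", "description.exact",
    "document_type", "document_type.exact",
    "document_number", "attachment_number",
    "plain_text", "plain_text.exact",
]

def build_highlight(highlighter_type: str) -> dict:
    return {"fields": {name: {"type": highlighter_type} for name in HL_FIELDS}}

highlight_clause = build_highlight("plain")
```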
| Highlighter | Offset strategy | Index size | Mean throughput (ops/sec) | HL limit |
|---|---|---|---|---|
| Plain | None | 28.5mb | 7.48 | 999_999 chars |
| Unified | None | 28.5mb | 5.82 | 999_999 chars |
| Unified | Postings | 37.3mb | 7.37 | No limit |
| Unified | Term vectors | 63.5mb | 5.94 | No limit |
| FVH | Term vectors | 63.5mb | 18.45 | No limit |
From these results, we can observe that, using this testing index:

- The FVH is by far the fastest highlighter, at more than double the throughput of the alternatives.
- Adopting it would require adjusting the `merge_highlights_into_result` method, since this HL returns whole phrases highlighted instead of independent terms.

This analysis can serve as a starting point for performing similar benchmarking in production so we can choose the best HL for the different indexes.
The good news is that the index changes required for the unified and FVH HL can be applied by creating a new index and using the `_reindex` API, which will copy the whole index and create the required term vectors or offset lists during the process.
So the process to perform these benchmarks in prod can be the following:

Apply the required changes to the RECAP index mapping:

- `index_options="offsets"` on each text field (for the unified HL with postings), or
- `term_vector="with_positions_offsets"` on each text field (for the FVH).

Then change the RECAP index name in `es_indices`, e.g.:

```python
recap_index = Index("recap_postings")
```
Create the new index.
```shell
docker exec -it cl-django python /opt/courtlistener/manage.py search_index --create --models search.Docket
```
These changes can be done in a maintenance pod.
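The two mapping variants above can be sketched as plain mapping dicts (in the codebase the same keyword arguments go on the text fields in documents.py; the strategy labels here are illustrative):

```python
# Sketch: text-field mapping for each offset strategy benchmarked above.
def text_field_mapping(strategy: str) -> dict:
    mapping = {"type": "text"}
    if strategy == "postings":
        # Unified HL with postings: record offsets in the postings lists.
        mapping["index_options"] = "offsets"
    elif strategy == "term_vectors":
        # FVH: store full term vectors with positions and offsets.
        mapping["term_vector"] = "with_positions_offsets"
    return mapping

fvh_field = text_field_mapping("term_vectors")
postings_field = text_field_mapping("postings")
```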
Then we'll need to copy the current index content to the new index using the `_reindex` API:
```
POST /_reindex
{
  "max_docs": 10000000,
  "source": { "index": "recap" },
  "dest": { "index": "recap_postings" }
}
```

(`max_docs` is the number of documents we want to copy.)
When the re-indexing is finished, we can run the benchmark using this new index:
```shell
esrally race --pipeline=benchmark-only --track-path=tracks/recap --target-hosts=localhost:9200 --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'password'"
```
The tracks I used for these benchmarks are: track_plain.json track_fhv_vector.json track_unified_vectors.json track_unified_postings.json track_unified_no_offsets_limited.json
Sounds great. Let's get those index/code changes in place for FVH, and then give that a go in prod. Thank you for the detailed research.
Per the instructions in #3550, I created a new index with 2.7M items, then ran esrally using a slightly tweaked version of track_fhv_vector.json from above (it needed the index name changed):
```
root@maintenance-ml:/opt/courtlistener# esrally race --pipeline=benchmark-only --track-path=track.json --target-hosts=$ELASTICSEARCH_DSL_HOST --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'xxx'"

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [8193eef4-6da4-4f46-8f29-9a65a9501bf9]
[INFO] Racing on track [track] and car ['external'] with version [8.8.1].
[WARNING] merges_total_time is 44831354 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 7275732 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 115689703 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 16727398 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 24023556 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running cluster-health [100% done]
Running search_recap_1 [100% done]
```
```
------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
```
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|---------------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 1928.24 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 13.451 | min |
| Max cumulative indexing time across primary shards | | 132.828 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 748.243 | min |
| Cumulative merge count of primary shards | | 124232 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 7.23336 | min |
| Max cumulative merge time across primary shards | | 72.9313 | min |
| Cumulative merge throttle time of primary shards | | 121.689 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.308858 | min |
| Max cumulative merge throttle time across primary shards | | 32.9825 | min |
| Cumulative refresh time of primary shards | | 279.09 | min |
| Cumulative refresh count of primary shards | | 1.26767e+06 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.329433 | min |
| Max cumulative refresh time across primary shards | | 13.5236 | min |
| Cumulative flush time of primary shards | | 400.4 | min |
| Cumulative flush count of primary shards | | 274151 | |
| Min cumulative flush time across primary shards | | 8.33333e-05 | min |
| Median cumulative flush time across primary shards | | 4.46041 | min |
| Max cumulative flush time across primary shards | | 13.7263 | min |
| Total Young Gen GC time | | 1.189 | s |
| Total Young Gen GC count | | 29 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 776.153 | GB |
| Translog size | | 0.145759 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 1573 | |
| Total Ingest Pipeline count | | 0 | |
| Total Ingest Pipeline time | | 0 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | search_recap_1 | 13.69 | ops/s |
| Mean Throughput | search_recap_1 | 15.77 | ops/s |
| Median Throughput | search_recap_1 | 15.83 | ops/s |
| Max Throughput | search_recap_1 | 17.52 | ops/s |
| 50th percentile latency | search_recap_1 | 186.781 | ms |
| 90th percentile latency | search_recap_1 | 245.813 | ms |
| 99th percentile latency | search_recap_1 | 295.616 | ms |
| 100th percentile latency | search_recap_1 | 318.035 | ms |
| 50th percentile service time | search_recap_1 | 186.781 | ms |
| 90th percentile service time | search_recap_1 | 245.813 | ms |
| 99th percentile service time | search_recap_1 | 295.616 | ms |
| 100th percentile service time | search_recap_1 | 318.035 | ms |
| error rate | search_recap_1 | 0 | % |
```
--------------------------------
[INFO] SUCCESS (took 77 seconds)
--------------------------------
```
And here's the result with track_plain.json (with tweaks to clients so they align with the FVH run):
```
------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
```
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|---------------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 1929.77 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 13.2665 | min |
| Max cumulative indexing time across primary shards | | 132.9 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 748.655 | min |
| Cumulative merge count of primary shards | | 124397 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 7.14795 | min |
| Max cumulative merge time across primary shards | | 72.9445 | min |
| Cumulative merge throttle time of primary shards | | 121.689 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.298867 | min |
| Max cumulative merge throttle time across primary shards | | 32.9825 | min |
| Cumulative refresh time of primary shards | | 279.378 | min |
| Cumulative refresh count of primary shards | | 1.26933e+06 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.32225 | min |
| Max cumulative refresh time across primary shards | | 13.529 | min |
| Cumulative flush time of primary shards | | 400.702 | min |
| Cumulative flush count of primary shards | | 274476 | |
| Min cumulative flush time across primary shards | | 8.33333e-05 | min |
| Median cumulative flush time across primary shards | | 4.06453 | min |
| Max cumulative flush time across primary shards | | 13.7384 | min |
| Total Young Gen GC time | | 1.487 | s |
| Total Young Gen GC count | | 36 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 774.318 | GB |
| Translog size | | 0.00284654 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 1570 | |
| Total Ingest Pipeline count | | 0 | |
| Total Ingest Pipeline time | | 0 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | search_recap_1 | 24.15 | ops/s |
| Mean Throughput | search_recap_1 | 24.32 | ops/s |
| Median Throughput | search_recap_1 | 24.33 | ops/s |
| Max Throughput | search_recap_1 | 24.43 | ops/s |
| 50th percentile latency | search_recap_1 | 197.64 | ms |
| 90th percentile latency | search_recap_1 | 245.809 | ms |
| 99th percentile latency | search_recap_1 | 285.975 | ms |
| 100th percentile latency | search_recap_1 | 349.173 | ms |
| 50th percentile service time | search_recap_1 | 197.64 | ms |
| 90th percentile service time | search_recap_1 | 245.809 | ms |
| 99th percentile service time | search_recap_1 | 285.975 | ms |
| 100th percentile service time | search_recap_1 | 349.173 | ms |
| error rate | search_recap_1 | 0 | % |
```
--------------------------------
[INFO] SUCCESS (took 72 seconds)
--------------------------------
```
Just to check our work, we're making a new index with plain highlighters. I deleted the term_vector parameters from documents.py, created a new index named `recap_plain`, and I'm copying the vector index into it:
```
POST /_reindex?wait_for_completion=false
{
  "source": { "index": "recap_vectors" },
  "dest": { "index": "recap_plain" }
}
```
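With `wait_for_completion=false`, Elasticsearch returns a task id right away, and progress can then be checked via the tasks API (`GET /_tasks/<task_id>`) until `completed` is true. A sketch of that polling loop, with the HTTP call stubbed out rather than hitting a real cluster:

```python
import time

# Sketch: poll an async _reindex task until it completes. `fetch_task` stands
# in for GET /_tasks/<task_id> against the cluster; here it is a stub.
def wait_for_reindex(task_id, fetch_task, poll_seconds=0.0, max_polls=100):
    """Return the final task document once `completed` is true."""
    for _ in range(max_polls):
        task = fetch_task(task_id)
        if task.get("completed"):
            return task
        time.sleep(poll_seconds)
    raise TimeoutError(f"reindex task {task_id} still running")

# Stubbed responses simulating two in-flight polls, then completion.
_responses = iter([
    {"completed": False},
    {"completed": False},
    {"completed": True, "response": {"total": 2_700_000}},
])
result = wait_for_reindex("abc:123", lambda task_id: next(_responses))
```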
When that completed, I re-ran the esrally against it:
```
    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [c7c50798-b8a6-4893-890d-bf6a9af3cdd7]
[INFO] Racing on track [track_plain] and car ['external'] with version [8.8.1].
[WARNING] merges_total_time is 45726362 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 7382861 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 121952200 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 16857283 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 26196947 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running cluster-health [100% done]
Running search_recap_1 [100% done]
```
```
------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
```
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|---------------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 2032.55 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 4.86531 | min |
| Max cumulative indexing time across primary shards | | 133.096 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 762.331 | min |
| Cumulative merge count of primary shards | | 124806 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 3.02768 | min |
| Max cumulative merge time across primary shards | | 72.9464 | min |
| Cumulative merge throttle time of primary shards | | 123.07 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.0872083 | min |
| Max cumulative merge throttle time across primary shards | | 32.9825 | min |
| Cumulative refresh time of primary shards | | 281.619 | min |
| Cumulative refresh count of primary shards | | 1.27326e+06 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.133883 | min |
| Max cumulative refresh time across primary shards | | 13.5441 | min |
| Cumulative flush time of primary shards | | 436.642 | min |
| Cumulative flush count of primary shards | | 275797 | |
| Min cumulative flush time across primary shards | | 8.33333e-05 | min |
| Median cumulative flush time across primary shards | | 1.976 | min |
| Max cumulative flush time across primary shards | | 13.7568 | min |
| Total Young Gen GC time | | 1.57 | s |
| Total Young Gen GC count | | 37 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 786.204 | GB |
| Translog size | | 0.202693 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 2058 | |
| Total Ingest Pipeline count | | 0 | |
| Total Ingest Pipeline time | | 0 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | search_recap_1 | 8.1 | ops/s |
| Mean Throughput | search_recap_1 | 10.56 | ops/s |
| Median Throughput | search_recap_1 | 10.67 | ops/s |
| Max Throughput | search_recap_1 | 12.49 | ops/s |
| 50th percentile latency | search_recap_1 | 183.585 | ms |
| 90th percentile latency | search_recap_1 | 233.26 | ms |
| 99th percentile latency | search_recap_1 | 280.198 | ms |
| 100th percentile latency | search_recap_1 | 347.43 | ms |
| 50th percentile service time | search_recap_1 | 183.585 | ms |
| 90th percentile service time | search_recap_1 | 233.26 | ms |
| 99th percentile service time | search_recap_1 | 280.198 | ms |
| 100th percentile service time | search_recap_1 | 347.43 | ms |
| error rate | search_recap_1 | 0 | % |
```
--------------------------------
[INFO] SUCCESS (took 92 seconds)
--------------------------------
```
And to compare disk sizes, we have, vectors:

```
index          store.size
recap_vectors  24.1gb
```

Plain:

```
index        store.size
recap_plain  11.8gb
```
The switch to actually using this highlighter seems to be helping, but performance still isn't great. (Though it might get better as the index warms, I'm not sure.)
Alberto also says he has a trick for pagination that might help, so we should attempt that if we can.
Pagination didn't help, but I had a thought tonight while researching something in RECAP. We currently do 20 cases per page, each with five sub-documents, for a total of 100 results/page. If we halved that, would it help with performance? We'd still have very long pages, with 50 items per page?
Yeah, in terms of ES performance, the number of child documents directly impacts it.
Here's one of the graphs posted previously, comparing the retrieval of a varying number of child documents in a query.
This benchmark was performed when using `plain_text` highlighting. It could change now that we use FVH. However, we can see that reducing the number of child documents to 3 or 4 can increase the throughput to almost twice that of having 5 child documents.
Additionally, we could consider some kind of improvement to how the page is transported and rendered.
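For context, the number of child documents retrieved per case is just the `size` option on `inner_hits` of the `has_child` clause, so it can be tuned without other query changes. A sketch (the child type name and match field are illustrative, not the exact production query):

```python
# Sketch: a has_child query returning at most `n_children` child documents
# per case via the inner_hits "size" option. The match clause is illustrative.
def case_query_with_children(text: str, n_children: int) -> dict:
    return {
        "query": {
            "has_child": {
                "type": "recap_document",
                "query": {"match": {"plain_text": text}},
                "inner_hits": {"size": n_children},
            }
        }
    }

q = case_query_with_children("motion to dismiss", 3)
```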
Sorry, I was unclear. I meant halving the number of parent documents. I assume that'd help too?
Got it! I didn't run a test varying the number of parent documents, so I'll do that to measure how much improvement we can get by halving them.
OK, I did the benchmark and here are the results.
For the query_string `filed`, which retrieved a full page of parent documents, each with 5 children.
20 Results per page:
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap_size_20 | 40.37 | ops/s |
| Mean Throughput | search_recap_size_20 | 40.96 | ops/s |
| Median Throughput | search_recap_size_20 | 40.81 | ops/s |
| Max Throughput | search_recap_size_20 | 41.81 | ops/s |
| 50th percentile latency | search_recap_size_20 | 58.9689 | ms |
| 90th percentile latency | search_recap_size_20 | 75.6003 | ms |
| 99th percentile latency | search_recap_size_20 | 97.0243 | ms |
| 100th percentile latency | search_recap_size_20 | 118.104 | ms |
| 50th percentile service time | search_recap_size_20 | 58.9689 | ms |
| 90th percentile service time | search_recap_size_20 | 75.6003 | ms |
| 99th percentile service time | search_recap_size_20 | 97.0243 | ms |
| 100th percentile service time | search_recap_size_20 | 118.104 | ms |
| error rate | search_recap_size_20 | 0 | % |
10 Results per page:
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap_size_10 | 77.78 | ops/s |
| Mean Throughput | search_recap_size_10 | 79.22 | ops/s |
| Median Throughput | search_recap_size_10 | 79.41 | ops/s |
| Max Throughput | search_recap_size_10 | 80.26 | ops/s |
| 50th percentile latency | search_recap_size_10 | 31.0351 | ms |
| 90th percentile latency | search_recap_size_10 | 38.7568 | ms |
| 99th percentile latency | search_recap_size_10 | 62.0943 | ms |
| 100th percentile latency | search_recap_size_10 | 93.46 | ms |
| 50th percentile service time | search_recap_size_10 | 31.0351 | ms |
| 90th percentile service time | search_recap_size_10 | 38.7568 | ms |
| 99th percentile service time | search_recap_size_10 | 62.0943 | ms |
| 100th percentile service time | search_recap_size_10 | 93.46 | ms |
| error rate | search_recap_size_10 | 0 | % |
track_10_results.json track_20_results.json
So, according to these results, halving the parent results roughly doubles the search speed. The final numbers may vary depending on the query and filters applied, but there does seem to be a real performance improvement from halving the results.
Awesome. Let's do it.
https://github.com/freelawproject/courtlistener/issues/3819 has another performance enhancement we can do for queries ordered by date filed.
This article describes some techniques to improve search performance in ES.
I've been testing and measuring suggestions related to index changes that could be applied in our case. This way, if we need to modify something in the index, we can do so before starting the new RECAP re-index.
This suggestion implies that if our query uses a `query_string` over multiple fields, it's better to merge the content of those fields into a single field at index time and use only that field for searching, while preserving the independent fields solely for display purposes.
In ES, the `copy_to` parameter allows us to automate copying the content from multiple fields into a single field specifically designated for searching.
Accordingly, I added a new field called `all_fields` to the `DocketBaseDocket` and copied all the fields used for querying into it:
"case_name_full",
"suitNature",
"juryDemand",
"cause",
"assignedTo",
"referredTo",
"court",
"court_id",
"court_citation_string",
"chapter",
"trustee_str",
"short_description",
"plain_text",
"document_type"
Then I did a `_reindex` from my testing base index.
This is the index with the original size:

```
recap_index_vector 52.6mb 52.6mb
```

This is the index after copying the multi-fields into the new `all_fields` search field:

```
recap_vectors_all 75.2mb 75.2mb
```
Then, I performed multiple benchmarks and query profiling to determine whether this provides a boost in performance or not.
Current approach, query string on multiple fields and Highlighting enabled.
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap current | 16.56 | ops/s |
| Mean Throughput | search_recap current | 16.67 | ops/s |
| Median Throughput | search_recap current | 16.67 | ops/s |
| Max Throughput | search_recap current | 16.76 | ops/s |
| 50th percentile latency | search_recap current | 169.736 | ms |
| 90th percentile latency | search_recap current | 176.354 | ms |
| 99th percentile latency | search_recap current | 183.92 | ms |
| 99.9th percentile latency | search_recap current | 191.892 | ms |
| 100th percentile latency | search_recap current | 193.15 | ms |
| 50th percentile service time | search_recap current | 169.736 | ms |
| 90th percentile service time | search_recap current | 176.354 | ms |
| 99th percentile service time | search_recap current | 183.92 | ms |
| 99.9th percentile service time | search_recap current | 191.892 | ms |
| 100th percentile service time | search_recap current | 193.15 | ms |
| error rate | search_recap current | 0 | % |
copy_to approach, searching over the new field `all_fields`, HL enabled.
In this approach, the query searches on `all_fields`, but we still apply highlighting to fields other than the one used for searching. This is possible by setting `require_field_match` to `False`. So, if a term is found in `all_fields`, that term is also highlighted in the other fields where it appears. This is useful for preserving our highlighting behavior across different fields.
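A sketch of this query shape: search only on `all_fields`, but ask for highlights on the display fields, with `require_field_match` disabled so matches from `all_fields` still light them up (the highlighted field names here are a small illustrative subset):

```python
# Sketch: query on all_fields with highlighting on the display fields.
# require_field_match=False lets fields that were not queried be highlighted.
def all_fields_search(text: str) -> dict:
    return {
        "query": {"query_string": {"query": text, "fields": ["all_fields"]}},
        "highlight": {
            "require_field_match": False,
            "fields": {"short_description": {}, "plain_text": {}},
        },
    }

body = all_fields_search("filed")
```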
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap all fields Normal | 7.09 | ops/s |
| Mean Throughput | search_recap all fields Normal | 7.17 | ops/s |
| Median Throughput | search_recap all fields Normal | 7.17 | ops/s |
| Max Throughput | search_recap all fields Normal | 7.22 | ops/s |
| 50th percentile latency | search_recap all fields Normal | 392.368 | ms |
| 90th percentile latency | search_recap all fields Normal | 458.192 | ms |
| 99th percentile latency | search_recap all fields Normal | 494.602 | ms |
| 100th percentile latency | search_recap all fields Normal | 516.665 | ms |
| 50th percentile service time | search_recap all fields Normal | 392.368 | ms |
| 90th percentile service time | search_recap all fields Normal | 458.192 | ms |
| 99th percentile service time | search_recap all fields Normal | 494.602 | ms |
| 100th percentile service time | search_recap all fields Normal | 516.665 | ms |
| error rate | search_recap all fields Normal | 0 | % |
We can see that the throughput has not improved; in fact, it's worse.
So I took other measurements to identify where the problem was in this approach, which should have been faster. I suspected the issue was highlighting with `require_field_match = False`, and ran the following benchmarks.
copy_to approach, searching over the new field `all_fields`, HL completely disabled.
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap all fields no HL | 72.35 | ops/s |
| Mean Throughput | search_recap all fields no HL | 72.96 | ops/s |
| Median Throughput | search_recap all fields no HL | 73.14 | ops/s |
| Max Throughput | search_recap all fields no HL | 73.22 | ops/s |
| 50th percentile latency | search_recap all fields no HL | 37.6347 | ms |
| 90th percentile latency | search_recap all fields no HL | 44.4349 | ms |
| 99th percentile latency | search_recap all fields no HL | 58.9945 | ms |
| 100th percentile latency | search_recap all fields no HL | 62.8425 | ms |
| 50th percentile service time | search_recap all fields no HL | 37.6347 | ms |
| 90th percentile service time | search_recap all fields no HL | 44.4349 | ms |
| 99th percentile service time | search_recap all fields no HL | 58.9945 | ms |
| 100th percentile service time | search_recap all fields no HL | 62.8425 | ms |
| error rate | search_recap all fields no HL | 0 | % |
Current approach, query string on multiple fields and Highlighting completely disabled.
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap_int_doc multi no hl | 67.82 | ops/s |
| Mean Throughput | search_recap_int_doc multi no hl | 68.27 | ops/s |
| Median Throughput | search_recap_int_doc multi no hl | 68.25 | ops/s |
| Max Throughput | search_recap_int_doc multi no hl | 68.6 | ops/s |
| 50th percentile latency | search_recap_int_doc multi no hl | 39.3914 | ms |
| 90th percentile latency | search_recap_int_doc multi no hl | 49.0279 | ms |
| 99th percentile latency | search_recap_int_doc multi no hl | 72.2078 | ms |
| 100th percentile latency | search_recap_int_doc multi no hl | 106.778 | ms |
| 50th percentile service time | search_recap_int_doc multi no hl | 39.3914 | ms |
| 90th percentile service time | search_recap_int_doc multi no hl | 49.0279 | ms |
| 99th percentile service time | search_recap_int_doc multi no hl | 72.2078 | ms |
| 100th percentile service time | search_recap_int_doc multi no hl | 106.778 | ms |
| error rate | search_recap_int_doc multi no hl | 0 | % |
We can see a slight improvement from the `all_fields` approach when highlighting is disabled.
I then performed other benchmarks to confirm the theory.
copy_to approach, searching over the new field `all_fields`, HL only over `all_fields`.
In this approach, `all_fields` is used for search and also for HL, with HL disabled for all the other fields.
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap all fields HL | 16.89 | ops/s |
| Mean Throughput | search_recap all fields HL | 17.02 | ops/s |
| Median Throughput | search_recap all fields HL | 17.04 | ops/s |
| Max Throughput | search_recap all fields HL | 17.09 | ops/s |
| 50th percentile latency | search_recap all fields HL | 170.136 | ms |
| 90th percentile latency | search_recap all fields HL | 181.265 | ms |
| 99th percentile latency | search_recap all fields HL | 232.612 | ms |
| 100th percentile latency | search_recap all fields HL | 245.52 | ms |
| 50th percentile service time | search_recap all fields HL | 170.136 | ms |
| 90th percentile service time | search_recap all fields HL | 181.265 | ms |
| 99th percentile service time | search_recap all fields HL | 232.612 | ms |
| 100th percentile service time | search_recap all fields HL | 245.52 | ms |
| error rate | search_recap all fields HL | 0 | % |
We can see that the throughput is slightly better (though not statistically significant) than the current approach, which involves searching over multiple fields and applying highlighting over multiple fields as well. However, this comes with the disadvantage of losing HL for independent fields.
The issue of `all_fields` with HL is illustrated in the following query profiles:
Current approach, query string on multiple fields and Highlighting enabled.
| Phase | Time (ms) |
|---|---|
| BooleanQuery | 0.558718 |
| rewrite_time | 0.633584 |
| collector | 0.016288 |
| Children: FetchSourcePhase | 0.018292 |
| Children: HighlightPhase | 34.038625 |
| Children: InnerHitsPhase | 121.066542 |
| Children: StoredFieldsPhase | 0.009167 |
| Total | 156.341216 |
copy_to approach, searching over the new field `all_fields`, HL enabled.

| Phase | Time (ms) |
|---|---|
| BooleanQuery | 0.445343 |
| rewrite_time | 0.525458 |
| collector | 0.013835 |
| Children: FetchSourcePhase | 0.510458 |
| Children: HighlightPhase | 20.785165 |
| Children: InnerHitsPhase | 314.997749 |
| Children: StoredFieldsPhase | 0.014709 |
| Total | 337.292717 |
The `BooleanQuery` time is slightly better in the `all_fields` approach; however, we can see in the InnerHitsPhase (which includes the highlighting of the most expensive field, `plain_text`) how applying HL to all the other fields using `require_field_match = False` is costly.
In summary, using a single field for searching seems to offer a slight performance improvement over the multi-field approach. However, in our use case, the query phase that consumes the most time is highlighting (specifically, `plain_text` HL), and highlighting fields other than the one matched is expensive, resulting in worse overall performance. As an alternative, we could consider applying HL only to the `all_fields` field. However, as shown above, the performance improvement is not significantly better than the current approach, so it is not worth losing HL on all the other fields.
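As a reference, here is a minimal sketch of how a `copy_to` mapping like the one tested could be defined. Field names and types are illustrative, not the exact RECAP mapping:

```python
# Sketch of a copy_to mapping: every searchable text field is also copied
# into a combined all_fields field, so a query can target a single field.
# Field names are illustrative, not the exact RECAP mapping.
def build_copy_to_mapping(fields):
    """Return an index mapping where each text field copies into all_fields."""
    properties = {"all_fields": {"type": "text"}}
    for field in fields:
        properties[field] = {"type": "text", "copy_to": ["all_fields"]}
    return {"mappings": {"properties": properties}}

mapping = build_copy_to_mapping(["caseName", "docketNumber", "plain_text"])
```

The tradeoff described above still applies: matches are found in `all_fields`, so highlighting the original fields requires `require_field_match: false`.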
I also tested this suggestion. It says that `keyword` fields perform better when used in a `term` filter, while `int` and `long` fields are optimized for range queries. Therefore, if we have `int` and `long` fields that are not used in range queries, they should be indexed as keywords instead.
To evaluate this, I re-indexed some `int` fields as `keyword` (`document_number` and `attachment_number`) and conducted a benchmark. The test query only used a `term` filter on `document_number`.
`document_number` indexed as `int`:
| Min Throughput | search_recap_int_doc | 265.34 | ops/s |
| Mean Throughput | search_recap_int_doc | 265.34 | ops/s |
| Median Throughput | search_recap_int_doc | 265.34 | ops/s |
| Max Throughput | search_recap_int_doc | 265.34 | ops/s |
| 50th percentile latency | search_recap_int_doc | 6.85777 | ms |
| 90th percentile latency | search_recap_int_doc | 11.1501 | ms |
| 99th percentile latency | search_recap_int_doc | 18.3515 | ms |
| 100th percentile latency | search_recap_int_doc | 33.2133 | ms |
| 50th percentile service time | search_recap_int_doc | 6.85777 | ms |
| 90th percentile service time | search_recap_int_doc | 11.1501 | ms |
| 99th percentile service time | search_recap_int_doc | 18.3515 | ms |
| 100th percentile service time | search_recap_int_doc | 33.2133 | ms |
| error rate | search_recap_int_doc | 0 | % |
`document_number` indexed as `keyword`:
| Min Throughput | search_recap_keyword_doc | 264.87 | ops/s |
| Mean Throughput | search_recap_keyword_doc | 264.87 | ops/s |
| Median Throughput | search_recap_keyword_doc | 264.87 | ops/s |
| Max Throughput | search_recap_keyword_doc | 264.87 | ops/s |
| 50th percentile latency | search_recap_keyword_doc | 8.38983 | ms |
| 90th percentile latency | search_recap_keyword_doc | 16.3272 | ms |
| 99th percentile latency | search_recap_keyword_doc | 32.2095 | ms |
| 100th percentile latency | search_recap_keyword_doc | 47.2592 | ms |
| 50th percentile service time | search_recap_keyword_doc | 8.38983 | ms |
| 90th percentile service time | search_recap_keyword_doc | 16.3272 | ms |
| 99th percentile service time | search_recap_keyword_doc | 32.2095 | ms |
| 100th percentile service time | search_recap_keyword_doc | 47.2592 | ms |
| error rate | search_recap_keyword_doc | 0 | % |
I ran this benchmark multiple times because the results were quite similar, with one occasionally outperforming the other and vice versa. I think in this case the query is so fast that it can be significantly affected by noise in the testing environment. In conclusion, I can't say one is better than the other. Perhaps running a test on a larger index in production would give more conclusive results. In general, though, there doesn't seem to be much improvement from moving from `int` to `keyword`.
However, I found another reason why moving to `keyword` could be a good idea. Currently, we have some IDs set to `int`, whose maximum value is 2^31 − 1, or 2,147,483,647. That seems like a large limit, but thinking about the future, it might be better to have a higher one, especially for `RECAPDocuments` and `Dockets`.
I still need to investigate an additional suggestion related to phrase search. I'll report my findings here as well.
Continuing to evaluate the potential search performance improvements that could be applied to our use case, I assessed:
index_phrases
This setting seemed promising; enabling the index_phrases option on .exact fields showed a significant improvement in phrase queries. The idea behind enabling this option is that two-term word combinations (shingles) are indexed into a separate field, giving a boost to exact phrase queries.
I enabled the option for all the RECAP `.exact` fields we use for `query_string` queries and performed a benchmark.
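For reference, a sketch of what enabling the option on an `.exact` subfield could look like. The analyzer and field names are assumptions, not the actual RECAP mapping:

```python
# Sketch: index_phrases on the .exact subfield makes ES index two-term
# shingles in a hidden side field that exact-phrase queries can use.
# Analyzer and field names are assumptions, not the actual RECAP mapping.
def text_field_with_phrase_index(analyzer="standard"):
    return {
        "type": "text",
        "fields": {
            "exact": {
                "type": "text",
                "analyzer": analyzer,
                "index_phrases": True,  # enables the shingle side field
            }
        },
    }

case_name_mapping = text_field_with_phrase_index()
```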
Phrase search query without `index_phrases` enabled:
| Min Throughput | search_recap phrase normal | 187.55 | ops/s |
| Mean Throughput | search_recap phrase normal | 188.03 | ops/s |
| Median Throughput | search_recap phrase normal | 187.55 | ops/s |
| Max Throughput | search_recap phrase normal | 188.99 | ops/s |
| 50th percentile latency | search_recap phrase normal | 13.7903 | ms |
| 90th percentile latency | search_recap phrase normal | 20.0521 | ms |
| 99th percentile latency | search_recap phrase normal | 29.4487 | ms |
| 100th percentile latency | search_recap phrase normal | 34.999 | ms |
| 50th percentile service time | search_recap phrase normal | 13.7903 | ms |
| 90th percentile service time | search_recap phrase normal | 20.0521 | ms |
| 99th percentile service time | search_recap phrase normal | 29.4487 | ms |
| 100th percentile service time | search_recap phrase normal | 34.999 | ms |
| error rate | search_recap phrase normal | 0 | % |
Phrase search query with `index_phrases` enabled:
| Min Throughput | search_recap phrase shingles | 314.82 | ops/s |
| Mean Throughput | search_recap phrase shingles | 317.55 | ops/s |
| Median Throughput | search_recap phrase shingles | 317.55 | ops/s |
| Max Throughput | search_recap phrase shingles | 320.28 | ops/s |
| 50th percentile latency | search_recap phrase shingles | 6.80929 | ms |
| 90th percentile latency | search_recap phrase shingles | 10.6978 | ms |
| 99th percentile latency | search_recap phrase shingles | 21.9254 | ms |
| 100th percentile latency | search_recap phrase shingles | 31.1703 | ms |
| 50th percentile service time | search_recap phrase shingles | 6.80929 | ms |
| 90th percentile service time | search_recap phrase shingles | 10.6978 | ms |
| 99th percentile service time | search_recap phrase shingles | 21.9254 | ms |
| 100th percentile service time | search_recap phrase shingles | 31.1703 | ms |
| error rate | search_recap phrase shingles | 0 | % |
As can be seen, there appears to be a significant improvement in exact phrase search by enabling this option. However, after applying it and running tests, all tests involving phrase search and highlighting failed.
After some research, this appears to be an ES bug. So, the observed boost might have been due solely to no highlighting being applied to the results. In any case, since we cannot stop highlighting phrase queries, this option does not seem viable for us.
I also evaluated this option, where global ordinals are calculated and stored as part of the field data cache. They help improve the performance of aggregations, as well as operations involving a `join` field, as in our case with `has_child` and `has_parent` queries.
Thus, by enabling this option, ES builds and caches the global ordinals before search requests are performed, instead of building them as they are requested.
I enabled this option and conducted a benchmark:
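A sketch of what the setting looks like on a `join` field mapping. The relation names are illustrative; note that on `join` fields recent ES versions already default `eager_global_ordinals` to `true`, so setting it explicitly mostly matters for `keyword` fields used in aggregations:

```python
# Sketch: eager_global_ordinals on the join field backing has_child /
# has_parent queries. Relation names are illustrative. On join fields ES
# defaults this to true; forcing it is mostly relevant for keyword fields.
docket_child_join = {
    "type": "join",
    "eager_global_ordinals": True,  # build ordinals at refresh, not at first query
    "relations": {"docket": "recap_document"},
}
```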
`eager_global_ordinals` disabled:
| Min Throughput | search_recap no globals | 15.25 | ops/s |
| Mean Throughput | search_recap no globals | 15.32 | ops/s |
| Median Throughput | search_recap no globals | 15.32 | ops/s |
| Max Throughput | search_recap no globals | 15.47 | ops/s |
| 50th percentile latency | search_recap no globals | 186.94 | ms |
| 90th percentile latency | search_recap no globals | 237.895 | ms |
| 99th percentile latency | search_recap no globals | 272.522 | ms |
| 100th percentile latency | search_recap no globals | 291.286 | ms |
| 50th percentile service time | search_recap no globals | 186.94 | ms |
| 90th percentile service time | search_recap no globals | 237.895 | ms |
| 99th percentile service time | search_recap no globals | 272.522 | ms |
| 100th percentile service time | search_recap no globals | 291.286 | ms |
| error rate | search_recap no globals | 0 | % |
`eager_global_ordinals` enabled:
| Min Throughput | search_recap eager global | 15.39 | ops/s |
| Mean Throughput | search_recap eager global | 15.43 | ops/s |
| Median Throughput | search_recap eager global | 15.42 | ops/s |
| Max Throughput | search_recap eager global | 15.54 | ops/s |
| 50th percentile latency | search_recap eager global | 183.913 | ms |
| 90th percentile latency | search_recap eager global | 229.807 | ms |
| 99th percentile latency | search_recap eager global | 248.07 | ms |
| 100th percentile latency | search_recap eager global | 256.371 | ms |
| 50th percentile service time | search_recap eager global | 183.913 | ms |
| 90th percentile service time | search_recap eager global | 229.807 | ms |
| 99th percentile service time | search_recap eager global | 248.07 | ms |
| 100th percentile service time | search_recap eager global | 256.371 | ms |
| error rate | search_recap eager global | 0 | % |
This benchmark was performed with the cache enabled to see if there was any advantage to warming up the global ordinals. The average throughput is almost the same. However, without the setting, the slowest request took `291.286ms`, while with the option enabled, the slowest request took `256.371ms`. Not a great difference, but it could mean that warming up global ordinals offers a slight boost the first time a query is performed, with the tradeoff of increased heap memory usage.
These findings, along with those reported in the previous comment, were the potential changes that could lead to a performance boost in search queries and that might require a change to the document mapping or a reindex. However, none of them seems to show a real improvement, or where there is one, it comes at the cost of other essential features like highlighting.
@mlissner to confirm if there is anything else we can do to improve search performance. Could we run the following profiles on Kibana?
GET recap_index_vector/_search?request_cache=false
As body, apply each of the following queries: normal_query.json single_child_field_query.json
That way we could see how queries are performing in prod and whether we can try something different to improve their performance.
In the article, there are other tweaks not related to indexing or mapping changes but to the filesystem and hardware that we could also consider.
Give memory to the filesystem cache
Avoid page cache thrashing by using modest readahead values on Linux
Let's go through the performance stuff together, Alberto. I'll find a minute, maybe tomorrow.
Did you see this comment about working around the phrase indexing bug:
https://github.com/elastic/elasticsearch/issues/40227#issuecomment-504630378
> Let's go through the performance stuff together, Alberto. I'll find a minute, maybe tomorrow.
Great, thanks!
> Did you see this comment about working around the phrase indexing bug?
Yeah, I'll test this workaround to see if it can be a solution for us since it's mentioned that highlighting for phrases doesn't work completely well using this approach.
I tried different approaches related to the workaround mentioned in the issue comment.
The main problem appears to be that `index_phrases` is not compatible with phrase highlighting because the query matches phrases using shingles stored in a different field, as described in the documentation, while highlighting is applied to the main field.
So, the workaround that worked best involves using the `highlight_query` setting with a `multi_match` query, since in our use case we need to query multiple fields.
This approach works because the `multi_match` query does not search the `.exact` field, unlike `query_string` with the option `"quote_field_suffix": ".exact"`, which uses the exact version of the field to match phrases. Thus, it uses the normal version of the field (which does not have `index_phrases` enabled) to search for terms and apply highlighting.
"highlight_query":{
"bool":{
"should":[
{
"multi_match":{
"fields":[
"case_name_full",
"suitNature",
"juryDemand",
"cause",
"assignedTo",
"referredTo",
"court",
"court_id",
"court_citation_string",
"chapter",
"trustee_str",
"short_description",
"plain_text",
"document_type",
"caseName^4.0",
"docketNumber^3.0",
"description^2.0"
],
"query":"\"apple inc\""
}
},
{
"multi_match":{
"fields":[
"case_name_full",
"suitNature",
"juryDemand",
"cause",
"assignedTo",
"referredTo",
"court",
"court_id",
"court_citation_string",
"chapter",
"trustee_str",
"short_description",
"plain_text",
"document_type",
"caseName^4.0",
"docketNumber^3.0",
"description^2.0"
],
"query":"\"apple inc\"",
"type":"phrase"
}
}
],
"minimum_should_match":1
}
The problem with this approach is that it only works partially: it highlights single terms instead of the entire phrase. As a result, the highlighted terms are not the same ones the main query matched the document on.
Here is an example:
This is with the `highlight_query` workaround applied. Highlights are displayed, but instead of being applied to the phrase "apple inc", they are applied independently to the terms "apple" and "inc".
This is the current approach, which does not use `index_phrases`; the difference is that phrases are highlighted as expected.
I also ran a benchmark on this workaround, and these are the results: track_index_phrase_workraround.json
| Min Throughput | search_recap phrase shingles | 203.42 | ops/s |
| Mean Throughput | search_recap phrase shingles | 203.42 | ops/s |
| Median Throughput | search_recap phrase shingles | 203.42 | ops/s |
| Max Throughput | search_recap phrase shingles | 203.42 | ops/s |
| 50th percentile latency | search_recap phrase shingles | 10.8529 | ms |
| 90th percentile latency | search_recap phrase shingles | 17.0272 | ms |
| 99th percentile latency | search_recap phrase shingles | 33.1628 | ms |
| 100th percentile latency | search_recap phrase shingles | 41.537 | ms |
| 50th percentile service time | search_recap phrase shingles | 10.8529 | ms |
| 90th percentile service time | search_recap phrase shingles | 17.0272 | ms |
| 99th percentile service time | search_recap phrase shingles | 33.1628 | ms |
| 100th percentile service time | search_recap phrase shingles | 41.537 | ms |
| error rate | search_recap phrase shingles | 0 | % |
We can confirm that part of the boost observed in yesterday's benchmark was due to the lack of highlighting. In this approach, with highlighting partially working, the throughput has decreased and is now closer to the original approach without `index_phrases`.
track_phrase_normal.json
| Min Throughput | search_recap phrase normal | 187.55 | ops/s |
| Mean Throughput | search_recap phrase normal | 188.03 | ops/s |
| Median Throughput | search_recap phrase normal | 187.55 | ops/s |
| Max Throughput | search_recap phrase normal | 188.99 | ops/s |
| 50th percentile latency | search_recap phrase normal | 13.7903 | ms |
| 90th percentile latency | search_recap phrase normal | 20.0521 | ms |
| 99th percentile latency | search_recap phrase normal | 29.4487 | ms |
| 100th percentile latency | search_recap phrase normal | 34.999 | ms |
| 50th percentile service time | search_recap phrase normal | 13.7903 | ms |
| 90th percentile service time | search_recap phrase normal | 20.0521 | ms |
| 99th percentile service time | search_recap phrase normal | 29.4487 | ms |
| 100th percentile service time | search_recap phrase normal | 34.999 | ms |
| error rate | search_recap phrase normal | 0 | % |
So, in conclusion, there still appears to be a slight advantage to the `index_phrases` approach. However, this advantage comes at the expense of phrase highlighting not fully working, which can confuse users viewing the results.
Thanks Alberto. Having been confused by wrong highlighting in our Solr implementation, I think it's definitely not worth it.
But, roughly, how much faster are things without highlighting altogether? Maybe it's worth just turning off highlighting completely? Is it adding that much functionality to the results? Maybe it's not? I just checked Bing, Kagi, and Google. None use highlighting anymore.
Sorry, correction:
But, roughly, how much faster are things without highlighting altogether?
Sure, here are some numbers.
I conducted various tests because the time it takes to highlight depends on the size of the documents being matched; if the documents matched are large, highlighting takes more time.
For instance.
Query: apple
Among the results, there is a document that contains more than 10,000 characters.
HL enabled.
| Min Throughput | search_recap normal query HL | 15.08 | ops/s |
| Mean Throughput | search_recap normal query HL | 15.14 | ops/s |
| Median Throughput | search_recap normal query HL | 15.16 | ops/s |
| Max Throughput | search_recap normal query HL | 15.23 | ops/s |
HL disabled.
| Min Throughput | search_recap normal query no HL | 73.13 | ops/s |
| Mean Throughput | search_recap normal query no HL | 73.65 | ops/s |
| Median Throughput | search_recap normal query no HL | 73.47 | ops/s |
| Max Throughput | search_recap normal query no HL | 74.36 | ops/s |
In this example, we can observe significantly better performance when highlighting is disabled.
Another example.
Query: states
No big document within results.
HL enabled.
| Min Throughput | search_recap states query HL | 101.79 | ops/s |
| Mean Throughput | search_recap states query HL | 102.43 | ops/s |
| Median Throughput | search_recap states query HL | 102.6 | ops/s |
| Max Throughput | search_recap states query HL | 102.89 | ops/s |
HL disabled.
| Min Throughput | search_recap states query no HL | 190.92 | ops/s |
| Mean Throughput | search_recap states query no HL | 192.94 | ops/s |
| Median Throughput | search_recap states query no HL | 192.94 | ops/s |
| Max Throughput | search_recap states query no HL | 194.96 | ops/s |
In this example, the version without highlighting is faster; however, the difference is not as significant as we observed in the first example.
Query: "apple inc"
HL enabled, normal index.
| Min Throughput | search_recap normal phrase HL | 196.73 | ops/s |
| Mean Throughput | search_recap normal phrase HL | 198.05 | ops/s |
| Median Throughput | search_recap normal phrase HL | 198.05 | ops/s |
| Max Throughput | search_recap normal phrase HL | 199.38 | ops/s |
HL disabled, normal index.
| Min Throughput | search_recap normal phrase NO HL | 385.41 | ops/s |
| Mean Throughput | search_recap normal phrase NO HL | 385.41 | ops/s |
| Median Throughput | search_recap normal phrase NO HL | 385.41 | ops/s |
| Max Throughput | search_recap normal phrase NO HL | 385.41 | ops/s |
HL disabled, `index_phrases` index.
| Min Throughput | search_recap shingles phrase NO HL | 480.15 | ops/s |
| Mean Throughput | search_recap shingles phrase NO HL | 491.01 | ops/s |
| Median Throughput | search_recap shingles phrase NO HL | 491.57 | ops/s |
| Max Throughput | search_recap shingles phrase NO HL | 500.07 | ops/s |
From these results, we can see how the version without highlighting performs on a phrase search, and its performance is slightly better when the `index_phrases` option is used at the same time.
In general, for queries that do not return large documents, performance roughly doubles after disabling HL. For queries that return large documents, disabling highlighting improves performance even more, though the gain depends on the number of matched documents and their size.
However, there is an additional variable we should also consider: the number of documents in the index.
I performed these benchmarks on a small index with around 20,000 documents. The size of the highlighting component in production might vary due to the number of documents.
Therefore, we need to at least perform some query profiling to determine which query components are taking more time in production.
To do this, I'm attaching some profiles that could be useful to get better insights into how highlighting is impacting production.
GET recap_index_vector/_search?request_cache=false
medium-query-no-hl.json medium-query-hl.json big-query-hl.json big-query-no-hl.json big-documents-query-no-hl copy.json big-documents-query-hl copy.json
Two additional profiles should be performed in production; these are related to count queries, so we can determine how they impact the overall request time in prod.
OK, here are your responses, I believe. Note that one failed (big-query-hl.json):
Great, thank you. I'll analyze the results and get back to you with my findings so we can start the sweep indexer.
@mlissner I've analyzed the profile responses and omitted the query profile that failed.
At first sight, I thought there was something wrong with production because the profile sizes were huge compared to those from my local environment. However, that's not the case. The reason profiles are huge on production is that it reports a profile for each shard in the cluster, resulting in 30 profiles per file.
I created two scripts to analyze the profiles because, due to the size of each file, it is complex to analyze them by hand.
The first script, analyze_query_components.txt, extracts the key components of the profile and computes the total time in milliseconds per shard. Here are the results: big-documents-query-hl_analize.json big-documents-query-no-hl_analize.json big-query-no-hl_analyze.json dockets_count_query_analize.json fillings_count_query_analize.json medium-query-hl_analize.json medium-query-no-hl_analize.json
Here is the breakdown for one shard for the query: big-documents-query-hl
{
"id": "[3zrjVUmdQbejwIdnw1XrAA][recap_vectors][1]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"time_ms": 245.673662
}
],
"rewrite_time": 626.626444,
"total_time_search": 872.300106
}
],
"fetch_time_ms": 4133.962441
"total_time_shard": 5006.262546999999,
}
We can observe that the `BooleanQuery` time totals around 245 ms.
The `rewrite_time` is around 626 ms; this is the time Elasticsearch takes to optimize the query.
The `fetch_time_ms` is around 4,000 ms, which includes HL (highlighting) and retrieving documents from the index.
From this, we can see that most of the time in an ES request is spent on HL and retrieving documents. The query time itself is not significant but, if needed, a review of the query might be especially useful for queries that match millions of documents at once. For instance, the big-query-no-hl-response query matches 53,089,192 Cases and 294,689,707 RECAPDocuments.
It has the following breakdown for a shard:
{
"id": "[3zrjVUmdQbejwIdnw1XrAA][recap_vectors][4]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"time_ms": 5503.596591
}
],
"rewrite_time": 5252.904828,
"total_time_search": 10756.501419
}
],
"total_time_shard": 10756.501419
}
This query does not apply HL (the HL version of the same query is the one that timed out). So we can see that when matching a massive number of documents, the query time can also be expensive. Later, we could try to determine whether it's possible to disable or remove some query clauses to simplify the overall query while preserving the quality of results. This review won't require a reindex.
The second script, analyze_profiles_totals.txt, prints the total time per shard in milliseconds and adds some stats at the end, useful for analysis. For each profile, it returns something like:
{
"average_shard_time_ms": 9689.299090133332,
"max_time_shard": "[nmLgdM-aSDCkZNmyVoOodg][recap_vectors][19]",
"max_time_shard_time_ms": 20040.880836,
"global_time_ms": 290678.97270399996,
"shard_count": 30,
"took_ms": 18206
}
Here are the results: big-query-no-hl-simplified-totals.json big-documents-query-hl-simplified-totals.json big-documents-query-no-hl-simplified-totals.json dockets_count_query-simplified-totals.json fillings_count_query-simplified-totals.json medium-query-hl-simplified-totals.json medium-query-no-hl-simplified-totals.json
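A condensed sketch of what the analysis scripts do: walk the `profile` section of an ES response and total the time per shard. ES reports times in nanoseconds (`time_in_nanos`); the structure and names below follow the profile format, but the real scripts sum more components (collectors, fetch phases) than this minimal version:

```python
# Minimal sketch: total per-shard time from an ES profile response section.
# ES reports query times in nanoseconds; we convert to milliseconds.
def shard_totals(profile):
    """Return {shard_id: total_ms} summing query time and rewrite time."""
    totals = {}
    for shard in profile["shards"]:
        total_ns = 0
        for search in shard["searches"]:
            total_ns += sum(q["time_in_nanos"] for q in search["query"])
            total_ns += search["rewrite_time"]
        totals[shard["id"]] = total_ns / 1_000_000  # ns -> ms
    return totals

sample = {
    "shards": [
        {
            "id": "[node][index][0]",
            "searches": [
                {"query": [{"time_in_nanos": 2_000_000}], "rewrite_time": 1_000_000}
            ],
        }
    ]
}
totals = shard_totals(sample)
```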
In conclusion, I can see two reasons why performance is affected in prod.
From the production profiles we can confirm that highlighting is expensive, especially in certain scenarios.
big-documents-query-hl profile
This query matched documents with large `plain_text` fields, so we expected highlighting to be expensive here.
The metrics of interest are:
"took_ms": 8419
"max_time_shard_time_ms": 5870.834745
"average_shard_time_ms": 2226.520687433333
We can see the query `took` time is around 8 seconds.
While the average time to process the query on each shard was around 2 seconds, considering that shard operations run in parallel, the global query time should be closer to the time taken by the slowest shard, which was close to 6 seconds.
The actual `took` time is longer, which makes sense since profiling introduces additional overhead, plus some components the profile doesn't measure, such as time spent in the queue, merging shard responses on the coordinating node, and additional work described in the documentation.
big-documents-query-no-hl profile This is the same previous query but without HL.
Metrics of interest are: "took_ms": 1667 "max_time_shard_time_ms": 1341.438036 "average_shard_time_ms": 834.2702551
From these metrics, we can see that in the worst-case scenario, when highlighting is too expensive, a query without HL could be around 5 times faster.
medium-query-hl profile This is a query that matched 243,854 Cases and 1,571,324 Docket Entries Not targeted to match big documents. HL enabled.
Metrics of interest are: "took_ms": 2532 "max_time_shard_time_ms": 1752.211235 "average_shard_time_ms": 638.510290
medium-query-no-hl profile Same query HL disabled.
Metrics of interest are: "took_ms": 1257 "max_time_shard_time_ms": 1269.419189 "average_shard_time_ms": 961.2542206
Since this can be a regular query, we can see that on average, disabling HL can make a query twice as fast.
It's important to note that removing HL will also eliminate the possibility of showing snippets with the `plain_text` content that was matched by the query.
In RECAP, in addition to the main query, we perform two additional queries: - Dockets matched count query
Dockets count: 53,087,704 "took_ms": 12181 "max_time_shard_time_ms": 9087.384 "average_shard_time_ms": 6652.13974
- RECAPDocuments count query. Documents count: 294,651,169 "took_ms": 7810 "max_time_shard_time_ms": 25611.872 "average_shard_time_ms": 5075.93326
From these results, we can observe that count queries are also expensive, especially when they match a large number of documents. Something strange to note: in the RECAPDocuments query, one shard reportedly took around 25 seconds to complete the operation, while the global `took` time is just around 7 seconds, which doesn't make sense. Thus, the real `took` time here could be even longer.
Even though these count queries and the main query are executed in parallel, they could be impacting the overall speed for the user in case a count query takes longer than the main query. Additionally, these extra queries being executed simultaneously might be impacting the overall cluster response time.
Something we can do is add a waffle flag that allows us to test queries without performing count queries, so we can compare the performance improvement directly in production.
Another suggestion Mike proposed is that we could terminate the count queries if they run longer than a short amount of time and just show something like `+10,000 documents`.
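One way to implement that idea, sketched under the assumption that we use the search body's `terminate_after` and `track_total_hits` parameters (both are standard ES options; the exact cap and query are illustrative):

```python
# Sketch: cap a count-style query. terminate_after stops each shard after
# collecting max_hits docs, and track_total_hits stops exact counting past
# the cap, so the UI can render the result as "10,000+ documents".
def capped_count_query(query, max_hits=10_000):
    return {
        "query": query,
        "size": 0,                     # we only want the count, not hits
        "terminate_after": max_hits,   # per-shard early termination
        "track_total_hits": max_hits,  # count exactly only up to the cap
    }

body = capped_count_query({"match": {"plain_text": "apple"}})
```

Note that `terminate_after` is applied per shard, so the reported total can still exceed the cap; the point is bounding the work, not the exact cutoff.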
From these profiles, we can observe that the time each shard takes to process the query varies significantly, depending primarily on two factors:
However, it's not straightforward to determine whether increasing or decreasing the number of shards will help with performance. Increasing the number of shards leads to smaller shards, which are faster to query, but to see a real improvement, this would also require increasing the number of nodes proportionally, so operations can be parallelized. Otherwise, increasing the number of shards could lead to a degradation in performance since it adds overhead to nodes in addition to the coordination and merging results time.
In small clusters, fewer shards could work better due to reduced merging time and coordination effort. https://stackoverflow.com/a/66624665
If we want to explore this alternative, it will require testing multiple configurations to determine what is best according to our setup and resources. Changing the number of shards will necessitate the creation of a new index, and the content can be copied using the re_index API.
Unless you want to test changing the number of shards (which might impact the reindex), I think we're ready to start the new reindex once #3871 is merged.
Thanks. This is a lot to think about, but it seems like the results are:
Highlighting is expensive, but we can't turn it off because then we'd lose snippets.
Counts are expensive. It looks like we might be able to use `track_total_hits` to help with this, though.
We discussed this a bit in https://github.com/freelawproject/courtlistener/pull/3648, but I feel like we didn't consider just setting it to an integer and letting the front end do the `More than 1,000 hits` kind of thing.
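The front-end side of that idea can be sketched from the response shape: when `track_total_hits` is an integer, ES stops counting at that bound and reports `hits.total.relation` as `"gte"` when the real count is larger. This is a sketch, not our actual template code:

```python
# Sketch: turn the hits.total object from an ES response into a UI label.
# With track_total_hits set to an integer, relation is "eq" for exact
# counts and "gte" when the count was cut off at the bound.
def format_hit_count(total):
    """total is the hits.total dict from an ES search response."""
    if total["relation"] == "gte":
        return f"More than {total['value']:,} hits"
    return f"{total['value']:,} hits"

label = format_hit_count({"value": 1000, "relation": "gte"})
```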
We could tinker with shards. More might help, or not.
Something else we haven't discussed (because it's better to think about this after we've optimized the software) is optimizing hardware. I don't actually know where our bottleneck is, whether it's CPU, disk, or networking. I just looked and things seem sort of OK to me, but I guess we can ask Ramiro to do this soon.
Anyway, bottom line, I agree we can set this aside for later.
I spun off my summarizing comment above into https://github.com/freelawproject/courtlistener/issues/4209, where we can continue this in a fresh issue. Closing this mega conversation with my appreciation and gratitude.
@mlissner As we talked about, I've been investigating this issue to determine what is causing some queries to be too slow. Spoiler: highlighting the `plain_text` field is the main reason.
Query Profiling
I started by profiling a text query for RECAPSearch.
I populated my testing cluster with 2,079 cases and 15,936 entries, all taken from the dev database, so it's real data.
To profile a query, it's only required to add `"profile": true` to the query request body and set the cache to `false`. It will then return the normal results with an additional profile key. Here is an example: profile_response.txt
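Put together, the profiled request looks roughly like this (the query body is illustrative; the cache is disabled via the URL, as in `GET recap_vectors/_search?request_cache=false`):

```python
# Sketch of a profiled search request body: "profile": true adds a
# per-shard timing breakdown to the response. The query is illustrative.
profiled_body = {
    "profile": True,
    "query": {"match": {"plain_text": "apple"}},
}
```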
The profiled query was a simple text query:
?type=r&q=apple&order_by=score%20desc
These are the results.
Normal, all child, all HL
This is the query currently being used in production; it includes up to 6 child documents with the following highlighted fields for each child doc:
In this graph showing the results, it is possible to observe the time taken for each query component, measured in ms:
From this, we can see that the phase taking the most time is `InnerHitsPhase`. Adding up all the phases, we get a total of `235.908916 ms` for the entire request.
No child at all
Then I performed the same query excluding all the child documents (inner_hits) from the results:
Total query time: `2.963787 ms`
From this, we can confirm that the phase taking the most time is the display of child documents.
All child, no HL
In the following query, I tested keeping child documents but removing the HL of child fields.
Total query time: `10.64146 ms`
From here, we can see that including child documents in the results without applying highlighting increases the query time, but not as significantly as the query that includes HL.
So, the main problem appears to be the highlighting of child documents.
All child, excluding plain_text HL
The following query includes child documents and HL for all the child fields except for `plain_text`.
Total query time: `14.377129 ms`
Here we can see that HL of `plain_text` is taking the most time.
All child, only plain_text HL
And we can confirm that by checking the inverse query, including only the `plain_text` HL.
Total query time: `219.818077 ms`
From the following query, we can confirm that the issue is not related to extracting `plain_text` and sending it in the results. This is considering that our query excludes `plain_text` from `_source` and instead always displays a `plain_text` fragment via highlighting when the field contains content.

All children, no HL, without excluding plain_text

Total query time:
8.356079 ms
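The `_source` exclusion described above can be sketched like this (a simplified `inner_hits` section; the real query carries more options):

```python
# Sketch: build the inner_hits section with or without the _source
# exclusion. When plain_text is excluded from _source, the UI relies on
# a highlighted fragment instead of the raw field, which is why returning
# the full field is not the bottleneck here.
def inner_hits_source_config(exclude_plain_text: bool) -> dict:
    inner_hits = {"size": 6}
    if exclude_plain_text:
        inner_hits["_source"] = {"excludes": ["plain_text"]}
    return inner_hits

print(inner_hits_source_config(True))
```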
Finally, from the following query profile, we can confirm that the issue is the HL of `plain_text` and that it is directly related to the amount of content to be highlighted. By default, our current HL processes up to `999_999` characters, which is the maximum allowed. Reducing it to `999` decreased the query time.

All children, all HL, limit plain_text HL to 999
Total query time:
18.753498 ms
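A sketch of capping the highlighted length, using the standard `fragment_size`/`no_match_size` highlight parameters (whether the production query tunes exactly these parameters is an assumption on my part):

```python
# Sketch: limit how large a plain_text fragment the highlighter returns.
# fragment_size and no_match_size are standard Elasticsearch highlight
# parameters; no_match_size controls the fragment shown when the field
# contains no match for the query.
def plain_text_highlight(limit: int) -> dict:
    return {
        "fields": {
            "plain_text": {
                "fragment_size": limit,
                "no_match_size": limit,
                "number_of_fragments": 1,
            }
        }
    }

print(plain_text_highlight(999))
```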
We can observe that there is the `HighlightPhase`, which doesn't take much time, and also the `InnerHitsPhase`. According to these tests, it seems that the `InnerHitsPhase` performs the highlighting for child fields and also handles the extraction of child documents and their fields to render them into the results.

In the following graph, we can clearly observe all the previous queries and their respective times:
Benchmarking
The query times shown in the previous profiles vary with each request. I tried to take an average value, but they're not statistically representative. Consequently, I decided to also perform benchmarking to obtain more representative values, including the variable of load. We can replicate this benchmarking in our production cluster to have a reference before launching into production.
The benchmarks were conducted on my local testing cluster, which contains the same number of documents as reported in the previous profiles.
Using a current cluster with documents ensures that results are more accurate compared to an isolated environment with limited documents and different resources from the production cluster.
If we run these benchmarks in the production cluster, it is advisable to do so before going live, as the benchmarks are resource-intensive. It's also recommended to ensure snapshots are up-to-date as a precaution in case something goes wrong.
We'll need a pod that has connectivity to the production cluster via the load balancer, and to install Rally:
pip3 install esrally
Then we need to place the race assets in a directory. (I'll attach all the assets in a following comment.)
And then run:
esrally race --pipeline=benchmark-only --track-path=tracks/recap --target-hosts=localhost:9200 --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'password'"
After tweaking the client options.
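For reference, a minimal Rally track might look like the sketch below (the index name, query body, and iteration counts are placeholders, not the actual race assets attached to this issue):

```python
import json

# Sketch of a minimal esrally track.json, following Rally's track format.
# The index name, query, client count, and iteration counts are placeholders.
track = {
    "version": 2,
    "description": "RECAP highlighting benchmark (sketch)",
    "indices": [{"name": "recap"}],
    "challenges": [
        {
            "name": "hl-benchmark",
            "default": True,
            "schedule": [
                {
                    "operation": {
                        "operation-type": "search",
                        "index": "recap",
                        "cache": False,  # disable the request cache
                        "body": {"query": {"match": {"plain_text": "apple"}}},
                    },
                    "clients": 8,
                    "warmup-iterations": 100,
                    "iterations": 500,
                }
            ],
        }
    ],
}

with open("track.json", "w") as f:
    json.dump(track, f, indent=2)
```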
Here are the results I got from these benchmarks.
The benchmarking requests have the following parameters:
6 child, all HL

This is the query that is currently being used in production; it includes up to 6 child documents with all the selected fields HL.

6 child, no HL

All the child documents, but no HL.

6 child, HL only plain_text

6 child, all HL except plain_text

No child

Query that doesn't display `inner_hits` at all.

6 children, all HL, limit plain_text HL to 999 chars

Query with all the child fields HL, but limiting `plain_text` HL to 999 chars.

In the following graph we can see these results more clearly.
I also conducted some additional benchmarks to evaluate other variables, such as the number of child documents (inner_hits) displayed:
From here, we can see that with all the HL activated, we achieve the best performance by retrieving only 1 child document. From 2 child documents onwards, performance starts to decrease.
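The variants for this benchmark differ only in the `inner_hits` size; a sketch of generating them (the child type and field names are assumed, mirroring the `1_child_all_hl.json` … `6_child_all_hl.json` race assets only in spirit):

```python
# Sketch: generate one query body per inner_hits size (1..6), the variable
# exercised in this benchmark. "recap_document" and "plain_text" are
# illustrative names, not the exact production mapping.
def child_count_variants(max_children: int = 6) -> list:
    variants = []
    for n in range(1, max_children + 1):
        variants.append({
            "query": {
                "has_child": {
                    "type": "recap_document",
                    "query": {"match": {"plain_text": "apple"}},
                    "inner_hits": {"size": n},
                }
            }
        })
    return variants

print(len(child_count_variants()))
```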
Another variable was the type of query requested:
From here, we can see that performance also depends on the number of documents matched and the type of query requested. For instance, a Match All query has better performance than a text query matching fewer documents, because the Match All query does not perform highlighting.

A filter, or a filter combined with a text query, can also have good performance as long as only a few documents are matched.
In brief:
From these benchmarks and the previous profiles, we can conclude that the process taking the most time is the highlighting of `plain_text`. The worst performance is observed in our current query, which includes 6 child documents and highlights all selected child fields. You can see that the best performance is achieved with a query that does not display child documents. However, since displaying child documents is mandatory for our use case, our optimization goal could be to achieve performance similar to the '6 children, all HL except plain_text' query. Therefore, we should consider performance when assessing other highlighters in #3300, to see if it's possible to gain additional efficiency in this process.