albertisfu closed this issue 2 months ago.
Here are the assets used for:

- Profiles: profile_all_child_all_hl_plain_999_limit.json, profile_all_child_hl_except_plain.json, profile_all_child_hl_only_plain.json, profile_all_child_no_hl.json, profile_all_child_no_hl_no_exclude_plain.json, profile_no_child.json, profile_normal_all_child_all_hl.json
- Text query: track_all_child_all_hl_plain_999_limit.json, track_all_child_hl_only_plain.json, track_all_child_no_hl.json, track_no_child.json, track_normal_all_child_all_hl.json, track_all_child_hl_except_plain.json
- Number of child documents (inner_hits) displayed: 1_child_all_hl.json, 2_child_all_hl.json, 3_child_all_hl.json, 4_child_all_hl.json, 5_child_all_hl.json, 6_child_all_hl.json
- Type of query: track_exact_text_query.json, track_filter_case_name_text_query.json, track_filter_case_name.json, track_match_all.json, track_normal_text_query.json
This looks great, thanks for researching this. Elastic sure makes this easier than anything I ever saw with Solr.
I'm not sure what the best next steps would be between this analysis, the opinions DB being almost ready (and easier for testing), and the need to try highlighters with different indexes. What do you suggest?
Yeah, if we are going to try the highlighters with different indexes, I can begin by assessing other highlighters and options using the RECAP index I used in this research and measuring their performance too. This will give us a good starting point for testing the highlighters in production, at least in terms of performance. When you're back, it would be a good idea to execute at least one benchmark in production. That will give us a reference for how the cluster performs with the current highlighters; then, when we switch highlighters, we can evaluate the performance again.
Following up on this issue, I've tested other highlighters on my RECAP testing indexes.
As a reference, I used the HL combinations described in this post that Eduardo shared with me.
These are my findings.
The testing index contained 2,329 cases and 17,672 docket entries taken from the dev database.
Additionally, I added a couple of large documents taken from this case, where some of the `plain_text` content runs to more than 2,000 pages. This lets us assess HL on those documents and determine which HL works well on `plain_text` with more than 1,000,000 characters, which is the current limit we have when using the plain HL.
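A note on that limit: it appears to correspond to Elasticsearch's `index.highlight.max_analyzed_offset` index setting (default 1,000,000), which caps how many characters the plain and unified (no offsets) highlighters will analyze per field. A minimal sketch of the settings body that controls it (the index name and the value are illustrative):

```python
# Sketch: build the settings body for index.highlight.max_analyzed_offset,
# the per-index cap on characters analyzed by plain/unified highlighting.
# Applying it would be e.g. PUT /recap/_settings with this body (the index
# name "recap" is illustrative).
def highlight_offset_settings(limit: int) -> dict:
    return {"index": {"highlight": {"max_analyzed_offset": limit}}}

settings_body = highlight_offset_settings(999_999)
```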
The HL fields are:
"short_description"
"short_description.exact"
"description"
"description.exact"
"document_type"
"document_type.exact"
"document_number"
"attachment_number"
"plain_text"
"plain_text.exact"
Up to 6 child documents were rendered per case in the results.
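For reference, the highlight clause covering those fields would look roughly like this (a sketch; the highlighter type and any fragment options are parameters of the experiment, not fixed values):

```python
# Sketch: build the "highlight" section of a search body for the HL fields
# listed above, applying one highlighter type to all of them.
HL_FIELDS = [
    "short_description", "short_description.exact",
    "description", "description.exact",
    "document_type", "document_type.exact",
    "document_number", "attachment_number",
    "plain_text", "plain_text.exact",
]

def build_highlight(highlighter_type: str) -> dict:
    return {"fields": {name: {"type": highlighter_type} for name in HL_FIELDS}}

highlight_clause = build_highlight("plain")
```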
| Highlighter | Offset strategy | Index size | Mean throughput (ops/sec) | HL limit |
|---|---|---|---|---|
| Plain | None | 28.5mb | 7.48 | 999_999 chars |
| Unified | None | 28.5mb | 5.82 | 999_999 chars |
| Unified | Postings | 37.3mb | 7.37 | No limit |
| Unified | Term vectors | 63.5mb | 5.94 | No limit |
| FVH | Term vectors | 63.5mb | 18.45 | No limit |
From these results, we can observe that, using this testing index:

- The FVH is by far the fastest highlighter, at more than double the throughput of the alternatives.
- Adopting it would require adjusting the `merge_highlights_into_result` method, since this HL returns whole phrases highlighted instead of independent terms.

This analysis can serve as a starting point for performing similar benchmarking in production so we can choose the best HL for the different indexes.
The good news is that the index changes required for the unified and FVH HL can be applied by creating a new index and using the `_reindex` API, which will copy the whole index and create the required term vectors or offset lists during the process.
So the process to perform these benchmarks in prod can be the following:

Apply the required changes to the RECAP index mapping:

- `index_options="offsets"` on each text field (for the unified HL with postings), or
- `term_vector="with_positions_offsets"` on each text field (for the FVH).

Then change the RECAP index name in `es_indices`, e.g.:

```python
recap_index = Index("recap_postings")
```
Create the new index.
```shell
docker exec -it cl-django python /opt/courtlistener/manage.py search_index --create --models search.Docket
```
These changes can be done in a maintenance pod.
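The two mapping variants above can be sketched as plain mapping dicts (in the codebase the same keyword arguments go on the text fields in documents.py; the strategy labels here are illustrative):

```python
# Sketch: text-field mapping for each offset strategy benchmarked above.
def text_field_mapping(strategy: str) -> dict:
    mapping = {"type": "text"}
    if strategy == "postings":
        # Unified HL with postings: record offsets in the postings lists.
        mapping["index_options"] = "offsets"
    elif strategy == "term_vectors":
        # FVH: store full term vectors with positions and offsets.
        mapping["term_vector"] = "with_positions_offsets"
    return mapping

fvh_field = text_field_mapping("term_vectors")
postings_field = text_field_mapping("postings")
```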
Then we'll need to copy the current index content to the new index using the `_reindex` API:
```
POST /_reindex
{
  "max_docs": 10000000,
  "source": { "index": "recap" },
  "dest": { "index": "recap_postings" }
}
```

(`max_docs` is the number of documents we want to copy.)
When the re-indexing is finished, we can run the benchmark using this new index:
```shell
esrally race --pipeline=benchmark-only --track-path=tracks/recap --target-hosts=localhost:9200 --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'password'"
```
The tracks I used for these benchmarks are: track_plain.json track_fhv_vector.json track_unified_vectors.json track_unified_postings.json track_unified_no_offsets_limited.json
Sounds great. Let's get those index/code changes in place for FVH, and then give that a go in prod. Thank you for the detailed research.
Per the instructions in #3550, I created a new index with 2.7M items, then ran esrally using a slightly tweaked version of track_fhv_vector.json from above (it needed the index name changed):
```
root@maintenance-ml:/opt/courtlistener# esrally race --pipeline=benchmark-only --track-path=track.json --target-hosts=$ELASTICSEARCH_DSL_HOST --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'xxx'"

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [8193eef4-6da4-4f46-8f29-9a65a9501bf9]
[INFO] Racing on track [track] and car ['external'] with version [8.8.1].
[WARNING] merges_total_time is 44831354 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 7275732 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 115689703 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 16727398 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 24023556 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running cluster-health [100% done]
Running search_recap_1 [100% done]
```
```
------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
```
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|---------------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 1928.24 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 13.451 | min |
| Max cumulative indexing time across primary shards | | 132.828 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 748.243 | min |
| Cumulative merge count of primary shards | | 124232 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 7.23336 | min |
| Max cumulative merge time across primary shards | | 72.9313 | min |
| Cumulative merge throttle time of primary shards | | 121.689 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.308858 | min |
| Max cumulative merge throttle time across primary shards | | 32.9825 | min |
| Cumulative refresh time of primary shards | | 279.09 | min |
| Cumulative refresh count of primary shards | | 1.26767e+06 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.329433 | min |
| Max cumulative refresh time across primary shards | | 13.5236 | min |
| Cumulative flush time of primary shards | | 400.4 | min |
| Cumulative flush count of primary shards | | 274151 | |
| Min cumulative flush time across primary shards | | 8.33333e-05 | min |
| Median cumulative flush time across primary shards | | 4.46041 | min |
| Max cumulative flush time across primary shards | | 13.7263 | min |
| Total Young Gen GC time | | 1.189 | s |
| Total Young Gen GC count | | 29 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 776.153 | GB |
| Translog size | | 0.145759 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 1573 | |
| Total Ingest Pipeline count | | 0 | |
| Total Ingest Pipeline time | | 0 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | search_recap_1 | 13.69 | ops/s |
| Mean Throughput | search_recap_1 | 15.77 | ops/s |
| Median Throughput | search_recap_1 | 15.83 | ops/s |
| Max Throughput | search_recap_1 | 17.52 | ops/s |
| 50th percentile latency | search_recap_1 | 186.781 | ms |
| 90th percentile latency | search_recap_1 | 245.813 | ms |
| 99th percentile latency | search_recap_1 | 295.616 | ms |
| 100th percentile latency | search_recap_1 | 318.035 | ms |
| 50th percentile service time | search_recap_1 | 186.781 | ms |
| 90th percentile service time | search_recap_1 | 245.813 | ms |
| 99th percentile service time | search_recap_1 | 295.616 | ms |
| 100th percentile service time | search_recap_1 | 318.035 | ms |
| error rate | search_recap_1 | 0 | % |
```
--------------------------------
[INFO] SUCCESS (took 77 seconds)
--------------------------------
```
And here's the result with track_plain.json (with tweaks to clients so they align with the FVH run):
```
------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
```
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|---------------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 1929.77 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 13.2665 | min |
| Max cumulative indexing time across primary shards | | 132.9 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 748.655 | min |
| Cumulative merge count of primary shards | | 124397 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 7.14795 | min |
| Max cumulative merge time across primary shards | | 72.9445 | min |
| Cumulative merge throttle time of primary shards | | 121.689 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.298867 | min |
| Max cumulative merge throttle time across primary shards | | 32.9825 | min |
| Cumulative refresh time of primary shards | | 279.378 | min |
| Cumulative refresh count of primary shards | | 1.26933e+06 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.32225 | min |
| Max cumulative refresh time across primary shards | | 13.529 | min |
| Cumulative flush time of primary shards | | 400.702 | min |
| Cumulative flush count of primary shards | | 274476 | |
| Min cumulative flush time across primary shards | | 8.33333e-05 | min |
| Median cumulative flush time across primary shards | | 4.06453 | min |
| Max cumulative flush time across primary shards | | 13.7384 | min |
| Total Young Gen GC time | | 1.487 | s |
| Total Young Gen GC count | | 36 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 774.318 | GB |
| Translog size | | 0.00284654 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 1570 | |
| Total Ingest Pipeline count | | 0 | |
| Total Ingest Pipeline time | | 0 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | search_recap_1 | 24.15 | ops/s |
| Mean Throughput | search_recap_1 | 24.32 | ops/s |
| Median Throughput | search_recap_1 | 24.33 | ops/s |
| Max Throughput | search_recap_1 | 24.43 | ops/s |
| 50th percentile latency | search_recap_1 | 197.64 | ms |
| 90th percentile latency | search_recap_1 | 245.809 | ms |
| 99th percentile latency | search_recap_1 | 285.975 | ms |
| 100th percentile latency | search_recap_1 | 349.173 | ms |
| 50th percentile service time | search_recap_1 | 197.64 | ms |
| 90th percentile service time | search_recap_1 | 245.809 | ms |
| 99th percentile service time | search_recap_1 | 285.975 | ms |
| 100th percentile service time | search_recap_1 | 349.173 | ms |
| error rate | search_recap_1 | 0 | % |
```
--------------------------------
[INFO] SUCCESS (took 72 seconds)
--------------------------------
```
Just to check our work, we're making a new index with plain highlighters. I deleted the term_vector parameters from documents.py, created a new index named `recap_plain`, and I'm copying the vector index into it:
```
POST /_reindex?wait_for_completion=false
{
  "source": { "index": "recap_vectors" },
  "dest": { "index": "recap_plain" }
}
```
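With `wait_for_completion=false`, Elasticsearch returns a task id right away, and progress can then be checked via the tasks API (`GET /_tasks/<task_id>`) until `completed` is true. A sketch of that polling loop, with the HTTP call stubbed out rather than hitting a real cluster:

```python
import time

# Sketch: poll an async _reindex task until it completes. `fetch_task` stands
# in for GET /_tasks/<task_id> against the cluster; here it is a stub.
def wait_for_reindex(task_id, fetch_task, poll_seconds=0.0, max_polls=100):
    """Return the final task document once `completed` is true."""
    for _ in range(max_polls):
        task = fetch_task(task_id)
        if task.get("completed"):
            return task
        time.sleep(poll_seconds)
    raise TimeoutError(f"reindex task {task_id} still running")

# Stubbed responses simulating two in-flight polls, then completion.
_responses = iter([
    {"completed": False},
    {"completed": False},
    {"completed": True, "response": {"total": 2_700_000}},
])
result = wait_for_reindex("abc:123", lambda task_id: next(_responses))
```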
When that completed, I re-ran the esrally against it:
```
    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [c7c50798-b8a6-4893-890d-bf6a9af3cdd7]
[INFO] Racing on track [track_plain] and car ['external'] with version [8.8.1].
[WARNING] merges_total_time is 45726362 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 7382861 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 121952200 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 16857283 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 26196947 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running cluster-health [100% done]
Running search_recap_1 [100% done]
```
```
------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
```
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|---------------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 2032.55 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 4.86531 | min |
| Max cumulative indexing time across primary shards | | 133.096 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 762.331 | min |
| Cumulative merge count of primary shards | | 124806 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 3.02768 | min |
| Max cumulative merge time across primary shards | | 72.9464 | min |
| Cumulative merge throttle time of primary shards | | 123.07 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.0872083 | min |
| Max cumulative merge throttle time across primary shards | | 32.9825 | min |
| Cumulative refresh time of primary shards | | 281.619 | min |
| Cumulative refresh count of primary shards | | 1.27326e+06 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.133883 | min |
| Max cumulative refresh time across primary shards | | 13.5441 | min |
| Cumulative flush time of primary shards | | 436.642 | min |
| Cumulative flush count of primary shards | | 275797 | |
| Min cumulative flush time across primary shards | | 8.33333e-05 | min |
| Median cumulative flush time across primary shards | | 1.976 | min |
| Max cumulative flush time across primary shards | | 13.7568 | min |
| Total Young Gen GC time | | 1.57 | s |
| Total Young Gen GC count | | 37 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 786.204 | GB |
| Translog size | | 0.202693 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 2058 | |
| Total Ingest Pipeline count | | 0 | |
| Total Ingest Pipeline time | | 0 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | search_recap_1 | 8.1 | ops/s |
| Mean Throughput | search_recap_1 | 10.56 | ops/s |
| Median Throughput | search_recap_1 | 10.67 | ops/s |
| Max Throughput | search_recap_1 | 12.49 | ops/s |
| 50th percentile latency | search_recap_1 | 183.585 | ms |
| 90th percentile latency | search_recap_1 | 233.26 | ms |
| 99th percentile latency | search_recap_1 | 280.198 | ms |
| 100th percentile latency | search_recap_1 | 347.43 | ms |
| 50th percentile service time | search_recap_1 | 183.585 | ms |
| 90th percentile service time | search_recap_1 | 233.26 | ms |
| 99th percentile service time | search_recap_1 | 280.198 | ms |
| 100th percentile service time | search_recap_1 | 347.43 | ms |
| error rate | search_recap_1 | 0 | % |
```
--------------------------------
[INFO] SUCCESS (took 92 seconds)
--------------------------------
```
And to compare disk sizes, we have, vectors:

```
index          store.size
recap_vectors  24.1gb
```

Plain:

```
index        store.size
recap_plain  11.8gb
```
The switch to actually using this highlighter seems to be helping, but performance still isn't great. (Though it might get better as the index warms, I'm not sure.)
Alberto also says he has a trick for pagination that might help, so we should attempt that if we can.
Pagination didn't help, but I had a thought tonight while researching something in RECAP. We currently do 20 cases per page, each with five sub-documents, for a total of 100 results/page. If we halved that, would it help with performance? We'd still have very long pages, with 50 items per page?
Yeah, in terms of ES performance, the number of child documents directly impacts it.
Here's one of the graphs posted previously, comparing the retrieval of a varying number of child documents in a query.
This benchmark was performed when using `plain_text` highlighting. It could change now that we use FVH. However, we can see that reducing the number of child documents to 3 or 4 can increase the throughput to almost twice that of having 5 child documents.
Additionally, we could consider some kind of improvement to how the page is transported and rendered.
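For context, the number of child documents retrieved per case is just the `size` option on `inner_hits` of the `has_child` clause, so it can be tuned without other query changes. A sketch (the child type name and match field are illustrative, not the exact production query):

```python
# Sketch: a has_child query returning at most `n_children` child documents
# per case via the inner_hits "size" option. The match clause is illustrative.
def case_query_with_children(text: str, n_children: int) -> dict:
    return {
        "query": {
            "has_child": {
                "type": "recap_document",
                "query": {"match": {"plain_text": text}},
                "inner_hits": {"size": n_children},
            }
        }
    }

q = case_query_with_children("motion to dismiss", 3)
```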
Sorry, I was unclear. I meant halving the number of parent documents. I assume that'd help too?
Got it! I didn't run a test varying the number of parent documents, so I'll do that to measure how much improvement we can get by halving them.
OK, I did the benchmark and here are the results.
For the query_string `filed`, which retrieved a full page of parent documents, each with 5 children.
20 Results per page:
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap_size_20 | 40.37 | ops/s |
| Mean Throughput | search_recap_size_20 | 40.96 | ops/s |
| Median Throughput | search_recap_size_20 | 40.81 | ops/s |
| Max Throughput | search_recap_size_20 | 41.81 | ops/s |
| 50th percentile latency | search_recap_size_20 | 58.9689 | ms |
| 90th percentile latency | search_recap_size_20 | 75.6003 | ms |
| 99th percentile latency | search_recap_size_20 | 97.0243 | ms |
| 100th percentile latency | search_recap_size_20 | 118.104 | ms |
| 50th percentile service time | search_recap_size_20 | 58.9689 | ms |
| 90th percentile service time | search_recap_size_20 | 75.6003 | ms |
| 99th percentile service time | search_recap_size_20 | 97.0243 | ms |
| 100th percentile service time | search_recap_size_20 | 118.104 | ms |
| error rate | search_recap_size_20 | 0 | % |
10 Results per page:
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap_size_10 | 77.78 | ops/s |
| Mean Throughput | search_recap_size_10 | 79.22 | ops/s |
| Median Throughput | search_recap_size_10 | 79.41 | ops/s |
| Max Throughput | search_recap_size_10 | 80.26 | ops/s |
| 50th percentile latency | search_recap_size_10 | 31.0351 | ms |
| 90th percentile latency | search_recap_size_10 | 38.7568 | ms |
| 99th percentile latency | search_recap_size_10 | 62.0943 | ms |
| 100th percentile latency | search_recap_size_10 | 93.46 | ms |
| 50th percentile service time | search_recap_size_10 | 31.0351 | ms |
| 90th percentile service time | search_recap_size_10 | 38.7568 | ms |
| 99th percentile service time | search_recap_size_10 | 62.0943 | ms |
| 100th percentile service time | search_recap_size_10 | 93.46 | ms |
| error rate | search_recap_size_10 | 0 | % |
track_10_results.json track_20_results.json
So, according to these results, halving the parent results roughly doubles the search speed. The final numbers may vary depending on the query and filters applied, but there does seem to be a real performance improvement from halving the results.
Awesome. Let's do it.
https://github.com/freelawproject/courtlistener/issues/3819 has another performance enhancement we can do for queries ordered by date filed.
This article describes some techniques to improve search performance in ES.
I've been testing and measuring suggestions related to index changes that could be applied in our case. This way, if we need to modify something in the index, we can do so before starting the new RECAP re-index.
This suggestion implies that if our query uses a `query_string` over multiple fields, it's better to merge the content of those fields into a single field at index time and use only that field for searching, while preserving the independent fields solely for display purposes.
In ES, the `copy_to` parameter allows us to automate copying the content from multiple fields into a single field specifically designated for searching.
Accordingly, I added a new field called `all_fields` to the `DocketBaseDocket` and copied all the fields used for querying into it:
"case_name_full",
"suitNature",
"juryDemand",
"cause",
"assignedTo",
"referredTo",
"court",
"court_id",
"court_citation_string",
"chapter",
"trustee_str",
"short_description",
"plain_text",
"document_type"
Then I did a `_reindex` from my testing base index.
This is the index with the original size:

```
recap_index_vector 52.6mb 52.6mb
```

This is the index after copying the multi-fields into the new `all_fields` search field:

```
recap_vectors_all 75.2mb 75.2mb
```
Then, I performed multiple benchmarks and query profiling to determine whether this provides a boost in performance or not.
Current approach, query string on multiple fields and Highlighting enabled.
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap current | 16.56 | ops/s |
| Mean Throughput | search_recap current | 16.67 | ops/s |
| Median Throughput | search_recap current | 16.67 | ops/s |
| Max Throughput | search_recap current | 16.76 | ops/s |
| 50th percentile latency | search_recap current | 169.736 | ms |
| 90th percentile latency | search_recap current | 176.354 | ms |
| 99th percentile latency | search_recap current | 183.92 | ms |
| 99.9th percentile latency | search_recap current | 191.892 | ms |
| 100th percentile latency | search_recap current | 193.15 | ms |
| 50th percentile service time | search_recap current | 169.736 | ms |
| 90th percentile service time | search_recap current | 176.354 | ms |
| 99th percentile service time | search_recap current | 183.92 | ms |
| 99.9th percentile service time | search_recap current | 191.892 | ms |
| 100th percentile service time | search_recap current | 193.15 | ms |
| error rate | search_recap current | 0 | % |
copy_to approach, searching over the new field `all_fields`, HL enabled.
In this approach, the query searches on `all_fields`, but we still apply highlighting to fields other than the one used for searching. This is possible by setting `require_field_match` to `False`. So, if a term is found in `all_fields`, that term is also highlighted in the other fields where it appears. This is useful for preserving our highlighting behavior across different fields.
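A sketch of this query shape: search only on `all_fields`, but ask for highlights on the display fields, with `require_field_match` disabled so matches from `all_fields` still light them up (the highlighted field names here are a small illustrative subset):

```python
# Sketch: query on all_fields with highlighting on the display fields.
# require_field_match=False lets fields that were not queried be highlighted.
def all_fields_search(text: str) -> dict:
    return {
        "query": {"query_string": {"query": text, "fields": ["all_fields"]}},
        "highlight": {
            "require_field_match": False,
            "fields": {"short_description": {}, "plain_text": {}},
        },
    }

body = all_fields_search("filed")
```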
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap all fields Normal | 7.09 | ops/s |
| Mean Throughput | search_recap all fields Normal | 7.17 | ops/s |
| Median Throughput | search_recap all fields Normal | 7.17 | ops/s |
| Max Throughput | search_recap all fields Normal | 7.22 | ops/s |
| 50th percentile latency | search_recap all fields Normal | 392.368 | ms |
| 90th percentile latency | search_recap all fields Normal | 458.192 | ms |
| 99th percentile latency | search_recap all fields Normal | 494.602 | ms |
| 100th percentile latency | search_recap all fields Normal | 516.665 | ms |
| 50th percentile service time | search_recap all fields Normal | 392.368 | ms |
| 90th percentile service time | search_recap all fields Normal | 458.192 | ms |
| 99th percentile service time | search_recap all fields Normal | 494.602 | ms |
| 100th percentile service time | search_recap all fields Normal | 516.665 | ms |
| error rate | search_recap all fields Normal | 0 | % |
We can see that the throughput has not improved; in fact, it's worse.
So I took other measurements to identify where the problem was in this approach, which should have been faster. I suspected the issue was highlighting with `require_field_match = False`, and ran the following benchmarks.
copy_to approach, searching over the new field `all_fields`, HL completely disabled.
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap all fields no HL | 72.35 | ops/s |
| Mean Throughput | search_recap all fields no HL | 72.96 | ops/s |
| Median Throughput | search_recap all fields no HL | 73.14 | ops/s |
| Max Throughput | search_recap all fields no HL | 73.22 | ops/s |
| 50th percentile latency | search_recap all fields no HL | 37.6347 | ms |
| 90th percentile latency | search_recap all fields no HL | 44.4349 | ms |
| 99th percentile latency | search_recap all fields no HL | 58.9945 | ms |
| 100th percentile latency | search_recap all fields no HL | 62.8425 | ms |
| 50th percentile service time | search_recap all fields no HL | 37.6347 | ms |
| 90th percentile service time | search_recap all fields no HL | 44.4349 | ms |
| 99th percentile service time | search_recap all fields no HL | 58.9945 | ms |
| 100th percentile service time | search_recap all fields no HL | 62.8425 | ms |
| error rate | search_recap all fields no HL | 0 | % |
Current approach, query string on multiple fields and Highlighting completely disabled.
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap_int_doc multi no hl | 67.82 | ops/s |
| Mean Throughput | search_recap_int_doc multi no hl | 68.27 | ops/s |
| Median Throughput | search_recap_int_doc multi no hl | 68.25 | ops/s |
| Max Throughput | search_recap_int_doc multi no hl | 68.6 | ops/s |
| 50th percentile latency | search_recap_int_doc multi no hl | 39.3914 | ms |
| 90th percentile latency | search_recap_int_doc multi no hl | 49.0279 | ms |
| 99th percentile latency | search_recap_int_doc multi no hl | 72.2078 | ms |
| 100th percentile latency | search_recap_int_doc multi no hl | 106.778 | ms |
| 50th percentile service time | search_recap_int_doc multi no hl | 39.3914 | ms |
| 90th percentile service time | search_recap_int_doc multi no hl | 49.0279 | ms |
| 99th percentile service time | search_recap_int_doc multi no hl | 72.2078 | ms |
| 100th percentile service time | search_recap_int_doc multi no hl | 106.778 | ms |
| error rate | search_recap_int_doc multi no hl | 0 | % |
We can see a slight improvement from the `all_fields` approach when highlighting is disabled.
I then performed other benchmarks to confirm the theory.
copy_to approach, searching over the new field `all_fields`, HL only over `all_fields`.
In this approach, `all_fields` is used for search and also for HL, with HL disabled for all the other fields.
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | search_recap all fields HL | 16.89 | ops/s |
| Mean Throughput | search_recap all fields HL | 17.02 | ops/s |
| Median Throughput | search_recap all fields HL | 17.04 | ops/s |
| Max Throughput | search_recap all fields HL | 17.09 | ops/s |
| 50th percentile latency | search_recap all fields HL | 170.136 | ms |
| 90th percentile latency | search_recap all fields HL | 181.265 | ms |
| 99th percentile latency | search_recap all fields HL | 232.612 | ms |
| 100th percentile latency | search_recap all fields HL | 245.52 | ms |
| 50th percentile service time | search_recap all fields HL | 170.136 | ms |
| 90th percentile service time | search_recap all fields HL | 181.265 | ms |
| 99th percentile service time | search_recap all fields HL | 232.612 | ms |
| 100th percentile service time | search_recap all fields HL | 245.52 | ms |
| error rate | search_recap all fields HL | 0 | % |
We can see that the throughput is slightly better (though not statistically significant) than the current approach, which involves searching over multiple fields and applying highlighting over multiple fields as well. However, this comes with the disadvantage of losing HL for independent fields.
The issue of `all_fields` with HL is illustrated in the following query profiles:
Current approach, query string on multiple fields and Highlighting enabled.
| Phase | Time (ms) |
|---|---|
| BooleanQuery | 0.558718 |
| rewrite_time | 0.633584 |
| collector | 0.016288 |
| Children: FetchSourcePhase | 0.018292 |
| Children: HighlightPhase | 34.038625 |
| Children: InnerHitsPhase | 121.066542 |
| Children: StoredFieldsPhase | 0.009167 |
| Total | 156.341216 |
copy_to approach, searching over the new field `all_fields`, HL enabled.

| Phase | Time (ms) |
|---|---|
| BooleanQuery | 0.445343 |
| rewrite_time | 0.525458 |
| collector | 0.013835 |
| Children: FetchSourcePhase | 0.510458 |
| Children: HighlightPhase | 20.785165 |
| Children: InnerHitsPhase | 314.997749 |
| Children: StoredFieldsPhase | 0.014709 |
| Total | 337.292717 |
The `BooleanQuery` time is slightly better in the `all_fields` approach; however, we can see in the InnerHitsPhase (which includes the highlighting of the most expensive field, `plain_text`) how applying HL to all the other fields using `require_field_match = False` is costly.
In summary, using a single field for searching seems to offer a slight performance improvement over the multi-field approach. However, in our use case, the query phase that consumes the most time is highlighting (specifically, `plain_text` HL), and highlighting fields other than the one matched is expensive, resulting in worse overall performance. As an alternative, we could consider applying HL only to the `all_fields` field. However, as shown above, the performance improvement is not significantly better than the current approach, so it is not worth losing HL on all the other fields.
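As a reference, here is a minimal sketch of how a `copy_to` mapping like the one tested could be defined. Field names and types are illustrative, not the exact RECAP mapping:

```python
# Sketch of a copy_to mapping: every searchable text field is also copied
# into a combined all_fields field, so a query can target a single field.
# Field names are illustrative, not the exact RECAP mapping.
def build_copy_to_mapping(fields):
    """Return an index mapping where each text field copies into all_fields."""
    properties = {"all_fields": {"type": "text"}}
    for field in fields:
        properties[field] = {"type": "text", "copy_to": ["all_fields"]}
    return {"mappings": {"properties": properties}}

mapping = build_copy_to_mapping(["caseName", "docketNumber", "plain_text"])
```

The tradeoff described above still applies: matches are found in `all_fields`, so highlighting the original fields requires `require_field_match: false`.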
I also tested this suggestion. It says that `keyword` fields perform better when used in a `term` filter, while `int` and `long` fields are optimized for range queries. Therefore, if we have `int` and `long` fields that are not used in range queries, they should be indexed as keywords instead.
To evaluate this, I re-indexed some `int` fields as `keyword` (`document_number` and `attachment_number`) and conducted a benchmark. The test query only used a `term` filter on `document_number`.
`document_number` indexed as `int`:
| Min Throughput | search_recap_int_doc | 265.34 | ops/s |
| Mean Throughput | search_recap_int_doc | 265.34 | ops/s |
| Median Throughput | search_recap_int_doc | 265.34 | ops/s |
| Max Throughput | search_recap_int_doc | 265.34 | ops/s |
| 50th percentile latency | search_recap_int_doc | 6.85777 | ms |
| 90th percentile latency | search_recap_int_doc | 11.1501 | ms |
| 99th percentile latency | search_recap_int_doc | 18.3515 | ms |
| 100th percentile latency | search_recap_int_doc | 33.2133 | ms |
| 50th percentile service time | search_recap_int_doc | 6.85777 | ms |
| 90th percentile service time | search_recap_int_doc | 11.1501 | ms |
| 99th percentile service time | search_recap_int_doc | 18.3515 | ms |
| 100th percentile service time | search_recap_int_doc | 33.2133 | ms |
| error rate | search_recap_int_doc | 0 | % |
`document_number` indexed as `keyword`:
| Min Throughput | search_recap_keyword_doc | 264.87 | ops/s |
| Mean Throughput | search_recap_keyword_doc | 264.87 | ops/s |
| Median Throughput | search_recap_keyword_doc | 264.87 | ops/s |
| Max Throughput | search_recap_keyword_doc | 264.87 | ops/s |
| 50th percentile latency | search_recap_keyword_doc | 8.38983 | ms |
| 90th percentile latency | search_recap_keyword_doc | 16.3272 | ms |
| 99th percentile latency | search_recap_keyword_doc | 32.2095 | ms |
| 100th percentile latency | search_recap_keyword_doc | 47.2592 | ms |
| 50th percentile service time | search_recap_keyword_doc | 8.38983 | ms |
| 90th percentile service time | search_recap_keyword_doc | 16.3272 | ms |
| 99th percentile service time | search_recap_keyword_doc | 32.2095 | ms |
| 100th percentile service time | search_recap_keyword_doc | 47.2592 | ms |
| error rate | search_recap_keyword_doc | 0 | % |
I ran this benchmark multiple times because the results were quite similar, with one occasionally outperforming the other and vice versa. I think in this case the query is so fast that it can be significantly affected by noise in the testing environment. In conclusion, I can't say one is better than the other. Perhaps running a test on a larger index in production would give more conclusive results. In general, though, there doesn't seem to be much improvement from moving from `int` to `keyword`.
However, I found another reason why moving to `keyword` could be a good idea. Currently, we have some IDs set to `int`, whose maximum value is 2^31 − 1, or 2,147,483,647. That seems like a large limit, but thinking about the future, it might be better to have a higher one, especially for `RECAPDocuments` and `Dockets`.
I still need to investigate an additional suggestion related to phrase search. I'll report my findings here as well.
Continuing to evaluate the potential search performance improvements that could be applied to our use case, I assessed:
index_phrases
This setting seemed promising; enabling the index_phrases option on .exact fields showed a significant improvement in phrase queries. The idea behind enabling this option is that two-term word combinations (shingles) are indexed into a separate field, giving a boost to exact phrase queries.
I enabled the option for all the RECAP `.exact` fields we use for `query_string` queries and performed a benchmark.
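For reference, a sketch of what enabling the option on an `.exact` subfield could look like. The analyzer and field names are assumptions, not the actual RECAP mapping:

```python
# Sketch: index_phrases on the .exact subfield makes ES index two-term
# shingles in a hidden side field that exact-phrase queries can use.
# Analyzer and field names are assumptions, not the actual RECAP mapping.
def text_field_with_phrase_index(analyzer="standard"):
    return {
        "type": "text",
        "fields": {
            "exact": {
                "type": "text",
                "analyzer": analyzer,
                "index_phrases": True,  # enables the shingle side field
            }
        },
    }

case_name_mapping = text_field_with_phrase_index()
```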
Phrase search query without `index_phrases` enabled:
| Min Throughput | search_recap phrase normal | 187.55 | ops/s |
| Mean Throughput | search_recap phrase normal | 188.03 | ops/s |
| Median Throughput | search_recap phrase normal | 187.55 | ops/s |
| Max Throughput | search_recap phrase normal | 188.99 | ops/s |
| 50th percentile latency | search_recap phrase normal | 13.7903 | ms |
| 90th percentile latency | search_recap phrase normal | 20.0521 | ms |
| 99th percentile latency | search_recap phrase normal | 29.4487 | ms |
| 100th percentile latency | search_recap phrase normal | 34.999 | ms |
| 50th percentile service time | search_recap phrase normal | 13.7903 | ms |
| 90th percentile service time | search_recap phrase normal | 20.0521 | ms |
| 99th percentile service time | search_recap phrase normal | 29.4487 | ms |
| 100th percentile service time | search_recap phrase normal | 34.999 | ms |
| error rate | search_recap phrase normal | 0 | % |
Phrase search query with `index_phrases` enabled:
| Min Throughput | search_recap phrase shingles | 314.82 | ops/s |
| Mean Throughput | search_recap phrase shingles | 317.55 | ops/s |
| Median Throughput | search_recap phrase shingles | 317.55 | ops/s |
| Max Throughput | search_recap phrase shingles | 320.28 | ops/s |
| 50th percentile latency | search_recap phrase shingles | 6.80929 | ms |
| 90th percentile latency | search_recap phrase shingles | 10.6978 | ms |
| 99th percentile latency | search_recap phrase shingles | 21.9254 | ms |
| 100th percentile latency | search_recap phrase shingles | 31.1703 | ms |
| 50th percentile service time | search_recap phrase shingles | 6.80929 | ms |
| 90th percentile service time | search_recap phrase shingles | 10.6978 | ms |
| 99th percentile service time | search_recap phrase shingles | 21.9254 | ms |
| 100th percentile service time | search_recap phrase shingles | 31.1703 | ms |
| error rate | search_recap phrase shingles | 0 | % |
As can be seen, there appears to be a significant improvement in exact phrase search by enabling this option. However, after applying it and running tests, all tests involving phrase search and highlighting failed.
After some research, this appears to be an ES bug. So, the observed boost might have been due solely to no highlighting being applied to the results. In any case, since we cannot stop highlighting phrase queries, this option does not seem viable for us.
I also evaluated this option, where global ordinals are calculated and stored as part of the field data cache. They help improve the performance of aggregations, as well as operations involving a `join` field, as in our case with `has_child` and `has_parent` queries.
Thus, by enabling this option, ES builds and caches the global ordinals before search requests are performed, instead of building them as they are requested.
I enabled this option and conducted a benchmark:
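A sketch of what the setting looks like on a `join` field mapping. The relation names are illustrative; note that on `join` fields recent ES versions already default `eager_global_ordinals` to `true`, so setting it explicitly mostly matters for `keyword` fields used in aggregations:

```python
# Sketch: eager_global_ordinals on the join field backing has_child /
# has_parent queries. Relation names are illustrative. On join fields ES
# defaults this to true; forcing it is mostly relevant for keyword fields.
docket_child_join = {
    "type": "join",
    "eager_global_ordinals": True,  # build ordinals at refresh, not at first query
    "relations": {"docket": "recap_document"},
}
```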
`eager_global_ordinals` disabled:
| Min Throughput | search_recap no globals | 15.25 | ops/s |
| Mean Throughput | search_recap no globals | 15.32 | ops/s |
| Median Throughput | search_recap no globals | 15.32 | ops/s |
| Max Throughput | search_recap no globals | 15.47 | ops/s |
| 50th percentile latency | search_recap no globals | 186.94 | ms |
| 90th percentile latency | search_recap no globals | 237.895 | ms |
| 99th percentile latency | search_recap no globals | 272.522 | ms |
| 100th percentile latency | search_recap no globals | 291.286 | ms |
| 50th percentile service time | search_recap no globals | 186.94 | ms |
| 90th percentile service time | search_recap no globals | 237.895 | ms |
| 99th percentile service time | search_recap no globals | 272.522 | ms |
| 100th percentile service time | search_recap no globals | 291.286 | ms |
| error rate | search_recap no globals | 0 | % |
`eager_global_ordinals` enabled:
| Min Throughput | search_recap eager global | 15.39 | ops/s |
| Mean Throughput | search_recap eager global | 15.43 | ops/s |
| Median Throughput | search_recap eager global | 15.42 | ops/s |
| Max Throughput | search_recap eager global | 15.54 | ops/s |
| 50th percentile latency | search_recap eager global | 183.913 | ms |
| 90th percentile latency | search_recap eager global | 229.807 | ms |
| 99th percentile latency | search_recap eager global | 248.07 | ms |
| 100th percentile latency | search_recap eager global | 256.371 | ms |
| 50th percentile service time | search_recap eager global | 183.913 | ms |
| 90th percentile service time | search_recap eager global | 229.807 | ms |
| 99th percentile service time | search_recap eager global | 248.07 | ms |
| 100th percentile service time | search_recap eager global | 256.371 | ms |
| error rate | search_recap eager global | 0 | % |
This benchmark was performed with the cache enabled to see if there was any advantage to warming up the global ordinals. The average throughput is almost the same. However, without the setting, the slowest request took `291.286ms`, while with the option enabled, the slowest request took `256.371ms`. Not a great difference, but it could mean that warming up global ordinals offers a slight boost the first time a query is performed, with the tradeoff of increased heap memory usage.
These findings, along with those reported in the previous comment, were the potential changes that could lead to a performance boost in search queries and that might require a change to the document mapping or a reindex. However, none of them seems to show a real improvement, or where there is one, it comes at the cost of other essential features like highlighting.
@mlissner to confirm if there is anything else we can do to improve search performance. Could we run the following profiles on Kibana?
GET recap_index_vector/_search?request_cache=false
As body, apply each of the following queries: normal_query.json single_child_field_query.json
That way we could see how queries are performing in prod and whether we can try something different to improve their performance.
In the article, there are other tweaks not related to indexing or mapping changes but to the filesystem and hardware that we could also consider.
Give memory to the filesystem cache
Avoid page cache thrashing by using modest readahead values on Linux
Let's go through the performance stuff together, Alberto. I'll find a minute, maybe tomorrow.
Did you see this comment about working around the phrase indexing bug:
https://github.com/elastic/elasticsearch/issues/40227#issuecomment-504630378
> Let's go through the performance stuff together, Alberto. I'll find a minute, maybe tomorrow.
Great, thanks!
> Did you see this comment about working around the phrase indexing bug?
Yeah, I'll test this workaround to see if it can be a solution for us since it's mentioned that highlighting for phrases doesn't work completely well using this approach.
I tried different approaches related to the workaround mentioned in the issue comment.
The main problem appears to be that `index_phrases` is not compatible with phrase highlighting because the query matches phrases using shingles stored in a different field, as described in the documentation, while highlighting is applied to the main field.
So, the workaround that worked best involves using the `highlight_query` setting with a `multi_match` query, since in our use case we need to query multiple fields.
This approach works because the `multi_match` query does not search the `.exact` field, unlike `query_string` with the option `"quote_field_suffix": ".exact"`, which uses the exact version of the field to match phrases. Thus, it uses the normal version of the field (which does not have `index_phrases` enabled) to search for terms and apply highlighting.
"highlight_query":{
"bool":{
"should":[
{
"multi_match":{
"fields":[
"case_name_full",
"suitNature",
"juryDemand",
"cause",
"assignedTo",
"referredTo",
"court",
"court_id",
"court_citation_string",
"chapter",
"trustee_str",
"short_description",
"plain_text",
"document_type",
"caseName^4.0",
"docketNumber^3.0",
"description^2.0"
],
"query":"\"apple inc\""
}
},
{
"multi_match":{
"fields":[
"case_name_full",
"suitNature",
"juryDemand",
"cause",
"assignedTo",
"referredTo",
"court",
"court_id",
"court_citation_string",
"chapter",
"trustee_str",
"short_description",
"plain_text",
"document_type",
"caseName^4.0",
"docketNumber^3.0",
"description^2.0"
],
"query":"\"apple inc\"",
"type":"phrase"
}
}
],
"minimum_should_match":1
}
The problem with this approach is that it only works partially: it highlights single terms instead of the entire phrase. As a result, the highlighted terms are not the same ones the main query matched the document on.
Here is an example:
This is with the `highlight_query` workaround applied. Highlights are displayed, but instead of being applied to the phrase "apple inc", they are applied independently to the terms "apple" and "inc".
This is the current approach, which does not use `index_phrases`; the difference is that phrases are highlighted as expected.
I also ran a benchmark on this workaround, and these are the results: track_index_phrase_workraround.json
| Min Throughput | search_recap phrase shingles | 203.42 | ops/s |
| Mean Throughput | search_recap phrase shingles | 203.42 | ops/s |
| Median Throughput | search_recap phrase shingles | 203.42 | ops/s |
| Max Throughput | search_recap phrase shingles | 203.42 | ops/s |
| 50th percentile latency | search_recap phrase shingles | 10.8529 | ms |
| 90th percentile latency | search_recap phrase shingles | 17.0272 | ms |
| 99th percentile latency | search_recap phrase shingles | 33.1628 | ms |
| 100th percentile latency | search_recap phrase shingles | 41.537 | ms |
| 50th percentile service time | search_recap phrase shingles | 10.8529 | ms |
| 90th percentile service time | search_recap phrase shingles | 17.0272 | ms |
| 99th percentile service time | search_recap phrase shingles | 33.1628 | ms |
| 100th percentile service time | search_recap phrase shingles | 41.537 | ms |
| error rate | search_recap phrase shingles | 0 | % |
We can confirm that part of the boost observed in yesterday's benchmark was due to the lack of highlighting. In this approach, with highlighting partially working, the throughput has decreased and is now closer to the original approach without `index_phrases`.
track_phrase_normal.json
| Min Throughput | search_recap phrase normal | 187.55 | ops/s |
| Mean Throughput | search_recap phrase normal | 188.03 | ops/s |
| Median Throughput | search_recap phrase normal | 187.55 | ops/s |
| Max Throughput | search_recap phrase normal | 188.99 | ops/s |
| 50th percentile latency | search_recap phrase normal | 13.7903 | ms |
| 90th percentile latency | search_recap phrase normal | 20.0521 | ms |
| 99th percentile latency | search_recap phrase normal | 29.4487 | ms |
| 100th percentile latency | search_recap phrase normal | 34.999 | ms |
| 50th percentile service time | search_recap phrase normal | 13.7903 | ms |
| 90th percentile service time | search_recap phrase normal | 20.0521 | ms |
| 99th percentile service time | search_recap phrase normal | 29.4487 | ms |
| 100th percentile service time | search_recap phrase normal | 34.999 | ms |
| error rate | search_recap phrase normal | 0 | % |
So, in conclusion, there still appears to be a slight advantage to the `index_phrases` approach. However, this advantage comes at the expense of phrase highlighting not fully working, which can confuse users viewing the results.
Thanks Alberto. Having been confused by wrong highlighting in our Solr implementation, I think it's definitely not worth it.
But, roughly, how much faster are things without highlighting altogether? Maybe it's worth just turning off highlighting completely? Is it adding that much functionality to the results? Maybe it's not? I just checked Bing, Kagi, and Google. None use highlighting anymore.
Sorry, correction:
But, roughly, how much faster are things without highlighting altogether?
Sure, here are some numbers.
I conducted various tests because the time it takes to highlight depends on the size of the documents being matched; if the documents matched are large, highlighting takes more time.
For instance.
Query: apple
Among the results, there is a document that contains more than 10,000 characters.
HL enabled.
| Min Throughput | search_recap normal query HL | 15.08 | ops/s |
| Mean Throughput | search_recap normal query HL | 15.14 | ops/s |
| Median Throughput | search_recap normal query HL | 15.16 | ops/s |
| Max Throughput | search_recap normal query HL | 15.23 | ops/s |
HL disabled.
| Min Throughput | search_recap normal query no HL | 73.13 | ops/s |
| Mean Throughput | search_recap normal query no HL | 73.65 | ops/s |
| Median Throughput | search_recap normal query no HL | 73.47 | ops/s |
| Max Throughput | search_recap normal query no HL | 74.36 | ops/s |
In this example, we can observe significantly better performance when highlighting is disabled.
Another example.
Query: states
No big document within results.
HL enabled.
| Min Throughput | search_recap states query HL | 101.79 | ops/s |
| Mean Throughput | search_recap states query HL | 102.43 | ops/s |
| Median Throughput | search_recap states query HL | 102.6 | ops/s |
| Max Throughput | search_recap states query HL | 102.89 | ops/s |
HL disabled.
| Min Throughput | search_recap states query no HL | 190.92 | ops/s |
| Mean Throughput | search_recap states query no HL | 192.94 | ops/s |
| Median Throughput | search_recap states query no HL | 192.94 | ops/s |
| Max Throughput | search_recap states query no HL | 194.96 | ops/s |
In this example, the version without highlighting is faster; however, the difference is not as significant as we observed in the first example.
Query: "apple inc"
HL enabled, normal index.
| Min Throughput | search_recap normal phrase HL | 196.73 | ops/s |
| Mean Throughput | search_recap normal phrase HL | 198.05 | ops/s |
| Median Throughput | search_recap normal phrase HL | 198.05 | ops/s |
| Max Throughput | search_recap normal phrase HL | 199.38 | ops/s |
HL disabled, normal index.
| Min Throughput | search_recap normal phrase NO HL | 385.41 | ops/s |
| Mean Throughput | search_recap normal phrase NO HL | 385.41 | ops/s |
| Median Throughput | search_recap normal phrase NO HL | 385.41 | ops/s |
| Max Throughput | search_recap normal phrase NO HL | 385.41 | ops/s |
HL disabled, `index_phrases` index.
| Min Throughput | search_recap shingles phrase NO HL | 480.15 | ops/s |
| Mean Throughput | search_recap shingles phrase NO HL | 491.01 | ops/s |
| Median Throughput | search_recap shingles phrase NO HL | 491.57 | ops/s |
| Max Throughput | search_recap shingles phrase NO HL | 500.07 | ops/s |
From these results, we can see how the version without highlighting performs on a phrase search, and its performance is slightly better when the `index_phrases` option is used at the same time.
In general, for queries that do not return large documents, performance roughly doubles after disabling HL. For queries that return large documents, disabling highlighting improves performance even more, though the gain depends on the number of matched documents and their size.
However, there is an additional variable we should also consider: the number of documents in the index.
I performed these benchmarks on a small index with around 20,000 documents. The size of the highlighting component in production might vary due to the number of documents.
Therefore, we need to at least perform some query profiling to determine which query components are taking more time in production.
To do this, I'm attaching some profiles that could be useful to get better insights into how highlighting is impacting production.
GET recap_index_vector/_search?request_cache=false
medium-query-no-hl.json medium-query-hl.json big-query-hl.json big-query-no-hl.json big-documents-query-no-hl copy.json big-documents-query-hl copy.json
Two additional profiles should be performed in production; these are related to count queries, so we can determine how they impact the overall request time in prod.
OK, here are your responses, I believe. Note that one failed (big-query-hl.json):
Great, thank you. I'll analyze the results and get back to you with my findings so we can start the sweep indexer.
@mlissner I've analyzed the profile responses and omitted the query profile that failed.
At first sight, I thought there was something wrong with production because the profile sizes were huge compared to those from my local environment. However, that's not the case. The reason profiles are huge on production is that it reports a profile for each shard in the cluster, resulting in 30 profiles per file.
I created two scripts to analyze the profiles because, due to the size of each file, it is complex to analyze them by hand.
The first script, analyze_query_components.txt, extracts the key components of the profile and computes the total time in milliseconds per shard. Here are the results: big-documents-query-hl_analize.json big-documents-query-no-hl_analize.json big-query-no-hl_analyze.json dockets_count_query_analize.json fillings_count_query_analize.json medium-query-hl_analize.json medium-query-no-hl_analize.json
Here is the breakdown for one shard for the query: big-documents-query-hl
{
"id": "[3zrjVUmdQbejwIdnw1XrAA][recap_vectors][1]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"time_ms": 245.673662
}
],
"rewrite_time": 626.626444,
"total_time_search": 872.300106
}
],
"fetch_time_ms": 4133.962441
"total_time_shard": 5006.262546999999,
}
We can observe that the `BooleanQuery` time totals around 245 ms.
The `rewrite_time` is around 626 ms; this is the time Elasticsearch takes to optimize the query.
The `fetch_time_ms` is around 4,000 ms, which includes HL (highlighting) and retrieving documents from the index.
From this, we can see that most of the time in an ES request is spent on HL and retrieving documents. The query time itself is not significant but, if needed, a review of the query might be especially useful for queries that match millions of documents at once. For instance, the big-query-no-hl-response query matches 53,089,192 Cases and 294,689,707 RECAPDocuments.
It has the following breakdown for a shard:
{
"id": "[3zrjVUmdQbejwIdnw1XrAA][recap_vectors][4]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"time_ms": 5503.596591
}
],
"rewrite_time": 5252.904828,
"total_time_search": 10756.501419
}
],
"total_time_shard": 10756.501419
}
This query does not apply HL (the HL version of the same query is the one that timed out). So we can see that when matching a massive number of documents, the query time can also be expensive. Later, we could try to determine whether it's possible to disable or remove some query clauses to simplify the overall query while preserving the quality of results. This review won't require a reindex.
The second script, analyze_profiles_totals.txt, prints the total time per shard in milliseconds and adds some stats at the end, useful for analysis. For each profile, it returns something like:
{
"average_shard_time_ms": 9689.299090133332,
"max_time_shard": "[nmLgdM-aSDCkZNmyVoOodg][recap_vectors][19]",
"max_time_shard_time_ms": 20040.880836,
"global_time_ms": 290678.97270399996,
"shard_count": 30,
"took_ms": 18206
}
Here are the results: big-query-no-hl-simplified-totals.json big-documents-query-hl-simplified-totals.json big-documents-query-no-hl-simplified-totals.json dockets_count_query-simplified-totals.json fillings_count_query-simplified-totals.json medium-query-hl-simplified-totals.json medium-query-no-hl-simplified-totals.json
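A condensed sketch of what the analysis scripts do: walk the `profile` section of an ES response and total the time per shard. ES reports times in nanoseconds (`time_in_nanos`); the structure and names below follow the profile format, but the real scripts sum more components (collectors, fetch phases) than this minimal version:

```python
# Minimal sketch: total per-shard time from an ES profile response section.
# ES reports query times in nanoseconds; we convert to milliseconds.
def shard_totals(profile):
    """Return {shard_id: total_ms} summing query time and rewrite time."""
    totals = {}
    for shard in profile["shards"]:
        total_ns = 0
        for search in shard["searches"]:
            total_ns += sum(q["time_in_nanos"] for q in search["query"])
            total_ns += search["rewrite_time"]
        totals[shard["id"]] = total_ns / 1_000_000  # ns -> ms
    return totals

sample = {
    "shards": [
        {
            "id": "[node][index][0]",
            "searches": [
                {"query": [{"time_in_nanos": 2_000_000}], "rewrite_time": 1_000_000}
            ],
        }
    ]
}
totals = shard_totals(sample)
```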
In conclusion, I can see two reasons why performance is affected in prod.
From the production profiles we can confirm that highlighting is expensive, especially in certain scenarios.
big-documents-query-hl profile
This query matched documents with large `plain_text` fields, so we expected highlighting to be expensive here.
The metrics of interest are:
"took_ms": 8419
"max_time_shard_time_ms": 5870.834745
"average_shard_time_ms": 2226.520687433333
We can see the query `took` time is around 8 seconds.
While the average time to process the query on each shard was around 2 seconds, considering that shard operations run in parallel, the global query time should be closer to the time taken by the slowest shard, which was close to 6 seconds.
The actual `took` time is longer, which makes sense since profiling introduces additional overhead, plus some components the profile doesn't measure, such as time spent in the queue, merging shard responses on the coordinating node, and additional work described in the documentation.
big-documents-query-no-hl profile This is the same previous query but without HL.
Metrics of interest are: "took_ms": 1667 "max_time_shard_time_ms": 1341.438036 "average_shard_time_ms": 834.2702551
From these metrics, we can see that in the worst-case scenario, when highlighting is too expensive, a query without HL could be around 5 times faster.
medium-query-hl profile This is a query that matched 243,854 Cases and 1,571,324 Docket Entries Not targeted to match big documents. HL enabled.
Metrics of interest are: "took_ms": 2532 "max_time_shard_time_ms": 1752.211235 "average_shard_time_ms": 638.510290
medium-query-no-hl profile Same query HL disabled.
Metrics of interest are: "took_ms": 1257 "max_time_shard_time_ms": 1269.419189 "average_shard_time_ms": 961.2542206
Since this can be a regular query, we can see that on average, disabling HL can make a query twice as fast.
It's important to note that removing HL will also eliminate the possibility of showing snippets with the `plain_text` content that was matched by the query.
In RECAP, in addition to the main query, we perform two additional queries: - Dockets matched count query
Dockets count: 53,087,704 "took_ms": 12181 "max_time_shard_time_ms": 9087.384 "average_shard_time_ms": 6652.13974
- RECAPDocuments count query. Documents count: 294,651,169 "took_ms": 7810 "max_time_shard_time_ms": 25611.872 "average_shard_time_ms": 5075.93326
From these results, we can observe that count queries are also expensive, especially when they match a large number of documents. Something strange to note: in the RECAPDocuments query, one shard reportedly took around 25 seconds to complete the operation, while the global `took` time is just around 7 seconds, which doesn't make sense. Thus, the real `took` time here could be even longer.
Even though these count queries and the main query are executed in parallel, they could be impacting the overall speed for the user in case a count query takes longer than the main query. Additionally, these extra queries being executed simultaneously might be impacting the overall cluster response time.
Something we can do is add a waffle flag that allows us to test queries without performing count queries, so we can compare the performance improvement directly in production.
Another suggestion Mike proposed is that we could terminate the count queries if they run longer than a short amount of time and just show something like `+10,000 documents`.
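One way to implement that idea, sketched under the assumption that we use the search body's `terminate_after` and `track_total_hits` parameters (both are standard ES options; the exact cap and query are illustrative):

```python
# Sketch: cap a count-style query. terminate_after stops each shard after
# collecting max_hits docs, and track_total_hits stops exact counting past
# the cap, so the UI can render the result as "10,000+ documents".
def capped_count_query(query, max_hits=10_000):
    return {
        "query": query,
        "size": 0,                     # we only want the count, not hits
        "terminate_after": max_hits,   # per-shard early termination
        "track_total_hits": max_hits,  # count exactly only up to the cap
    }

body = capped_count_query({"match": {"plain_text": "apple"}})
```

Note that `terminate_after` is applied per shard, so the reported total can still exceed the cap; the point is bounding the work, not the exact cutoff.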
From these profiles, we can observe that the time each shard takes to process the query varies significantly, depending primarily on two factors:
However, it's not straightforward to determine whether increasing or decreasing the number of shards will help with performance. Increasing the number of shards leads to smaller shards, which are faster to query, but to see a real improvement, this would also require increasing the number of nodes proportionally, so operations can be parallelized. Otherwise, increasing the number of shards could lead to a degradation in performance since it adds overhead to nodes in addition to the coordination and merging results time.
In small clusters, fewer shards could work better due to reduced merging time and coordination effort. https://stackoverflow.com/a/66624665
If we want to explore this alternative, it will require testing multiple configurations to determine what is best according to our setup and resources. Changing the number of shards will necessitate the creation of a new index, and the content can be copied using the re_index API.
Unless you want to test changing the number of shards (which might impact the reindex), I think we're ready to start the new reindex once #3871 is merged.
Thanks. This is a lot to think about, but it seems like the results are:
Highlighting is expensive, but we can't turn it off because then we'd lose snippets.
Counts are expensive. It looks like we might be able to use `track_total_hits` to help with this, though.
We discussed this a bit in https://github.com/freelawproject/courtlistener/pull/3648, but I feel like we didn't consider just setting it to an integer and letting the front end do the `More than 1,000 hits` kind of thing.
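The front-end side of that idea can be sketched from the response shape: when `track_total_hits` is an integer, ES stops counting at that bound and reports `hits.total.relation` as `"gte"` when the real count is larger. This is a sketch, not our actual template code:

```python
# Sketch: turn the hits.total object from an ES response into a UI label.
# With track_total_hits set to an integer, relation is "eq" for exact
# counts and "gte" when the count was cut off at the bound.
def format_hit_count(total):
    """total is the hits.total dict from an ES search response."""
    if total["relation"] == "gte":
        return f"More than {total['value']:,} hits"
    return f"{total['value']:,} hits"

label = format_hit_count({"value": 1000, "relation": "gte"})
```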
We could tinker with shards. More might help, or not.
Something else we haven't discussed (because it's better to think about this after we've optimized the software) is optimizing hardware. I don't actually know where our bottleneck is, whether it's CPU, disk, or networking. I just looked and things seem sort of OK to me, but I guess we can ask Ramiro to do this soon.
Anyway, bottom line, I agree we can set this aside for later.
I spun off my summarizing comment above into https://github.com/freelawproject/courtlistener/issues/4209, where we can continue this in a fresh issue. Closing this mega conversation with my appreciation and gratitude.
@mlissner As we talked about, I've been investigating this issue to determine what is causing some queries to be too slow. Spoiler: highlighting the `plain_text` field is the main reason.
Query Profiling
I started by profiling a text query for RECAPSearch.
I populated my testing cluster with 2,079 cases and 15,936 entries, all taken from the dev database, so it's real data.
To profile a query, it's only required to add `"profile": true` to the query request body and set the cache to `false`. It will then return the normal results with an additional profile key. Here is an example: profile_response.txt
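Put together, the profiled request looks roughly like this (the query body is illustrative; the cache is disabled via the URL, as in `GET recap_vectors/_search?request_cache=false`):

```python
# Sketch of a profiled search request body: "profile": true adds a
# per-shard timing breakdown to the response. The query is illustrative.
profiled_body = {
    "profile": True,
    "query": {"match": {"plain_text": "apple"}},
}
```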
The profiled query was a simple text query:
?type=r&q=apple&order_by=score%20desc
These are the results.
Normal, all child, all HL
This is the query currently being used in production; it includes up to 6 child documents with the following highlighted fields for each child doc:
In this graph showing the results, it is possible to observe the time taken for each query component, measured in ms:
From this, we can see that the phase taking the most time is `InnerHitsPhase`. Adding up all the phases, we get a total of `235.908916 ms` for the entire request.
No child at all
Then I performed the same query excluding all the child documents (inner_hits) from the results:
Total query time: `2.963787 ms`
From this, we can confirm that the phase taking the most time is the display of child documents.
All child, no HL
In the following query, I tested keeping child documents but removing the HL of child fields.
Total query time: `10.64146 ms`
From here, we can see that including child documents in the results without applying highlighting increases the query time, but not as significantly as the query that includes HL.
So, the main problem appears to be the highlighting of child documents.
All child, excluding plain_text HL
The following query includes child documents and HL for all the child fields except for `plain_text`.
Total query time: `14.377129 ms`
Here we can see that HL of `plain_text` is taking the most time.
All child, only plain_text HL
And we can confirm that by checking the inverse query, including only the `plain_text` HL.
Total query time: `219.818077 ms`
From the following query, we can confirm that the issue is not related to extracting `plain_text` and sending it in the results. This is considering that our query excludes `plain_text` from `_source` and instead always displays a `plain_text` fragment via highlighting when the field contains content.

All children, no HL, without excluding plain_text

Total query time:
8.356079 ms
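The `_source` exclusion described above can be sketched like this (a simplified `inner_hits` section; the real query carries more options):

```python
# Sketch: build the inner_hits section with or without the _source
# exclusion. When plain_text is excluded from _source, the UI relies on
# a highlighted fragment instead of the raw field, which is why returning
# the full field is not the bottleneck here.
def inner_hits_source_config(exclude_plain_text: bool) -> dict:
    inner_hits = {"size": 6}
    if exclude_plain_text:
        inner_hits["_source"] = {"excludes": ["plain_text"]}
    return inner_hits

print(inner_hits_source_config(True))
```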
Finally, from the following query profile, we can confirm that the issue is the HL of `plain_text` and that it is directly related to the amount of content to be highlighted. By default, our current HL processes up to `999_999` characters, which is the maximum allowed. Reducing it to `999` decreased the query time.

All children, all HL, limit plain_text HL to 999
Total query time:
18.753498 ms
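A sketch of capping the highlighted length, using the standard `fragment_size`/`no_match_size` highlight parameters (whether the production query tunes exactly these parameters is an assumption on my part):

```python
# Sketch: limit how large a plain_text fragment the highlighter returns.
# fragment_size and no_match_size are standard Elasticsearch highlight
# parameters; no_match_size controls the fragment shown when the field
# contains no match for the query.
def plain_text_highlight(limit: int) -> dict:
    return {
        "fields": {
            "plain_text": {
                "fragment_size": limit,
                "no_match_size": limit,
                "number_of_fragments": 1,
            }
        }
    }

print(plain_text_highlight(999))
```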
We can observe that there is the `HighlightPhase`, which doesn't take much time, and also the `InnerHitsPhase`. According to these tests, it seems that the `InnerHitsPhase` performs the highlighting for child fields and also handles the extraction of child documents and their fields to render them into the results.

In the following graph, we can clearly observe all the previous queries and their respective times:
Benchmarking
The query times shown in the previous profiles vary with each request. I tried to take an average value, but they're not statistically representative. Consequently, I decided to also perform benchmarking to obtain more representative values, including the variable of load. We can replicate this benchmarking in our production cluster to have a reference before launching into production.
The benchmarks were conducted on my local testing cluster, which contains the same number of documents as reported in the previous profiles.
Using a current cluster with documents ensures that results are more accurate compared to an isolated environment with limited documents and different resources from the production cluster.
If we run these benchmarks in the production cluster, it is advisable to do so before going live, as the benchmarks are resource-intensive. It's also recommended to ensure snapshots are up-to-date as a precaution in case something goes wrong.
We'll need a pod that has connectivity to the production cluster via the load balancer, and to install Rally:
pip3 install esrally
Then we need to place the race assets in a directory. (I'll attach all the assets in a following comment.)
And then run:
esrally race --pipeline=benchmark-only --track-path=tracks/recap --target-hosts=localhost:9200 --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'password'"
After tweaking the client options.
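For reference, a minimal Rally track might look like the sketch below (the index name, query body, and iteration counts are placeholders, not the actual race assets attached to this issue):

```python
import json

# Sketch of a minimal esrally track.json, following Rally's track format.
# The index name, query, client count, and iteration counts are placeholders.
track = {
    "version": 2,
    "description": "RECAP highlighting benchmark (sketch)",
    "indices": [{"name": "recap"}],
    "challenges": [
        {
            "name": "hl-benchmark",
            "default": True,
            "schedule": [
                {
                    "operation": {
                        "operation-type": "search",
                        "index": "recap",
                        "cache": False,  # disable the request cache
                        "body": {"query": {"match": {"plain_text": "apple"}}},
                    },
                    "clients": 8,
                    "warmup-iterations": 100,
                    "iterations": 500,
                }
            ],
        }
    ],
}

with open("track.json", "w") as f:
    json.dump(track, f, indent=2)
```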
Here are the results I got from these benchmarks.
The benchmarking requests have the following parameters:
6 child, all HL

This is the query that is currently being used in production; it includes up to 6 child documents with all the selected fields HL.

6 child, no HL

All the child documents, but no HL.

6 child, HL only plain_text

6 child, all HL except plain_text

No child

Query that doesn't display `inner_hits` at all.

6 children, all HL, limit plain_text HL to 999 chars

Query with all the child fields HL, but limiting `plain_text` HL to 999 chars.

In the following graph we can see these results more clearly.
I also conducted some additional benchmarks to evaluate other variables, such as the number of child documents (inner_hits) displayed:
From here, we can see that with all the HL activated, we achieve the best performance by retrieving only 1 child document. From 2 child documents onwards, performance starts to decrease.
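The variants for this benchmark differ only in the `inner_hits` size; a sketch of generating them (the child type and field names are assumed, mirroring the `1_child_all_hl.json` … `6_child_all_hl.json` race assets only in spirit):

```python
# Sketch: generate one query body per inner_hits size (1..6), the variable
# exercised in this benchmark. "recap_document" and "plain_text" are
# illustrative names, not the exact production mapping.
def child_count_variants(max_children: int = 6) -> list:
    variants = []
    for n in range(1, max_children + 1):
        variants.append({
            "query": {
                "has_child": {
                    "type": "recap_document",
                    "query": {"match": {"plain_text": "apple"}},
                    "inner_hits": {"size": n},
                }
            }
        })
    return variants

print(len(child_count_variants()))
```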
Another variable was the type of query requested:
From here, we can see that performance also depends on the number of documents matched and the type of query requested. For instance, a Match All query has better performance than a text query matching fewer documents, because the Match All query does not perform highlighting.

A filter, or a filter combined with a text query, can also have good performance as long as only a few documents are matched.
In brief:
From these benchmarks and the previous profiles, we can conclude that the process taking the most time is the highlighting of `plain_text`. The worst performance is observed in our current query, which includes 6 child documents and highlights all selected child fields. You can see that the best performance is achieved with a query that does not display child documents. However, since displaying child documents is mandatory for our use case, our optimization goal could be to achieve performance similar to the '6 children, all HL except plain_text' query. Therefore, we should consider performance when assessing other highlighters in #3300, to see if it's possible to gain additional efficiency in this process.