LibriVox / librivox-catalog

LibriVox catalog and reader workflow application
https://librivox.org
MIT License
36 stars 17 forks source link

[Sanity Check] Run query once per search #207

Closed redrun45 closed 3 months ago

redrun45 commented 3 months ago

Not urgent, but if I'm on the right track, we can improve search performance by a fair bit.


It looks like controllers/catalog/Search.php is running each search query twice: once for the items on the current page, and then again to count the total number of matches for the search. There doesn't seem to be a clean way to count the total results, but here's one way.

We can't use a regular COUNT(*), because that will stop at our LIMIT of 25. So: the OVER() call (1, 2) lets us COUNT before we discard the other results. But: doing this on each of our UNION sets will give different results, which we'd need to add together selectively. Collecting them all in a sub-query (WITH results AS () gives us a combined set to count.

I don't know how much of a hack this is, and how much overhead it adds, but it at least seems much faster than running the entire search query twice.

garethsime commented 3 months ago

I'll throw in a "works on my machine" for the advanced search (quick search is broken, but that's expected as it hasn't been updated with the window functions yet). It seems to be correct in that it still returns the same results with the same pagination.

As far as performance goes, on my machine, in a very simple test, it does seem to work faster. ab is what I had handy, so I used that against my local install:

ab -n 100 --verbose 'https://librivox.org/advanced_search?title=wells&author=&reader=&keywords=&genre_id=0&status=all&project_type=either&recorded_language=&sort_order=catalog_date&search_page=1&search_form=advanced&q='

New query:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5    5   0.8      5      12
Processing:    73   76   2.1     75      86
Waiting:       73   76   2.1     75      86
Total:         78   81   2.5     80      98

Percentage of the requests served within a certain time (ms)
  50%     80
  66%     81
  75%     81
  80%     82
  90%     83
  95%     86
  98%     89
  99%     98
 100%     98 (longest request)

Old query:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5    5   0.4      5       8
Processing:   136  146  13.9    141     208
Waiting:      136  146  13.9    141     208
Total:        141  152  14.0    146     214

Percentage of the requests served within a certain time (ms)
  50%    146
  66%    151
  75%    155
  80%    157
  90%    163
  95%    171
  98%    214
  99%    214
 100%    214 (longest request)

I tried a few other searches, and the results were pretty much the same - the old one doing two searches took about twice as long as the new one doing one search.

I have no idea how to check the cost of a query on MariaDB à la EXPLAIN on Postgres, googling around it doesn't seem like a doable thing? Either way, that we're getting faster results implies we're probably tying up fewer DB resources.

I had a play around adding different indexes to the search_table table, but I only ended up making performance worse, and getting a p99 of 98ms is already going to be pretty hard to beat I think.

notartom commented 3 months ago

I'm sure there are many other things wrong with search that we can improve and fix and change, but as small incremental improvements go, this one is pretty good. @redrun45 you wrote "[Sanity check]", are you opposed to merging this?

redrun45 commented 3 months ago

@notartom if the technical direction is sound, then no, I'm not opposed. I just have to patch the simplesearch part too. I'll get a second commit ready soon!