NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0
1 stars 0 forks source link

Speed improvement in taxonomy auto-complete v1.1.22 #462

Closed jlegind closed 6 months ago

jlegind commented 6 months ago

Describe the bug I think the auto suggest for taxonomy is borderline slow. Though for nearly 2.5 million rows it is still pretty decent.

To Reproduce Input a taxon name having several versions with different authors, like: "Abacopteris lineata"

Additional context A quick fix could be to only allow wildcard search from the end of the inputted keystrokes:

            self.tableName = 'taxonname'
            fields = {'fullname': f'LIKE lower("{keyStrokes}%")',`

I realize that the auto-suggest from anywhere in the string became a requirement, I still submit this as a fix. It is a quick roll back if there is dissatisfaction from the digitizer staff. We should set index for the taxonname table to speed up queries. This web page has some good suggestions: https://developer.android.com/topic/performance/sqlite-performance-best-practices

FedorSteeman commented 6 months ago

The fix suggested by @jlegind did not lead to noticeable improvement in performance.

I will attempt to install a module called FTS5 into the sqlitefile that has better prospects.

https://www.sqlite.org/fts5.html

FedorSteeman commented 6 months ago

Apparently, FTS5 is already part of sqlite out of the box.

I followed the instructions and created a virtual table referring to the original taxonname table and the performance is drastically improved:

SELECT * FROM taxonname WHERE fullname LIKE '%Draba%';          -- Result: 4435 rows returned in 592ms
SELECT * FROM taxonname_fts WHERE fullname MATCH 'Draba';       -- Result: 4294 rows returned in 10ms
SELECT * FROM taxonname WHERE id IN (
    SELECT id FROM taxonname_fts WHERE fullname MATCH 'Draba'); -- Result: 4294 rows returned in 38ms

There are differences in the number of rows returned, because FTS tokenizes individual words instead of partial entries. This means for the above case it will skip entries like "Xerodraba". However, it is possible to use wild cards with no significant decrease in performance:

SELECT * FROM taxonname WHERE id IN (
    SELECT id FROM taxonname_fts WHERE fullname MATCH 'Draba*'); -- Result: 4304 rows returned in 49ms

I can try and tweak the wild cards some more...

jlegind commented 6 months ago

The FTS improvement looks fantastic. BTW, Do we have an index on the taxon fullname column?

FedorSteeman commented 6 months ago

@jlegind Yes, I already tried with index on that column, but with little result. The FTS option really does the trick, though results aren't identical.

I found that performance is better with the number of keystrokes triggering the query being 4 instead of the original 3.

However, I'd need @bhsi-snm or @PipBrewer to sanction this change if I am to proceed with it. Otherwise I'll revert to 3 again.

FedorSteeman commented 6 months ago

Never mind: I found that adding a limit of 200 to the query dramatically improves performance and 3 characters is fine.