FooSoft / yomichan

Japanese pop-up dictionary extension for Chrome and Firefox.
https://foosoft.net/projects/yomichan

Database improvements #270

Closed toasted-nutbread closed 3 years ago

toasted-nutbread commented 4 years ago

This is intended to be an overview of various issues and tasks related to the database, as well as a venue for general discussion.

Database modifying operations

Database health validation

Extended database functionality

Other

toasted-nutbread commented 4 years ago

Preliminary info about deleting single dictionaries:

Using the English JMdict as the baseline, there is a fairly wide range in how long it takes to delete a single dictionary, depending largely on how the deletion is implemented. All methods I've tested use native IndexedDB APIs; the least efficient took ~120 seconds, and the most efficient took ~25 seconds. For reference, importing the dictionary took ~42 seconds in Firefox and ~52 seconds in Chrome.

Interestingly, Chrome's dictionary import time seemed to get slower and slower with repeated testing, even after restarting the browser. I was seeing import times upwards of 170 seconds. This was the case even if the database was purged or manually cleared. I did not test deleting the entire "dict" database. Reinstalling the extension made import times return to the more normal 50 second range.

siikamiika commented 4 years ago

Show entries that match the selected expression plus wildcards

While implementing a text parser that works by querying words from the DB, I noticed that long queries are very inefficient. The parser first looks up the beginning of the search string and consumes every word it finds before starting a new search. At first I thought the DB was too slow in general, but then I tried limiting each query to options.scanning.length and it worked: https://github.com/siikamiika/yomichan/commit/8944b5aed5b0afb5a6eb690b3861bd81a85260c0. Is that related to how the queries are implemented in Yomichan (does it look up each substring starting from the first character), or to something about IndexedDB? I wonder if this could be optimized so that at some reasonable length, such as 4 characters, Yomichan switches to a wildcard search and filters out results that don't match the query.
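For what it's worth, IndexedDB has no true wildcard query, but a prefix search can be emulated with a key range, since string keys sort lexicographically. A minimal sketch of the idea (the `prefixBounds` helper is mine, not anything in Yomichan):

```javascript
// Build the [lower, upper] pair for a lexicographic prefix scan.
// IndexedDB string keys sort by UTF-16 code unit, so every key starting
// with `prefix` sorts between `prefix` and `prefix + '\uffff'`.
function prefixBounds(prefix) {
    return [prefix, prefix + '\uffff'];
}

// In a browser context this would drive an index query, e.g.:
//   const [lo, hi] = prefixBounds('抱きかかえ');
//   index.getAll(IDBKeyRange.bound(lo, hi));
```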

Chrome's dictionary import time seemed to get slower and slower with repeated testing, even after restarting the browser. I was seeing import times upwards of 170 seconds. This was the case even if the database was purged or manually cleared.

This is very interesting information, and also means that deleting individual dictionaries might be a viable alternative to DB purge. Have you found any open bugs about this at bugs.chromium.org?

toasted-nutbread commented 4 years ago

Is that related to how the queries are implemented in Yomichan (does it look up each substring starting from the first character) or something about IndexedDB

The way that scanning works is that it takes all of the substrings with lengths less than the scanning length, deinflects each, and then checks to see if each deinflection is present in the dictionary.

For example, given 抱きかかえていなければ, the following substrings are processed (possible deinflections are indented):

抱きかかえていなければ
    抱きかかえていなければ
    抱きかかえていなける
    抱きかかえていない
    抱きかかえていなく
    抱きかかえている
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえていなけれ
    抱きかかえていなけれ
    抱きかかえていなける
    抱きかかえていなけれる
    抱きかかえていなける
    抱きかかえていなく
抱きかかえていなけ
    抱きかかえていなけ
    抱きかかえていなく
    抱きかかえていなける
    抱きかかえていなく
抱きかかえていな
    抱きかかえていな
    抱きかかえてい
抱きかかえてい
    抱きかかえてい
    抱きかかえてる
    抱きかかえている
    抱きかかえてう
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえて
    抱きかかえて
    抱きかかえる
    抱きかかえつ
    抱きかかえてる
    抱きかかう
    抱きかかえつ
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえ
    抱きかかえ
    抱きかかう
    抱きかかえる
    抱きかかう
抱きかか
    抱きかか
抱きか
    抱きか
抱き
    抱き
    抱きる
    抱く
    抱くる
抱
    抱

(Note that this word is 11 characters, so Yomichan will not detect the full word with its default scan length of 10.)

Subsequently, every deinflection must be checked to see if it is in the database.
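The procedure above can be sketched roughly as follows; `deinflect` and `lookup` are stand-ins for Yomichan's actual deinflector and database code, not its real API:

```javascript
// Enumerate candidate substrings from longest to shortest, mirroring the
// scan shown above: the full scan-length prefix first, then one character
// shorter each time, down to a single character.
function scanCandidates(text, scanLength) {
    const candidates = [];
    for (let len = Math.min(scanLength, text.length); len > 0; --len) {
        candidates.push(text.substring(0, len));
    }
    return candidates;
}

// Hypothetical driver: every deinflection of every substring gets checked
// against the database, which is why longer scan lengths multiply the
// number of queries.
async function findTerms(text, scanLength, deinflect, lookup) {
    const results = [];
    for (const substring of scanCandidates(text, scanLength)) {
        for (const candidate of deinflect(substring)) {
            results.push(...await lookup(candidate)); // one DB query each
        }
    }
    return results;
}
```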

Have you found any open bugs about this at bugs.chromium.org?

I honestly didn't look into it that much; I just noticed Chrome getting slower with repeated testing. If I can verify it's fully reproducible on multiple versions of Chrome, I may start looking at bug reports. I also didn't stress test to see if this behaviour is on Firefox or other browsers.

siikamiika commented 4 years ago

For example, given 抱きかかえていなければ, the following substrings are processed (possible deinflections are indented):

Right, I forgot that you can of course inflect words indefinitely, like なくなくなくなく...ない, which parses as negative « negative « negative « ... « negative. However, if you wanted to do precomputation in a way that would allow lookups to start from the beginning and switch to a wildcard at some point, you could store all possible single inflections in a separate DB table and, on a match, perform inflection up to the longest match instead of deinflection. I think that would greatly reduce the number of queries, but it would also increase the DB size.
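To make the precomputation idea concrete, a rough sketch of generating single-step inflections for an entry; the rule set here is purely illustrative and is not Yomichan's actual deinflection table:

```javascript
// Illustrative single-step inflection rules (suffix -> replacement);
// stand-ins only, not the real rule set.
const RULES = [
    {from: 'る', to: 'て'},   // e.g. 食べる -> 食べて
    {from: 'る', to: 'ない'}, // e.g. 食べる -> 食べない
];

// Generate every form reachable by applying one rule to `word`. The
// proposal is to precompute these for each dictionary entry and store
// them in a separate table, so a lookup can match an inflected surface
// form directly instead of deinflecting at query time.
function singleStepInflections(word) {
    const forms = [];
    for (const {from, to} of RULES) {
        if (word.endsWith(from)) {
            forms.push(word.slice(0, word.length - from.length) + to);
        }
    }
    return forms;
}
```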

Given the problems IndexedDB has in general with deleting data, and user dictionaries getting purged without notice, maybe it's better not to fix what isn't broken. After your previous DB optimizations, it already looks like the DB isn't bottlenecking the operation much in normal usage; I was thinking of the edge cases.

toasted-nutbread commented 4 years ago

https://dev.to/skhmt/why-are-indexeddb-operations-significantly-slower-in-chrome-vs-firefox-1bnd https://bugs.chromium.org/p/chromium/issues/detail?id=233529 https://bugs.chromium.org/p/chromium/issues/detail?id=466053

Here's some more fun profiling: IDBIndex.count operations, with and without a range query, are orders of magnitude faster in Firefox on the same dataset.

Timing tests with only KANJIDIC and JMdict (English) installed:

  1. For each dictionary, count number of entries in each available object store.
  2. Count total entries in each available object store. Single transaction, not filtered based on dictionary.
  3. "Simultaneous" 1. and 2.

Firefox timings:

  1. 33ms
  2. 25ms
  3. 53ms

Chrome timings:

  1. 1990ms
  2. 1824ms
  3. 3649ms

This also seems to be in line with my "perceived speed" that text scanning takes in Firefox vs Chrome. In Firefox, rapidly moving the mouse with the scan key held has near instant lookup on every character the cursor moves over. In Chrome, it feels more laggy.

siikamiika commented 4 years ago

Thanks for the research. I followed some links and found that Mozilla was against the competing Web SQL proposal, which is now deprecated; that could explain why it invested more in IndexedDB. Maybe it would make sense to create two classes for a "DB connector", where one uses Web SQL and the other IndexedDB, because the insert slowdown doesn't seem to happen for Web SQL on Chrome when tested here.

Anyway, general usability still seems quite good to me on both browsers, and especially Firefox performance has improved compared to what it was before.

toasted-nutbread commented 4 years ago

Anyway, general usability still seems quite good to me on both browsers, and especially Firefox performance has improved compared to what it was before.

Yeah, I'm not overly concerned with trying to fix it, but it's at least worth knowing about. One other thing I noticed while testing on Chrome is that Dexie will sometimes fail to purge the database due to the operation being "blocked". Not sure what the cause would be, but it seems that this results in the database still taking up the same amount of memory, even though inspection of the database shows that there are no entries and no dictionaries show up as installed.

siikamiika commented 4 years ago

One easy way to get some performance improvement would be to cache the results of apiTermsFind here, using searchText as the key: https://github.com/FooSoft/yomichan/blob/83460bcdade28fe77908d3008444a23555f09487/ext/fg/js/frontend.js#L406-L418

Doing it in bg would be better in terms of memory consumption and reusing the same cache on other pages, but there's also the cost of serializing data in backend and de-serializing in the frontend. Many devices would surely run out of memory after a while, so some form of cache eviction such as LRU or FIFO could be used.

The caches would have to be dropped after importing or deleting dictionaries.
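A minimal sketch of the suggested LRU eviction, exploiting the fact that a JavaScript Map iterates in insertion order; the class name and `maxSize` parameter are my own, not anything from Yomichan:

```javascript
// Minimal LRU cache, e.g. keyed by searchText. Re-inserting a key on
// access moves it to the "most recently used" end of the Map's
// insertion order, so the first key is always the oldest.
class LruCache {
    constructor(maxSize) {
        this.maxSize = maxSize;
        this.map = new Map();
    }
    get(key) {
        if (!this.map.has(key)) { return undefined; }
        const value = this.map.get(key);
        this.map.delete(key);
        this.map.set(key, value); // refresh recency
        return value;
    }
    set(key, value) {
        this.map.delete(key);
        this.map.set(key, value);
        if (this.map.size > this.maxSize) {
            // Evict the least recently used entry.
            this.map.delete(this.map.keys().next().value);
        }
    }
    clear() { this.map.clear(); } // drop after importing/deleting dictionaries
}
```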

toasted-nutbread commented 4 years ago

One easy way to get some performance improvement would be to cache the results of apiTermsFind here, using searchText as the key:

I think I actually did that at some point a few months ago, but I never pushed it in because database lookups were sped up so much. It even had eviction criteria such as max lifetime and max size like you mentioned.

siikamiika commented 4 years ago

Haha, that's a funny coincidence. :smile: I did some testing and the speedup was noticeable when using more than one dictionary and enabling secondary searches for a dictionary, but I didn't notice much difference with English JMDict only.

seanblue commented 4 years ago

Is it still infeasible to delete an individual dictionary? What about reordering dictionaries? I can create a new GitHub issue for that if there's interest.

toasted-nutbread commented 4 years ago

Dictionaries can be deleted using the latest testing release, but this change hasn't made it into a stable release yet. While it works, it's not a quick operation and can slow down the browser window while the deletion is in progress.

As for reordering dictionaries, this can already (kind of) be accomplished using the priority system in the settings page. You have to enable the option "Show advanced options", and then each dictionary entry has a text field for priority. I'm not sure if this accomplishes what you want, so let me know if you mean something else.

toasted-nutbread commented 3 years ago

Resolving since most improvements listed are complete.