FooSoft / yomichan

Japanese pop-up dictionary extension for Chrome and Firefox.
https://foosoft.net/projects/yomichan

Database improvements #270

Closed toasted-nutbread closed 3 years ago

toasted-nutbread commented 4 years ago

This is intended to be an overview of various issues and tasks related to the database, as well as a venue for general discussion.

Database modifying operations

Database health validation

Extended database functionality

Other

toasted-nutbread commented 4 years ago

Preliminary info about deleting single dictionaries:

Using the English JMdict as the baseline, there is a fairly wide range in how long it takes to delete a single dictionary, depending largely on how the deletion is implemented. All methods I've tested use native IndexedDB APIs; the least efficient took ~120 seconds, and the most efficient took ~25 seconds. For reference, importing the dictionary took ~42 seconds in Firefox and ~52 seconds in Chrome.

Interestingly, Chrome's dictionary import time seemed to get slower and slower with repeated testing, even after restarting the browser. I was seeing import times upwards of 170 seconds. This was the case even if the database was purged or manually cleared. I did not test deleting the entire "dict" database. Reinstalling the extension made import times return to the more normal 50 second range.

siikamiika commented 4 years ago

Show entries that match the selected expression plus wildcards

While implementing a text parser that works by querying words from the DB, I noticed that long queries are very inefficient. The parser first looks up the beginning of the search string and consumes every word it finds before starting a new search. At first I thought the DB was too slow in general, but then I tried limiting each query to options.scanning.length and it worked: https://github.com/siikamiika/yomichan/commit/8944b5aed5b0afb5a6eb690b3861bd81a85260c0. Is that related to how the queries are implemented in Yomichan (does it look up each substring starting from the first character), or to something about IndexedDB? I wonder if this could be optimized so that at some reasonable length, such as 4 characters, Yomichan switches to a wildcard search and filters out results that don't match the query.
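For what it's worth, IndexedDB has no true wildcard query, but a prefix search can be emulated with a key range, since string keys sort lexicographically. A minimal sketch of the idea (the `prefixBounds` helper is mine, not anything in Yomichan):

```javascript
// Build the [lower, upper] pair for a lexicographic prefix scan.
// IndexedDB string keys sort by UTF-16 code unit, so every key starting
// with `prefix` sorts between `prefix` and `prefix + '\uffff'`.
function prefixBounds(prefix) {
    return [prefix, prefix + '\uffff'];
}

// In a browser context this would drive an index query, e.g.:
//   const [lo, hi] = prefixBounds('抱きかかえ');
//   index.getAll(IDBKeyRange.bound(lo, hi));
```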

Chrome's dictionary import time seemed to get slower and slower with repeated testing, even after restarting the browser. I was seeing import times upwards of 170 seconds. This was the case even if the database was purged or manually cleared.

This is very interesting information, and also means that deleting individual dictionaries might be a viable alternative to DB purge. Have you found any open bugs about this at bugs.chromium.org?

toasted-nutbread commented 4 years ago

Is that related to how the queries are implemented in Yomichan (does it look up each substring starting from the first character) or something about IndexedDB

The way that scanning works is that it takes all of the substrings with lengths less than the scanning length, deinflects each, and then checks to see if each deinflection is present in the dictionary.

For example, given 抱きかかえていなければ, the following substrings are processed (possible deinflections are indented):

抱きかかえていなければ
    抱きかかえていなければ
    抱きかかえていなける
    抱きかかえていない
    抱きかかえていなく
    抱きかかえている
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえていなけれ
    抱きかかえていなけれ
    抱きかかえていなける
    抱きかかえていなけれる
    抱きかかえていなける
    抱きかかえていなく
抱きかかえていなけ
    抱きかかえていなけ
    抱きかかえていなく
    抱きかかえていなける
    抱きかかえていなく
抱きかかえていな
    抱きかかえていな
    抱きかかえてい
抱きかかえてい
    抱きかかえてい
    抱きかかえてる
    抱きかかえている
    抱きかかえてう
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえて
    抱きかかえて
    抱きかかえる
    抱きかかえつ
    抱きかかえてる
    抱きかかう
    抱きかかえつ
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえ
    抱きかかえ
    抱きかかう
    抱きかかえる
    抱きかかう
抱きかか
    抱きかか
抱きか
    抱きか
抱き
    抱き
    抱きる
    抱く
    抱くる
抱
    抱

(Note that this word is 11 characters, so Yomichan will not detect the full word with its default scan length of 10.)

Subsequently, every deinflection must be checked to see if it is in the database.
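The procedure above can be sketched roughly as follows; `deinflect` and `lookup` are stand-ins for Yomichan's actual deinflector and database code, not its real API:

```javascript
// Enumerate candidate substrings from longest to shortest, mirroring the
// scan shown above: the full scan-length prefix first, then one character
// shorter each time, down to a single character.
function scanCandidates(text, scanLength) {
    const candidates = [];
    for (let len = Math.min(scanLength, text.length); len > 0; --len) {
        candidates.push(text.substring(0, len));
    }
    return candidates;
}

// Hypothetical driver: every deinflection of every substring gets checked
// against the database, which is why longer scan lengths multiply the
// number of queries.
async function findTerms(text, scanLength, deinflect, lookup) {
    const results = [];
    for (const substring of scanCandidates(text, scanLength)) {
        for (const candidate of deinflect(substring)) {
            results.push(...await lookup(candidate)); // one DB query each
        }
    }
    return results;
}
```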

Have you found any open bugs about this at bugs.chromium.org?

I honestly didn't look into it that much; I just noticed Chrome getting slower with repeated testing. If I can verify it's fully reproducible on multiple versions of Chrome, I may start looking at bug reports. I also didn't stress test to see if this behaviour is on Firefox or other browsers.

siikamiika commented 4 years ago

For example, given 抱きかかえていなければ, the following substrings are processed (possible deinflections are indented):

Right, I forgot that you can of course inflect words indefinitely, like なくなくなくなく...ない, which parses as negative « negative « negative « ... « negative. However, if you wanted to do precomputation in a way that would allow lookups to start from the beginning and switch to a wildcard at some point, you could store all possible single inflections in a separate DB table and, on a match, perform inflection up to the longest match instead of deinflection. I think that would greatly reduce the number of queries, but it would also increase the DB size.
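To make the precomputation idea concrete, a rough sketch of generating single-step inflections for an entry; the rule set here is purely illustrative and is not Yomichan's actual deinflection table:

```javascript
// Illustrative single-step inflection rules (suffix -> replacement);
// stand-ins only, not the real rule set.
const RULES = [
    {from: 'る', to: 'て'},   // e.g. 食べる -> 食べて
    {from: 'る', to: 'ない'}, // e.g. 食べる -> 食べない
];

// Generate every form reachable by applying one rule to `word`. The
// proposal is to precompute these for each dictionary entry and store
// them in a separate table, so a lookup can match an inflected surface
// form directly instead of deinflecting at query time.
function singleStepInflections(word) {
    const forms = [];
    for (const {from, to} of RULES) {
        if (word.endsWith(from)) {
            forms.push(word.slice(0, word.length - from.length) + to);
        }
    }
    return forms;
}
```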

Given the problems IndexedDB has in general with deleting data, and user dictionaries getting purged without notice, maybe it's better not to fix what isn't broken. After your previous DB optimizations, it already looks like the DB isn't bottlenecking the operation much in normal usage; I was thinking of the edge cases.

toasted-nutbread commented 4 years ago

https://dev.to/skhmt/why-are-indexeddb-operations-significantly-slower-in-chrome-vs-firefox-1bnd https://bugs.chromium.org/p/chromium/issues/detail?id=233529 https://bugs.chromium.org/p/chromium/issues/detail?id=466053

Here's some more fun profiling: IDBIndex.count operations, with and without a range query, are orders of magnitude faster in Firefox on the same dataset.

Timing tests with only KANJIDIC and JMdict (English) installed:

  1. For each dictionary, count number of entries in each available object store.
  2. Count total entries in each available object store. Single transaction, not filtered based on dictionary.
  3. "Simultaneous" 1. and 2.

Firefox timings:

  1. 33ms
  2. 25ms
  3. 53ms

Chrome timings:

  1. 1990ms
  2. 1824ms
  3. 3649ms

This also seems to be in line with my "perceived speed" that text scanning takes in Firefox vs Chrome. In Firefox, rapidly moving the mouse with the scan key held has near instant lookup on every character the cursor moves over. In Chrome, it feels more laggy.

siikamiika commented 4 years ago

Thanks for the research. I followed some links and found that Mozilla was against the competing Web SQL proposal, which is now deprecated; that could explain why it invested more in IndexedDB. Maybe it would make sense to create two classes for a "DB connector", where one uses Web SQL and the other IndexedDB, because the insert slowdown doesn't seem to happen for Web SQL on Chrome when tested here.

Anyway, general usability still seems quite good to me on both browsers, and especially Firefox performance has improved compared to what it was before.

toasted-nutbread commented 4 years ago

Anyway, general usability still seems quite good to me on both browsers, and especially Firefox performance has improved compared to what it was before.

Yeah, I'm not overly concerned with trying to fix it, but it's at least worth knowing about. One other thing I noticed while testing on Chrome is that Dexie will sometimes fail to purge the database due to the operation being "blocked". Not sure what the cause would be, but it seems that this results in the database still taking up the same amount of memory, even though inspection of the database shows that there are no entries and no dictionaries show up as installed.

siikamiika commented 4 years ago

One easy way to get some performance improvement would be to cache the results of apiTermsFind here, using searchText as the key: https://github.com/FooSoft/yomichan/blob/83460bcdade28fe77908d3008444a23555f09487/ext/fg/js/frontend.js#L406-L418

Doing it in bg would be better in terms of memory consumption and reusing the same cache on other pages, but there's also the cost of serializing data in backend and de-serializing in the frontend. Many devices would surely run out of memory after a while, so some form of cache eviction such as LRU or FIFO could be used.

The caches would have to be dropped after importing or deleting dictionaries.
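A minimal sketch of the suggested LRU eviction, exploiting the fact that a JavaScript Map iterates in insertion order; the class name and `maxSize` parameter are my own, not anything from Yomichan:

```javascript
// Minimal LRU cache, e.g. keyed by searchText. Re-inserting a key on
// access moves it to the "most recently used" end of the Map's
// insertion order, so the first key is always the oldest.
class LruCache {
    constructor(maxSize) {
        this.maxSize = maxSize;
        this.map = new Map();
    }
    get(key) {
        if (!this.map.has(key)) { return undefined; }
        const value = this.map.get(key);
        this.map.delete(key);
        this.map.set(key, value); // refresh recency
        return value;
    }
    set(key, value) {
        this.map.delete(key);
        this.map.set(key, value);
        if (this.map.size > this.maxSize) {
            // Evict the least recently used entry.
            this.map.delete(this.map.keys().next().value);
        }
    }
    clear() { this.map.clear(); } // drop after importing/deleting dictionaries
}
```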

toasted-nutbread commented 4 years ago

One easy way to get some performance improvement would be to cache the results of apiTermsFind here, using searchText as the key:

I think I actually did that at some point a few months ago, but I never pushed it in because database lookups were sped up so much. It even had eviction criteria such as max lifetime and max size like you mentioned.

siikamiika commented 4 years ago

Haha, that's a funny coincidence. :smile: I did some testing and the speedup was noticeable when using more than one dictionary and enabling secondary searches for a dictionary, but I didn't notice much difference with English JMDict only.

seanblue commented 4 years ago

Is it still infeasible to delete an individual dictionary? What about reordering dictionaries? I can create a new GitHub issue for that if there's interest.

toasted-nutbread commented 4 years ago

Dictionaries can be deleted using the latest testing release, but this change hasn't made it into a stable release yet. While it works, it's not a quick operation and can slow down the browser window while the deletion is in progress.

As for reordering dictionaries, this can already (kind of) be accomplished using the priority system in the settings page. You have to enable the option "Show advanced options", and then each dictionary entry has a text field for priority. I'm not sure if this accomplishes what you want, so let me know if you mean something else.

toasted-nutbread commented 3 years ago

Resolving since most improvements listed are complete.