toasted-nutbread closed this issue 3 years ago
Preliminary info about deleting single dictionaries:
Using the English JMdict as the baseline, there is a fairly wide range in the amount of time it takes to delete a single dictionary, depending largely on how the deletion is implemented. All methods I've tested use native IndexedDB APIs; the least efficient takes ~120 seconds, and the most efficient takes ~25 seconds. For reference, importing the dictionary took ~42 seconds in Firefox and ~52 seconds in Chrome.
Interestingly, Chrome's dictionary import time seemed to get slower and slower with repeated testing, even after restarting the browser. I was seeing import times upwards of 170 seconds. This was the case even if the database was purged or manually cleared. I did not test deleting the entire "dict" database. Reinstalling the extension made import times return to the more normal 50 second range.
Show entries that match the selected expression plus wildcards
While implementing a text parser that works by querying words from the DB, I noticed that long queries are very inefficient. The parsing works by first looking up the beginning of the search text and eating every word it finds before starting a new search. At first I thought the DB is just slow in general, but then I tried limiting each query to options.scanning.length and it worked: https://github.com/siikamiika/yomichan/commit/8944b5aed5b0afb5a6eb690b3861bd81a85260c0. Is that related to how the queries are implemented in Yomichan (does it look up each substring starting from the first character), or is it something about IndexedDB? I wonder if this could be optimized so that at some reasonable length, such as 4 characters, Yomichan switches to a wildcard search and filters out results that don't match the query.
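To illustrate, here is a toy version of that longest-match parsing loop with the per-query length cap applied. lookup stands in for the real (asynchronous) database query, and every name here is illustrative rather than Yomichan's actual API:

```javascript
// Toy longest-match parser: look up the beginning of the text, "eat" the
// longest word found, and repeat. Each lookup is capped at scanLength
// characters, since longer substrings cannot match anyway and only make
// the database query slower.
function parseText(text, scanLength, lookup) {
    const words = [];
    let position = 0;
    while (position < text.length) {
        // Only the first scanLength characters are sent to the lookup.
        const slice = text.substring(position, position + scanLength);
        const match = lookup(slice); // longest matching word, or null
        if (match !== null) {
            words.push(match);
            position += match.length; // eat the matched word
        } else {
            position += 1; // no match; skip one character and retry
        }
    }
    return words;
}
```

In the real extension the lookup would be an asynchronous IndexedDB query; the cap is what the linked commit effectively adds.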
Chrome's dictionary import time seemed to get slower and slower with repeated testing, even after restarting the browser. I was seeing import times upwards of 170 seconds. This was the case even if the database was purged or manually cleared.
This is very interesting information, and also means that deleting individual dictionaries might be a viable alternative to DB purge. Have you found any open bugs about this at bugs.chromium.org?
Is that related to how the queries are implemented in Yomichan (does it look up each substring starting from the first character) or something about IndexedDB
The way that scanning works is that it takes all of the substrings with lengths less than the scanning length, deinflects each, and then checks to see if each deinflection is present in the dictionary.
For example, given 抱きかかえていなければ, the following substrings are processed (possible deinflections are indented):
抱きかかえていなければ
    抱きかかえていなければ
    抱きかかえていなける
    抱きかかえていない
    抱きかかえていなく
    抱きかかえている
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえていなけれ
    抱きかかえていなけれ
    抱きかかえていなける
    抱きかかえていなけれる
    抱きかかえていなける
    抱きかかえていなく
抱きかかえていなけ
    抱きかかえていなけ
    抱きかかえていなく
    抱きかかえていなける
    抱きかかえていなく
抱きかかえていな
    抱きかかえていな
抱きかかえてい
    抱きかかえてい
    抱きかかえてい
    抱きかかえてる
    抱きかかえている
    抱きかかえてう
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえて
    抱きかかえて
    抱きかかえる
    抱きかかえつ
    抱きかかえてる
    抱きかかう
    抱きかかえつ
    抱きかかえて
    抱きかかえる
    抱きかかう
抱きかかえ
    抱きかかえ
    抱きかかう
    抱きかかえる
    抱きかかう
抱きかか
    抱きかか
抱きか
    抱きか
抱き
    抱き
    抱きる
    抱く
    抱くる
抱
    抱
(Note that this word is 11 characters long, so Yomichan will not detect the full word with its default scan length of 10.)
Subsequently, every deinflection must be checked to see if it is in the database.
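A minimal sketch of that scan, assuming a stand-in deinflect function and a dictionary represented as a set (the real deinflector and database are considerably more involved):

```javascript
// Generate every leading substring of the scanned text, longest first,
// up to the configured scan length.
function scanSubstrings(text, scanLength) {
    const substrings = [];
    const maxLength = Math.min(text.length, scanLength);
    for (let length = maxLength; length > 0; --length) {
        substrings.push(text.substring(0, length));
    }
    return substrings;
}

// For each substring, produce its deinflection candidates and test every
// single one against the dictionary; each test corresponds to one
// database lookup in the real implementation.
function findCandidates(text, scanLength, deinflect, dictionary) {
    const results = [];
    for (const substring of scanSubstrings(text, scanLength)) {
        for (const candidate of deinflect(substring)) {
            if (dictionary.has(candidate)) {
                results.push(candidate);
            }
        }
    }
    return results;
}
```

This is why the number of lookups grows with both the scan length and the number of deinflection rules that apply at each length.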
Have you found any open bugs about this at bugs.chromium.org?
I honestly didn't look into it that much; I just noticed Chrome getting slower with repeated testing. If I can verify it's fully reproducible on multiple versions of Chrome, I may start looking at bug reports. I also didn't stress test to see whether this behaviour occurs in Firefox or other browsers.
For example, given 抱きかかえていなければ, the following substrings are processed (possible deinflections are indented):
Right, I forgot that of course you can also inflect words indefinitely, like なくなくなくなく...ない, which becomes negative « negative « negative « ... « negative. However, if you wanted to do precomputation in a way that allows lookups to start from the beginning and switch to wildcard search at some point, you could store all possible single inflections in a separate DB table and, on matches, perform inflection up to the longest match instead of deinflection. I think that would reduce the number of queries a lot, but it would also increase the DB size.
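For what it's worth, the wildcard part could be approximated with an IndexedDB prefix range query. The prefix .. prefix + '\uffff' key-range trick is a common IndexedDB idiom, and the store/index wiring shown in the comments is an assumption, not Yomichan's actual code:

```javascript
// Compute the lower/upper bounds for a prefix ("wildcard") range query.
// '\uffff' sorts after every ordinary character in IndexedDB's string
// key order, so [prefix, prefix + '\uffff'] covers all keys that start
// with the prefix.
function prefixBounds(prefix) {
    return [prefix, prefix + '\uffff'];
}

// In a browser this would be used roughly like:
//   const [lower, upper] = prefixBounds('抱きかか');
//   const request = index.getAll(IDBKeyRange.bound(lower, upper));
// after which entries not matching the full query are filtered out in JS:
function filterByPrefix(entries, prefix) {
    return entries.filter((entry) => entry.startsWith(prefix));
}
```

The trade-off is one larger range query plus in-memory filtering versus many small exact-match queries.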
Given the problems with IndexedDB in general with deleting data and user dictionaries getting purged without notice, maybe it's better not to fix what isn't broken. After your previous DB optimizations, it already looks like the DB isn't bottlenecking the operation much in normal usage; I was thinking of the edge cases.
https://dev.to/skhmt/why-are-indexeddb-operations-significantly-slower-in-chrome-vs-firefox-1bnd https://bugs.chromium.org/p/chromium/issues/detail?id=233529 https://bugs.chromium.org/p/chromium/issues/detail?id=466053
Here's some more fun profiling: IDBIndex.count operations, with and without a range query, are orders of magnitude faster in Firefox on the same dataset.
Timing tests with only KANJIDIC and JMdict (English) installed:
Firefox timings:
Chrome timings:
This also seems to be in line with my "perceived speed" that text scanning takes in Firefox vs Chrome. In Firefox, rapidly moving the mouse with the scan key held has near instant lookup on every character the cursor moves over. In Chrome, it feels more laggy.
Thanks for the research. I followed some links and found that Mozilla has been against the competing Web SQL proposal, which is now deprecated, and that could explain why it has invested more in IndexedDB. Maybe it would make sense to create two classes for a "DB connector", where one uses Web SQL and the other uses IndexedDB, because the insert slowdown doesn't seem to happen for Web SQL on Chrome when tested here.
Anyway, general usability still seems quite good to me on both browsers, and especially Firefox performance has improved compared to what it was before.
Anyway, general usability still seems quite good to me on both browsers, and especially Firefox performance has improved compared to what it was before.
Yeah, I'm not overly concerned with trying to fix it, but it's at least worth knowing about. One other thing I noticed while testing on Chrome is that Dexie will sometimes fail to purge the database due to the operation being "blocked". Not sure what the cause would be, but it seems that this results in the database still taking up the same amount of memory, even though inspection of the database shows that there are no entries and no dictionaries show up as installed.
One easy way to get some performance improvement would be to cache the results of apiTermsFind here, using searchText as the key:
https://github.com/FooSoft/yomichan/blob/83460bcdade28fe77908d3008444a23555f09487/ext/fg/js/frontend.js#L406-L418
Doing it in bg would be better in terms of memory consumption and reusing the same cache on other pages, but there's also the cost of serializing data in the backend and de-serializing it in the frontend. Many devices would surely run out of memory after a while, so some form of cache eviction, such as LRU or FIFO, could be used.
The caches would have to be dropped after importing or deleting dictionaries.
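A minimal LRU cache along those lines, keyed by searchText. This is illustrative only, not the actual implementation; as noted, it would need to be cleared after importing or deleting dictionaries:

```javascript
// Simple LRU cache built on Map, which iterates in insertion order:
// re-inserting on access keeps the least recently used entry first.
class LruCache {
    constructor(maxSize) {
        this._maxSize = maxSize;
        this._map = new Map();
    }

    get(key) {
        if (!this._map.has(key)) { return void 0; }
        // Re-insert to mark this entry as most recently used.
        const value = this._map.get(key);
        this._map.delete(key);
        this._map.set(key, value);
        return value;
    }

    set(key, value) {
        if (this._map.has(key)) { this._map.delete(key); }
        this._map.set(key, value);
        if (this._map.size > this._maxSize) {
            // Evict the least recently used entry (first in iteration order).
            this._map.delete(this._map.keys().next().value);
        }
    }

    // Must be called whenever dictionaries are imported or deleted,
    // since cached lookup results become stale.
    clear() {
        this._map.clear();
    }
}
```

Usage would be along the lines of: check cache.get(searchText) before calling apiTermsFind, and cache.set(searchText, result) afterwards.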
One easy way to get some performance improvement would be to cache the results of apiTermsFind here, using searchText as the key:
I think I actually did that at some point a few months ago, but I never pushed it in because database lookups were sped up so much. It even had eviction criteria such as max lifetime and max size like you mentioned.
Haha, that's a funny coincidence. :smile: I did some testing, and the speedup was noticeable when using more than one dictionary and enabling secondary searches for a dictionary, but I didn't notice much difference with the English JMdict only.
Is it still infeasible to delete an individual dictionary? What about reordering dictionaries? I can create a new GitHub issue for that if there's interest.
Dictionaries can be deleted using the latest testing release, but this change hasn't been worked into a stable release yet. While it works, it's not a quick operation, and it can slow down the browser window while the deletion is in progress.
As for reordering dictionaries, this can already (kind of) be accomplished using the priority system in the settings page. You have to enable the option "Show advanced options", and then each dictionary entry has a text field for priority. I'm not sure if this accomplishes what you want, so let me know if you mean something else.
Resolving since most improvements listed are complete.
This is intended to be an overview of various issues and tasks related to the database, as well as a venue for general discussion.

Database modifying operations
- Use a beforeunload event to warn if the settings page is being closed while a database change operation is occurring (import, purge, delete, etc.). #272 This should help prevent corruptions/lingering data.

Database health validation

Extended database functionality
- IDBKeyRange.bound: It may be possible to use this to some extent, but it is very limited. Maybe only enable it on the search page, since it's likely to be a more expensive operation.

Other
- The Database class is in this weird state where half of it uses Dexie and half of it uses the IndexedDB API. Might be better to just make everything use IndexedDB APIs for consistency.
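As a sketch of the beforeunload item above (names are illustrative; #272 has the real change), the settings page could track in-flight database operations and only warn while one is active:

```javascript
// Count of database operations currently in progress (import, purge,
// delete, etc.). Incremented/decremented around each operation.
let activeDatabaseOperations = 0;

function beginDatabaseOperation() { ++activeDatabaseOperations; }
function endDatabaseOperation() { --activeDatabaseOperations; }

// beforeunload handler: asking the browser to show its confirmation
// prompt requires calling preventDefault and/or setting returnValue.
function onBeforeUnload(event) {
    if (activeDatabaseOperations > 0) {
        event.preventDefault();
        event.returnValue = '';
        return '';
    }
}

// In the settings page this would be wired up as:
//   window.addEventListener('beforeunload', onBeforeUnload);
```

The browser ignores any custom message text these days, but the prompt itself is enough to prevent accidentally interrupting an import.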