FooSoft / yomichan

Japanese pop-up dictionary extension for Chrome and Firefox.
https://foosoft.net/projects/yomichan

Structured content import efficiency issues? #1853

Open Thermospore opened 3 years ago

Thermospore commented 3 years ago

Hello, I am working on converting a dictionary to Yomichan format, and the result makes heavy use of structured content. The source data for the dictionary is in HTML and uses lots of <div> and <sub> tags in particular, which I have more or less maintained.

When you try to import the dictionary, the tab freezes for a few minutes at 0% (presumably while checking the validity of the structured content?) before the import starts. The browser often gives a message saying the tab is unresponsive/crashed and asks if you want to close it, but if you keep waiting the dictionary will eventually continue importing as normal

I'm using Chrome, but someone using Firefox reported the dictionary was stuck importing at 0% for 40 minutes before continuing lol

As a test, I tried forcing the dictionary to plain text. That version doesn't freeze at 0% at all, leading me to believe the issue is due to the structured content

Here is the dictionary

and the plaintext test version

Is there anything that can be done to improve the import speed/experience? Whether that be on my end or on yomichan's end

Thanks for taking a look!!

shoui520 commented 3 years ago

also, it seems to be using only 1 cpu core.

toasted-nutbread commented 3 years ago

The validation process is probably not well optimized for lots of structured content. The freeze is due to the validation step, which occurs entirely before import progress begins, and this freeze is also noticeable on large plaintext dictionaries, maybe 5-30 seconds depending on size. It is effectively a JSON schema validation step on a giant input.

With regards to import speed, it would have to be a change on the Yomichan side, unless you can optimize the structure of your content (omitting divs/spans that aren't necessary).
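
As a rough illustration of trimming unnecessary wrappers on the dictionary author's side, the sketch below unwraps `span` nodes that carry no attributes and merges adjacent text. It assumes structured content nodes are strings, arrays, or objects of the form `{tag, content, ...}`; the `simplify` helper is hypothetical and not part of Yomichan.

```javascript
// Hypothetical helper: remove spans that add nothing but nesting.
// Assumes nodes are strings, arrays, or objects like {tag, content, ...}.
function simplify(node) {
    if (typeof node !== 'object' || node === null) { return node; }
    if (Array.isArray(node)) {
        // Simplify children and splice in any arrays produced by unwrapping.
        const flat = [];
        for (const child of node) {
            const simplified = simplify(child);
            if (Array.isArray(simplified)) { flat.push(...simplified); } else { flat.push(simplified); }
        }
        // Merge adjacent strings into a single text node.
        const merged = [];
        for (const item of flat) {
            const last = merged.length - 1;
            if (typeof item === 'string' && last >= 0 && typeof merged[last] === 'string') {
                merged[last] += item;
            } else {
                merged.push(item);
            }
        }
        return merged.length === 1 ? merged[0] : merged;
    }
    const content = node.content === undefined ? undefined : simplify(node.content);
    const hasOnlyTagAndContent = Object.keys(node).every((key) => key === 'tag' || key === 'content');
    if (node.tag === 'span' && hasOnlyTagAndContent && content !== undefined) {
        return content; // The span has no attributes, so only its content matters.
    }
    return content === undefined ? { ...node } : { ...node, content };
}
```

Spans with a style or other attributes are kept as they are, so only wrappers that cannot affect rendering are dropped.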

> also, it seems to be using only 1 cpu core.

This is expected.


Additional comment: importing can take significantly longer on mobile browsers, and I could easily see a single dictionary taking ~40 minutes to import if the data is massive. Part of this is due to the speed of the database operations (slow), and part of this is the validation step (CPU will in general be slower than desktop).

The validation step is generalized to use a generic JSON schema, and it is also therefore slower than a highly optimized rewrite. The tradeoff here is that while it may be slower, updating the JSON schema does not require any updates to the codebase.
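
The tradeoff can be sketched with a toy example. `validateGeneric` below interprets a tiny, made-up subset of JSON Schema (`type`, `items`, `required`, `properties`) at runtime, while `validateHandwritten` hard-codes the same checks. Neither is Yomichan's validator, and `entrySchema` is a hypothetical schema, not the real term bank schema.

```javascript
// Hypothetical, heavily simplified schema for illustration only.
const entrySchema = {
    type: 'array',
    items: {
        type: 'object',
        required: ['term', 'reading'],
        properties: {
            term: { type: 'string' },
            reading: { type: 'string' }
        }
    }
};

function typeOf(value) {
    if (Array.isArray(value)) { return 'array'; }
    if (value === null) { return 'null'; }
    return typeof value;
}

// Generic: the schema is data, so every value triggers a walk over schema nodes.
function validateGeneric(schema, value) {
    if (schema.type !== undefined && typeOf(value) !== schema.type) { return false; }
    if (schema.type === 'array' && schema.items !== undefined) {
        return value.every((item) => validateGeneric(schema.items, item));
    }
    if (schema.type === 'object') {
        for (const key of schema.required || []) {
            if (!Object.prototype.hasOwnProperty.call(value, key)) { return false; }
        }
        for (const [key, subSchema] of Object.entries(schema.properties || {})) {
            const present = Object.prototype.hasOwnProperty.call(value, key);
            if (present && !validateGeneric(subSchema, value[key])) { return false; }
        }
    }
    return true;
}

// Hand-written: the same checks, hard-coded. Faster, but any schema change means a code change.
function validateHandwritten(value) {
    if (!Array.isArray(value)) { return false; }
    for (const entry of value) {
        if (typeof entry !== 'object' || entry === null || Array.isArray(entry)) { return false; }
        if (typeof entry.term !== 'string' || typeof entry.reading !== 'string') { return false; }
    }
    return true;
}
```

With the generic version, changing the accepted format only means editing `entrySchema`; the hand-written version avoids the per-value schema walk at the cost of that flexibility.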

toasted-nutbread commented 3 years ago

One other point of consideration: in addition to JSON validation, structured content must be parsed for images during import.
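
A minimal sketch of what such a pass could look like, assuming image nodes have the shape `{tag: 'img', path: '...'}` and that children live under `content`. The `collectImagePaths` helper is hypothetical and is not the importer's actual code.

```javascript
// Hypothetical helper: recursively gather image paths from a structured content tree.
// Assumes nodes are strings, arrays, or objects like {tag, content, path}.
function collectImagePaths(node, paths = []) {
    if (node === null || typeof node !== 'object') {
        return paths; // Strings and missing content contain no images.
    }
    if (Array.isArray(node)) {
        for (const child of node) { collectImagePaths(child, paths); }
        return paths;
    }
    if (node.tag === 'img' && typeof node.path === 'string') {
        paths.push(node.path);
    }
    return collectImagePaths(node.content, paths);
}
```

Every node in every definition has to be visited, so the cost of this pass grows with the amount of structured content, independently of schema validation.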

toasted-nutbread commented 3 years ago

You can also test the validation process outside of the browser using node and one of the dev scripts in this repository:

node dev/dictionary-validate.js path/to/dictionary.zip

For reference, validating the dictionary in #1854 took about 7 minutes.

Thermospore commented 3 years ago

Thanks for taking a look and thanks for the info!

During validation, is it possible to keep the tab responsive and/or give some indication that progress is being made? Otherwise users might think the import has failed/crashed

Thermospore commented 3 years ago

In regards to speed, for reference: for complete validation + import, it takes about 3.5min on my nice pc (Ryzen 9 5900X) and about 5min on my laptop (i5-9400H). I suppose I'm fine with that order of magnitude, especially since importing the dict is generally something you do just once

I can probably shave some of that off by doing some clean up, as you mention. I could also try rendering the divs to plain text \ns, which would remove a large amount of structured content. Could also give splitting up the term bank a shot
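
The idea of rendering divs to plain-text newlines might look roughly like the sketch below. It assumes nodes are strings, arrays, or objects of the form `{tag, content, ...}`; `toPlainText` and `renderDefinition` are hypothetical helpers for a conversion script, not Yomichan functions.

```javascript
// Hypothetical helper: returns the text of a subtree made only of strings and
// attribute-less divs, or null if anything else (e.g. a <sub> node) is present.
function toPlainText(node) {
    if (typeof node === 'string') { return node; }
    if (Array.isArray(node)) {
        let text = '';
        for (const child of node) {
            const part = toPlainText(child);
            if (part === null) { return null; }
            text += part;
        }
        return text;
    }
    const isPlainDiv = node !== null && typeof node === 'object' && node.tag === 'div' &&
        Object.keys(node).every((key) => key === 'tag' || key === 'content');
    if (!isPlainDiv) { return null; }
    const inner = node.content === undefined ? '' : toPlainText(node.content);
    return inner === null ? null : inner + '\n'; // Each div ends its own line.
}

// Use plain text where possible; otherwise keep the structured content unchanged.
function renderDefinition(node) {
    const text = toPlainText(node);
    return text === null ? node : text.replace(/\n+$/, '');
}
```

Definitions that still need tags such as `<sub>` fall through untouched, so only the entries that are effectively plain text lose their structured content. Nested divs produce consecutive newlines in this sketch, which may or may not be the desired rendering.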

Hopefully that 40 minute import was just a fluke, but I guess it's a wait and see

toasted-nutbread commented 3 years ago

> During validation, is it possible to keep the tab responsive and/or give some indication that progress is being made? Otherwise users might think the import has failed/crashed

I will probably multithread some parts of the import process, as that should be the easiest way to provide non-blocking progress updates without having to async'ify the entire validation process (which would make it even slower).
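
The general pattern can be sketched in Node with `worker_threads`; a browser extension would use a Web Worker instead, but the message flow is similar. This is only an illustration under assumptions: `validateInWorker`, the message shapes, and the stand-in check (every entry must be a string) are hypothetical and do not reflect how Yomichan implements it.

```javascript
const { Worker } = require('worker_threads');

// Worker body as a string so the example is a single file.
// The loop stays fully synchronous; only progress messages cross threads.
const workerSource = `
const { parentPort, workerData } = require('worker_threads');
const { entries, reportEvery } = workerData;
let invalid = 0;
for (let i = 0; i < entries.length; i++) {
    // Stand-in for real validation: count entries that are not strings.
    if (typeof entries[i] !== 'string') { invalid++; }
    if ((i + 1) % reportEvery === 0 || i + 1 === entries.length) {
        parentPort.postMessage({ type: 'progress', done: i + 1, total: entries.length });
    }
}
parentPort.postMessage({ type: 'result', invalid });
`;

// Runs the validation off the main thread and reports progress via a callback.
function validateInWorker(entries, onProgress, reportEvery = 1000) {
    return new Promise((resolve, reject) => {
        const worker = new Worker(workerSource, {
            eval: true,
            workerData: { entries, reportEvery }
        });
        worker.on('message', (message) => {
            if (message.type === 'progress') {
                onProgress(message.done, message.total);
            } else {
                resolve(message.invalid);
            }
        });
        worker.on('error', reject);
    });
}
```

A caller could then update a progress bar from the callback, for example `validateInWorker(entries, (done, total) => console.log(done + '/' + total))`, while the main thread stays free to respond to the user.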

toasted-nutbread commented 2 years ago

#1868 shows additional progress for all steps of the import process, including validation. The import process has also been moved to a separate thread. This should improve some of the responsiveness issues you mentioned, but doesn't necessarily make it any faster.

I tested on the dictionary you provided in the other issue and it only took around 5 minutes.

Thermospore commented 2 years ago

Just tried it out; looks quite nice. Thanks!

I'll see if that changed anything for the person it took 40 mins for