birchill / 10ten-ja-reader

A browser extension to translate Japanese by hovering over words.
https://addons.mozilla.org/firefox/addon/10ten-ja-reader/
GNU General Public License v3.0
610 stars 45 forks source link

Long expressions #319

Open nicolasmaia opened 4 years ago

nicolasmaia commented 4 years ago

For some reason, Rikai can't recognize the expression 先生と言われるほどの馬鹿でなし.

It's been in JMdict since last year, though: https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2070202

birtles commented 4 years ago

Actually, ignore my previous comment. We have a hard-coded limit for how long a string we look up. It's currently set at 13 characters.

nicolasmaia commented 4 years ago

Can we extend it?

birtles commented 4 years ago

Sure. We should probably do some analysis on the database to see what the longest entries are.

There will be some performance impact too since we'll do a number of more lookups than we need to, but hopefully it's acceptable? It would be nice if we could do some performance comparison though.

SaltfishAmi commented 4 years ago
1       3193
2       58979
3       61481
4       75303
5       49891
6       39011
7       30063
8       21971
9       13759
10      8512
11      5391
12      3409
13      2072
14      1332
15      889
16      554
17      371
18      236
19      168
20      95
21      62
22      51
23      35
24      17
25      16
26      11
27      7
28      3
29      3
30      2
31      2
32      1
33      3
34      2
35      0
36      0
37      1

This is from the bundled dict.idx. Didn't find anything longer than 37 full width characters. The numbers include both kana-only and mixed kana-kanji entries.

birtles commented 4 years ago

Thanks! That's super useful.

I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact.

Perhaps we could extend the limit to 16 for now? 19 at most?

SaltfishAmi commented 4 years ago

Thanks! That's super useful.

I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact.

Perhaps we could extend the limit to 16 for now? 19 at most?

Or maybe this should be adjustable in the options page?

Lebon14 commented 4 years ago

Thanks! That's super useful. I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact. Perhaps we could extend the limit to 16 for now? 19 at most?

Or maybe this should be adjustable in the options page?

This!

birtles commented 4 years ago

I've updated the hard-coded limit to 16 for now in: https://github.com/birtles/rikaichamp/commit/91846e16fc22c197f5c5188be595e8ef3304b5e8

I try really hard to avoid adding options so I'd prefer to wait until I can profile the performance difference properly and, if there's no noticeable impact, just increase the limit to 37.