Closed BoboTiG closed 1 year ago
To test, we can create a dictionary with these 3 words only.
My first guess would be that we don't register encalminés because it's a plural and we hoped for Kobo to display encalminé automatically. Because of : https://pgaskin.net/dictutil/dicthtml/format.html and this statement:
It is not necessary to explicitly specify variants for things like plurals with a s suffix, as if a word cannot be found, the first prefix match will be used (i.e. tests will match a headword/variant named test). For this reason, words should be sorted from shortest to longest in alphabetical order in general (to prevent accidental matches of the wrong word).
Arf we no longer have access to logs, Kobo is now encrypting them (cf https://www.mobileread.com/forums/showpost.php?p=4166564&postcount=5). What a shame.
Asked for some hints on https://github.com/pgaskin/dictutil/issues/20.
Another use case:
$ python -m wikidict fr --get-word contingente
contingente \kɔ̃.tɛ̃.ʒɑ̃t\ f.
1. (Désuet) (Chez Descartes) Variante de cotangente.
[variants] contingent, contingenter
So the Kobo will only display 1. (Désuet) (Chez Descartes) Variante de cotangente.
, but the awaited definition here is the one from "contigent".
Same issue with "frauduleuses" that doesn't display definitions (but it should go through "frauduleuse" -> "frauduleux", and display definitions from "fraudlueux").
Maybe could we improve that without having to rewrite the Kobo firmware :D
My first guess would be that we don't register encalminés because it's a plural and we hoped for Kobo to display encalminé automatically.
IIRC, I was only referring to English when I said that.
Here are logs when looking for "encalminés":
Nov 10 14:50:25 nickel: ( 16.265 @ 0x2a1a7f0 / ui.warning) static QByteArray Unzipper::extractFile(const QString&, const QString&) "/mnt/onboard/.kobo/dict/dicthtml-fr.zip" file not found "prefix_exceptions"
Nov 10 14:50:36 nickel: ( 27.046 @ 0x2a1a7f0 / dictionary.debug) word: "encalminés"
Nov 10 14:50:36 nickel: ( 27.215 @ 0x2a1a7f0 / dictionary.debug) got alternative search terms: ("encalminés", "ENCALMINÉS", "Encalminés", "encalmins") for word: "encalminés"
It doesn't look for "encalminé" but "encalmins" :fearful:
Using a custom dict created like:
$ python -m wikidict fr --gen-dict 'encalminer,encalminé,encalminés,encalminées' --output 'issue-1418'
we got that en.raw.html
file:
<w>
<p><a name="encalminé" /><b>encalminé</b> \ɑ̃.kal.mi.ne\<br /><br />
<p>...</p>
<var>
<variant name="encalmines" />
</var>
</w>
<w>
<p><a name="encalminer" /><b>encalminer</b> \ɑ̃.kal.mi.ne\<br /><br />
<p>...</p>
<var>
<variant name="encalmine" />
<variant name="encalmines" />
<variant name="encalminees" />
</var>
</w>
</html>
We can see that accents were removed from variants. I'm a little bit rusty, is it expected @lasconic?
Removing the call to https://github.com/BoboTiG/ebook-reader-dict/blob/fe3d7b430ea891f67450f6ccac1edf4e94c80ec4/wikidict/convert.py#L195-L200 fixes the issue.
It was added in #1227. Maybe could we call strip_accents()
only for the Spanish dictionary?
If we decided to apply the fix, "encalminées" would redirect to "encalminer" instead of both "encalminer", and "encalminé". And it's expected because on the Wiktionary "encalminées" only links to "encalminer". We would have to add a section on the Wiktionary to fix that (it they like such modifications though).
About contingente, this is the expected behavior. If a word has a definition we don't use the variants at all.
About frauduleuses, the variant is "frauduleux" already.
python -m wikidict fr --get-word frauduleuses
frauduleuses \fʁɔ.dy.løz\
[variants] frauduleux
And frauduleux has definitions so, it should be displayed.
If you run
python -m wikidict fr --gen-dict=frauduleux,frauduleuse,frauduleuses --output=test_wik
and check test_wik/tmp/fr.raw.html
We can see that accents were removed from variants. I'm a little bit rusty, is it expected @lasconic?
The VARIANT must be normalized by trimming whitespace and lowercasing it (following unicode normalization rules, i.e. normalizing accented characters as well) According to https://pgaskin.net/dictutil/dicthtml/format.html again
If you have a better behaviour without it, let's remove it for spanish as well.
Hm but we can see that not normalizing the variant works (at least for the current example). And normalization was added sooner in the year, quite "late" in the project.
It's just that those lines seem weird: https://github.com/BoboTiG/ebook-reader-dict/blob/fe3d7b430ea891f67450f6ccac1edf4e94c80ec4/wikidict/convert.py#L286-L287
guess_prefix()
uses the non-normalized variant, while we store the normalized version in the dict. Maybe such behaviour is inefficient, or only necessary for spanish? :thinking:
(Or maybe latest Kobo updates improved the "search" process, and normalization became obsolete.)
Yes, I would get rid of the normalisation if it works for french and see if problems arise in spanish.
Yes, I would get rid of the normalisation if it works for french and see if problems arise in spanish.
Let's see then :+1:
About frauduleuses, the variant is "frauduleux" already.
Indeed. I don't know what I did, but it works like a charm.
Reported by Alex6.
In French, the word encalminés doesn't return definitions. Using our CLI, we can confirm what is available on the Wiktionary: encalminés has 2 variants (encalminé, and encalminer):
I open the issue for more investigations, for instance I didn't check Kobo logs yet.
In such a situation, I would have thought both words to be displayed :thinking: