Investigate why some variants are not taken into account

BoboTiG / ebook-reader-dict

Finally decent dictionaries based on the Wiktionary for your beloved eBook reader. Daily updates & 13 languages supported so far.

http://www.tiger-222.fr/?d=2020/04/17/22/14/21-un-dictionnaire-alternatif-et-complet-pour-votre-liseuse

MIT License

408 stars 24 forks source link

Investigate why some variants are not taken into account #1418

Closed BoboTiG closed 1 year ago

BoboTiG commented 1 year ago

Reported by Alex6.

In French, the word encalminés doesn't return definitions. Using our CLI, we can confirm what is available on the Wiktionary: encalminés has 2 variants (encalminé, and encalminer):

$ python -m wikidict fr --get-word encalminés 
encalminés \ɑ̃.kal.mi.ne\  

[variants] encalminer, encalminé

$ python -m wikidict fr --get-word encalminer
encalminer \ɑ̃.kal.mi.ne\  

De calme, avec le préfixe en-.

  1. Faire cesser, ou cesser, d’avancer, par manque de vent.
  2. (Figuré) Bloquer l’avancement, le progrès.

$ python -m wikidict fr --get-word encalminé 
encalminé \ɑ̃.kal.mi.ne\  

(Siècle à préciser) De calme avec les affixes en- et -iné.

  1. (Marine) Immobilisé, en parlant d’un voilier, par l’absence de vent ou à l’abri dans un havre.
  2. (Figuré) En parlant d'une personne : à l'arrêt, en attente de.
[variants] encalminer

I open the issue for more investigations, for instance I didn't check Kobo logs yet.

In such a situation, I would have thought both words to be displayed :thinking:

lasconic commented 1 year ago

To test, we can create a dictionary with these 3 words only.

lasconic commented 1 year ago

My first guess would be that we don't register encalminés because it's a plural and we hoped for Kobo to display encalminé automatically. Because of : https://pgaskin.net/dictutil/dicthtml/format.html and this statement:

It is not necessary to explicitly specify variants for things like plurals with a s suffix, as if a word cannot be found, the first prefix match will be used (i.e. tests will match a headword/variant named test). For this reason, words should be sorted from shortest to longest in alphabetical order in general (to prevent accidental matches of the wrong word).

BoboTiG commented 1 year ago

Arf we no longer have access to logs, Kobo is now encrypting them (cf https://www.mobileread.com/forums/showpost.php?p=4166564&postcount=5). What a shame.

Asked for some hints on https://github.com/pgaskin/dictutil/issues/20.

BoboTiG commented 1 year ago

Another use case:

$ python -m wikidict fr --get-word contingente                                      
contingente \kɔ̃.tɛ̃.ʒɑ̃t\ f. 

  1. (Désuet) (Chez Descartes) Variante de cotangente.
[variants] contingent, contingenter

So the Kobo will only display 1. (Désuet) (Chez Descartes) Variante de cotangente., but the awaited definition here is the one from "contigent".

Same issue with "frauduleuses" that doesn't display definitions (but it should go through "frauduleuse" -> "frauduleux", and display definitions from "fraudlueux").

Maybe could we improve that without having to rewrite the Kobo firmware :D

pgaskin commented 1 year ago

My first guess would be that we don't register encalminés because it's a plural and we hoped for Kobo to display encalminé automatically.

IIRC, I was only referring to English when I said that.

BoboTiG commented 1 year ago

Here are logs when looking for "encalminés":

Nov 10 14:50:25 nickel: (    16.265 @ 0x2a1a7f0 / ui.warning) static QByteArray Unzipper::extractFile(const QString&, const QString&) "/mnt/onboard/.kobo/dict/dicthtml-fr.zip" file not found "prefix_exceptions" 
Nov 10 14:50:36 nickel: (    27.046 @ 0x2a1a7f0 / dictionary.debug)  word:  "encalminés" 
Nov 10 14:50:36 nickel: (    27.215 @ 0x2a1a7f0 / dictionary.debug) got alternative search terms:  ("encalminés", "ENCALMINÉS", "Encalminés", "encalmins")  for word:  "encalminés"

It doesn't look for "encalminé" but "encalmins" :fearful:

BoboTiG commented 1 year ago

Using a custom dict created like:

$ python -m wikidict fr --gen-dict 'encalminer,encalminé,encalminés,encalminées' --output 'issue-1418'

we got that en.raw.html file:

<w>
    <p><a name="encalminé" /><b>encalminé</b> \ɑ̃.kal.mi.ne\<br /><br />
    <p>...</p>
    <var>
        <variant name="encalmines" />
    </var>
</w>

<w>
    <p><a name="encalminer" /><b>encalminer</b> \ɑ̃.kal.mi.ne\<br /><br />
    <p>...</p>
    <var>
        <variant name="encalmine" />
        <variant name="encalmines" />
        <variant name="encalminees" />
    </var>
</w>

</html>

We can see that accents were removed from variants. I'm a little bit rusty, is it expected @lasconic?

BoboTiG commented 1 year ago

Removing the call to https://github.com/BoboTiG/ebook-reader-dict/blob/fe3d7b430ea891f67450f6ccac1edf4e94c80ec4/wikidict/convert.py#L195-L200 fixes the issue.

It was added in #1227. Maybe could we call strip_accents() only for the Spanish dictionary?

BoboTiG commented 1 year ago

If we decided to apply the fix, "encalminées" would redirect to "encalminer" instead of both "encalminer", and "encalminé". And it's expected because on the Wiktionary "encalminées" only links to "encalminer". We would have to add a section on the Wiktionary to fix that (it they like such modifications though).

lasconic commented 1 year ago

About contingente, this is the expected behavior. If a word has a definition we don't use the variants at all.

About frauduleuses, the variant is "frauduleux" already.

python -m wikidict fr --get-word frauduleuses
frauduleuses \fʁɔ.dy.løz\  

[variants] frauduleux

And frauduleux has definitions so, it should be displayed.

If you run

python -m wikidict fr --gen-dict=frauduleux,frauduleuse,frauduleuses --output=test_wik

and check test_wik/tmp/fr.raw.html

lasconic commented 1 year ago

We can see that accents were removed from variants. I'm a little bit rusty, is it expected @lasconic?

The VARIANT must be normalized by trimming whitespace and lowercasing it (following unicode normalization rules, i.e. normalizing accented characters as well) According to https://pgaskin.net/dictutil/dicthtml/format.html again

lasconic commented 1 year ago

If you have a better behaviour without it, let's remove it for spanish as well.

BoboTiG commented 1 year ago

Hm but we can see that not normalizing the variant works (at least for the current example). And normalization was added sooner in the year, quite "late" in the project.

BoboTiG commented 1 year ago

It's just that those lines seem weird: https://github.com/BoboTiG/ebook-reader-dict/blob/fe3d7b430ea891f67450f6ccac1edf4e94c80ec4/wikidict/convert.py#L286-L287

guess_prefix() uses the non-normalized variant, while we store the normalized version in the dict. Maybe such behaviour is inefficient, or only necessary for spanish? :thinking:

BoboTiG commented 1 year ago

(Or maybe latest Kobo updates improved the "search" process, and normalization became obsolete.)

lasconic commented 1 year ago

Yes, I would get rid of the normalisation if it works for french and see if problems arise in spanish.

BoboTiG commented 1 year ago

Yes, I would get rid of the normalisation if it works for french and see if problems arise in spanish.

Let's see then :+1:

BoboTiG commented 1 year ago

About frauduleuses, the variant is "frauduleux" already.

Indeed. I don't know what I did, but it works like a charm.