BoboTiG / ebook-reader-dict

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.
http://www.tiger-222.fr/?d=2020/04/17/22/14/21-un-dictionnaire-alternatif-et-complet-pour-votre-liseuse
MIT License

Arabic and Persian scripts not rendered in italic #584

Closed BoboTiG closed 1 year ago

BoboTiG commented 3 years ago

Wikicode:

De l’{{étyl|ar|fr|mot=طس|tr=tass|sens=[[coupe]], [[écuelle]]}}, lui-même, selon certains, [[emprunté]] au {{étyl|fa|fr|mot=تشت|tr=tašt|sens=tasse}}, [[soucoupe]], ce qui serait inexact selon {{w|Henri Lammens|Lammens}}. Les dérivés''تشت tašt'' et ''طست tast'', respectivement persan et arabe, sont du même radical « de haute antiquité » arabe: ''طس tas'''.

Here, Arabic characters render correctly when they are handled by the étyl template. But when the same characters are part of the "normal" text, they render incorrectly. I mean:

# OK
{{étyl|ar|fr|mot=طس|tr=tass|sens=[[coupe]], [[écuelle]]}}

# BAD
''طس tas''

I will post a screenshot in the coming days to demonstrate the issue.

BoboTiG commented 3 years ago

Here is the screenshot. I do not know yet why this renders badly.

[screenshot: screen_001]

lasconic commented 3 years ago

It would be very cool to have a command to :

BoboTiG commented 3 years ago

Ah yes! Like --gen-dict WORD [WORD...]. :+100

lasconic commented 3 years ago

See #666

Can you reproduce with this dictionary? dicthtml-fr.zip

BoboTiG commented 3 years ago

--gen-dict is a killer feature, what an idea! :)

BoboTiG commented 3 years ago

Same result with that dict.

lasconic commented 3 years ago

The HTML looks good, so I believe the italic is the problem. Either FreeSerif has no italic Arabic glyphs, or the WebKit version on the Kobo doesn't handle italic correctly for these scripts. I'm not sure italic is really used in Arabic or Persian anyway.

BoboTiG commented 3 years ago

Good catch.

lasconic commented 3 years ago

Yes, there is no Arabic in FreeSerif Italic: https://fonts2u.com/free-serif-italic.font

lasconic commented 3 years ago

I'm not sure we can do much more. I couldn't find a Unicode font with italic Arabic.

BoboTiG commented 3 years ago

If it were only "tasse", I would have changed the formatting on Wiktionary. But I doubt it is the only one.

lasconic commented 3 years ago

Can we find them somehow? Look for '' followed by an Arabic letter in the wikicode dump? Or, in render.py, look for <i> followed by Arabic before writing the definitions?
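A minimal sketch of the first idea (scanning raw wikicode), not something taken from the project. The function name `find_italic_arabic` and the sample string are hypothetical. It assumes italic spans are plain `''...''` pairs, so it also fires on bold (`'''`) markup, which is acceptable for a detection pass. It also uses the whole Arabic Unicode block (U+0600 to U+06FF) so that the extra Persian letters are included.

```python
import re

# Italic spans in wikicode are delimited by two apostrophes: ''...''
ITALIC_SPAN = re.compile(r"''(.+?)''")
# Whole Arabic Unicode block; it also contains the extra Persian letters.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+")


def find_italic_arabic(wikicode: str) -> list[str]:
    """Return the Arabic-script runs that sit inside ''italic'' spans."""
    found = []
    for span in ITALIC_SPAN.finditer(wikicode):
        found.extend(ARABIC_RUN.findall(span.group(1)))
    return found


# Hypothetical sample: one template (ignored) and two italic spans (reported).
sample = (
    "{{étyl|ar|fr|mot=\u0637\u0633|tr=tass}} "
    "''\u062a\u0634\u062a tašt'' et ''\u0637\u0633\u062a tast''"
)
print(find_italic_arabic(sample))
```

Text inside the étyl template is not reported because it is not wrapped in apostrophes.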

lasconic commented 3 years ago

    <i>[^<]*([\u0627-\u064a]+)[^<]*</i>

This can probably be improved to capture the Arabic word and move it out of <i></i>...
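One way that improvement could look, as a hedged sketch only: the helper name `unitalicize_arabic` is hypothetical, and it assumes the italic content holds no nested tags (same `[^<]*` restriction as the regex above). Arabic runs are emitted outside the italic, and whatever Latin text remains is re-wrapped.

```python
import re

# An italic element with no nested tags inside it.
ITALIC = re.compile(r"<i>([^<]*)</i>")
# One or more Arabic-block words separated by whitespace, captured as one run.
ARABIC_RUN = re.compile(r"([\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*)")


def unitalicize_arabic(html: str) -> str:
    """Move Arabic-script runs out of <i>...</i>, keeping other text italic."""

    def fix(match: re.Match) -> str:
        # split() with one capturing group alternates: other, Arabic, other, ...
        parts = ARABIC_RUN.split(match.group(1))
        out = []
        for index, part in enumerate(parts):
            core = part.strip()
            if index % 2 == 1 or not core:
                # Arabic run, or empty/whitespace-only text: emit without italic.
                out.append(part)
            else:
                # Keep surrounding whitespace outside the re-opened italic.
                lead = part[: len(part) - len(part.lstrip())]
                trail = part[len(part.rstrip()):]
                out.append(f"{lead}<i>{core}</i>{trail}")
        return "".join(out)

    return ITALIC.sub(fix, html)


print(unitalicize_arabic("<i>\u0637\u0633 tas</i>"))
```

Italic elements without any Arabic come back unchanged, so the function could in principle be applied to every definition.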

lasconic commented 3 years ago

I could detect a few: https://gist.github.com/lasconic/56762057597b1eaa8c0465ab89c4dc22

I added the following to parse_word in render.py and ran --render:


    # Needs `import re` at the top of render.py.
    # U+0627-U+064A covers the basic Arabic letters (alef to yeh). The lazy
    # prefix lets the group capture the first full Arabic run, not only its
    # last letter.
    regex = r"<i>[^<]*?([\u0627-\u064a]+)[^<]*</i>"

    def check_arabic(text: str, where: str) -> None:
        """Report any Arabic run found inside <i>...</i> in `text`."""
        matches = re.findall(regex, text)
        if matches:
            print(f"####ERROR arabic in italic in {where} :{word}", flush=True)
            print(text, flush=True)
            print(matches, flush=True)

    # The loops handle definitions nested in tuples up to two levels deep.
    for definition in definitions:
        if isinstance(definition, tuple):
            for subdef in definition:
                if isinstance(subdef, tuple):
                    for subsubdef in subdef:
                        check_arabic(subsubdef, "definition")
                else:
                    check_arabic(subdef, "definition")
        else:
            check_arabic(definition, "definition")

    if etymology:
        check_arabic(etymology, "etymology")
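The nested loops could also be collapsed with a small recursive generator. This is a sketch under the assumption that definitions contain only strings and tuples, as the isinstance checks above suggest. The names `iter_strings` and `italic_arabic_hits` are hypothetical, and the sample data is made up for illustration.

```python
import re
from typing import Iterator, Union

Nested = Union[str, tuple]

# Same idea as the regex above, with a lazy prefix so the group captures
# the first full Arabic run inside the italic element.
ITALIC_ARABIC = re.compile(r"<i>[^<]*?([\u0627-\u064a]+)[^<]*</i>")


def iter_strings(item: Nested) -> Iterator[str]:
    """Yield every string found in arbitrarily nested tuples."""
    if isinstance(item, tuple):
        for child in item:
            yield from iter_strings(child)
    else:
        yield item


def italic_arabic_hits(definitions: list) -> list[tuple[str, list[str]]]:
    """Return (text, matches) for each string with Arabic inside <i>...</i>."""
    hits = []
    for definition in definitions:
        for text in iter_strings(definition):
            matches = ITALIC_ARABIC.findall(text)
            if matches:
                hits.append((text, matches))
    return hits


# Made-up sample: one flat string and one two-level nested tuple.
sample_definitions = [
    "plain text",
    ("<i>\u0637\u0633 tas</i>", ("<i>ok</i>", "<i>\u062a\u0634\u062a tašt</i>")),
]
for text, matches in italic_arabic_hits(sample_definitions):
    print(text, matches)
```

The recursion removes the fixed two-level limit, at the cost of no longer mirroring the exact structure of the original loops.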