Support <hiero> mediawiki extension

lasconic commented 3 years ago

Wiktionary page: https://fr.wiktionary.org/wiki/djed

Wikicode:

<hiero>R11</hiero>

Output:

R11

Expected:

Model link, if any: https://www.mediawiki.org/wiki/Extension:WikiHiero https://www.mediawiki.org/wiki/Special:MyLanguage/Extension:WikiHiero/Syntax https://github.com/wikimedia/mediawiki-extensions-wikihiero/blob/366b1226891e609650b4c7f7d925b718c779517c/includes/WikiHiero.php

BoboTiG commented 3 years ago

Should we handle it instead? It seems to be pictures.

lasconic commented 3 years ago

What do you mean? Convert the pictures to GIF and embed them like we do for math?

lasconic commented 3 years ago

BoboTiG commented 3 years ago

I did not have a look at the PHP file that is handling the template. But I guess it is "only" a bunch of files referenced by a key (here "R11"). IF it is that, we could handle it and use inline GIF as we do for math and chem, yes.
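For the simplest case described above (one key, one picture), a minimal sketch could look like the following. It assumes a hypothetical `HIERO_IMAGES` map from sign code to an inline `<img>` tag; the base64 payload is deliberately elided, and the function name is made up for illustration.

```python
import re

# Hypothetical map: sign code -> inline <img> tag (payload elided on purpose).
HIERO_IMAGES = {
    "R11": '<img src="data:image/gif;base64,..."/>',
}

HIERO_TAG = re.compile(r"<hiero>(.*?)</hiero>", re.DOTALL)


def render_simple_hiero(wikicode: str) -> str:
    """Replace <hiero>CODE</hiero> with its image, or with the bare code if unknown."""

    def replace(match: re.Match) -> str:
        code = match.group(1).strip()
        # Fall back to the raw code when no picture is available.
        return HIERO_IMAGES.get(code, code)

    return HIERO_TAG.sub(replace, wikicode)
```

This only covers single-sign codes such as `R11`; anything with separators would fall through to the plain-text fallback.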

lasconic commented 3 years ago

Also https://fr.wiktionary.org/wiki/Ptah

BoboTiG commented 3 years ago

Pictures are there.

WDYT of displaying GIF for the template?

BoboTiG commented 3 years ago

It seems more like several GIFs for "Ptah". I do not know if it is worth handling the template. Let me know your thoughts :)

lasconic commented 3 years ago

It's a bit more complicated than just one GIF indeed. The extension outputs an HTML table and is able to put symbols on top of each other. To know if it's worth the pain, I checked how many times hiero is used in the wikicode we currently render. In French, in 13 words (out of 1,555,588)...

'Sekhmet'
'Apophis'
'Aton'
'Néfertiti'
'Pharaon'
'Ptah'
'Ramsès'
'djed'
'gomme'
'khépesh'
'oasis'
'ouchebti'
'uraeus'

63 in English, out of 677,008 words

'barge'
'barque'
'basalt'
'Hathor'
'Hatshepsut'
'Hatti'
'Moab'
'Ab'
'Set'
'Shemu'
'Neith'
'Nephthys'
'Akhenaten'
'Akhet'
'Sobek'
'Anubis'
'Anuket'
'Sphinx'
'Onuphrius'
'Sutekh'
'Aswan'
'Imhotep'
'Thoth'
'Peret'
'Isis'
'Djahy'
'Jerusalem'
'Tutankhamon'
'Tutankhaten'
'Tybi'
'Unas'
'adobe'
'Wadjet'
'ba'
'Wenis'
'Punt'
'alphabet'
'Ra'
'ammonia'
'Re'
'Retjenu'
'ankh'
'ebony'
'Maat'
'emerald'
'ibis'
'life, prosperity, health'
'lightland'
'lily'
'heqat'
'hieroglyph'
'hin'
'natron'
'oasis'
'plewd'
'sphinx'
'senet'
'tjaty'
'serekh'
'stibium'
'uraeus'
'ushabti'
'trona'

Could be worth it, especially if most of them are sequential and "simple"...
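A count like the one above can be reproduced with a few lines. This is only a sketch: `pages` is a hypothetical mapping from word to raw wikicode, not the project's actual data structure.

```python
def words_using_hiero(pages: dict[str, str]) -> list[str]:
    """Return the sorted titles whose wikicode contains a <hiero> tag."""
    return sorted(title for title, code in pages.items() if "<hiero>" in code)


# Tiny made-up sample, just to show the call.
sample = {
    "djed": "{{S|nom|fr}} <hiero>R11</hiero>",
    "chat": "{{S|nom|fr}} Animal domestique.",
}
print(len(words_using_hiero(sample)), "of", len(sample), "words use hiero")
```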

lasconic commented 3 years ago

For French, here are the codes.

S42-G17*X1-I12
O29 Q3:Q3 I14
i-t:n-N5
pr:aA
Q3:X1-V28-C19
ra:Z1-ms-s-sw
R11
N29-W19-M17-M17*X1-N33:Z2
Aa1:Q3-N37:F23-F51
Aa2-X1:N25
w-S-b-t:y-A53
I12

Some are simple like R11, but most of them contain * or :, which is less simple and would require a table or some CSS...
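To make the structure of these codes concrete, here is a rough parsing sketch. It assumes that `-` (or whitespace) separates blocks, `:` stacks signs vertically, `*` puts signs side by side, and that `*` binds tighter than `:`. It ignores the rest of the WikiHiero syntax (parentheses, cartouches, line breaks, and so on), and `parse_hiero` is a made-up name.

```python
import re


def parse_hiero(code: str) -> list[list[list[str]]]:
    """Split a hiero string into blocks -> stacked rows -> signs within a row."""
    # Blocks are separated by "-" or whitespace.
    blocks = [b for b in re.split(r"[-\s]+", code.strip()) if b]
    # Inside a block, ":" stacks rows and "*" juxtaposes signs in one row.
    return [[row.split("*") for row in block.split(":")] for block in blocks]


def is_simple(code: str) -> bool:
    """True when every block is a single sign, so no table layout is needed."""
    return all(len(rows) == 1 and len(rows[0]) == 1 for rows in parse_hiero(code))
```

With this, `R11` and `I12` count as simple, while `Q3:X1-V28-C19` does not because its first block stacks two signs.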

lasconic commented 3 years ago

Convert the PNGs to GIF and store them as base64 in a map. The resulting file is 655 KB.

import os
from base64 import b64encode
from io import BytesIO

from PIL import Image

# Map each sign code to an <img> tag embedding the glyph as a base64 GIF.
results = {}
for f in os.listdir("."):
    if f.endswith(".png"):
        # File names look like "<prefix>_<code>.png"; keep only the code.
        code = f.split("_", 1)[1].split(".")[0]
        im = BytesIO()
        # Convert to greyscale and save as an optimized GIF in memory.
        Image.open(f).convert("L").save(im, format="gif", optimize=True)
        raw = im.getvalue()
        results[code] = f'<img src="data:image/gif;base64,{b64encode(raw).decode()}"/>'

# Print the map as Python source, with the entry count as a trailing comment.
print("hiero = {")
for t, r in sorted(results.items()):
    print(f'    "{t}": \'{r}\',')
print(f"}}  # {len(results):,}")

lasconic commented 3 years ago

In short, we probably need to reproduce all of the PHP scripts to get decent support.

In particular the tokenizer, https://github.com/wikimedia/mediawiki-extensions-wikihiero/blob/366b1226891e609650b4c7f7d925b718c779517c/includes/HieroTokenizer.php and the render function at https://github.com/wikimedia/mediawiki-extensions-wikihiero/blob/366b1226891e609650b4c7f7d925b718c779517c/includes/WikiHiero.php#L259

Also, some hiero codes use phonemes and not the code used in the PNG filename, so we need a copy of https://github.com/wikimedia/mediawiki-extensions-wikihiero/blob/366b1226891e609650b4c7f7d925b718c779517c/includes/WikiHiero.php#L259
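The phoneme step amounts to one extra lookup before the image map is consulted. The sketch below uses a handful of entries that, as far as I know, are standard Gardiner values; the authoritative table is the one shipped with the extension, and both `PHONEMES` and `resolve_sign` are illustrative names.

```python
# Illustrative subset only; the full table should be copied from the extension.
PHONEMES = {
    "pr": "O1",
    "aA": "O29",
    "ra": "N5",
    "n": "N35",
    "t": "X1",
    "i": "M17",
}


def resolve_sign(token: str, phonemes: dict[str, str] = PHONEMES) -> str:
    """Translate a phoneme to its sign code; codes such as R11 pass through unchanged."""
    return phonemes.get(token, token)


# "pr:aA" (Pharaon) resolves to the file codes O1 and O29.
print([resolve_sign(t) for t in ("pr", "aA")])
```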

It will be hard to unit test the output, since it's only img tags with base64 and a bunch of HTML...
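One possible way to keep such tests readable is to shorten the base64 payloads before comparing, so that expected strings only contain the HTML structure plus a short fingerprint per image. This is a sketch of that idea, not something the project does; `shorten_payloads` is a hypothetical helper.

```python
import hashlib
import re

DATA_URI = re.compile(r'src="data:image/gif;base64,([^"]*)"')


def shorten_payloads(html: str) -> str:
    """Replace each inline GIF payload with the first 8 hex digits of its SHA-1."""

    def replace(match: re.Match) -> str:
        digest = hashlib.sha1(match.group(1).encode()).hexdigest()[:8]
        return f'src="gif:{digest}"'

    return DATA_URI.sub(replace, html)
```

A test can then assert on the table layout and on which cells share the same fingerprint, without pasting kilobytes of base64 into the expected value.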

A bit too much for a Sunday :)

BoboTiG commented 3 years ago

Clearly too much, yes :)

Thanks for the analysis and pre-work ;)

lasconic commented 3 years ago

WIP https://github.com/lasconic/ebook-reader-dict/tree/fix-703-hiero

BoboTiG commented 3 years ago

Nice one!

BoboTiG commented 3 years ago

I was wondering what you think about your patch. Is it worth giving it a try on my side?

lasconic commented 3 years ago

It's kind of linked with the HTML table issue https://github.com/BoboTiG/ebook-reader-dict/issues/1024, since table support is needed. So I would tackle HTML tables first to get some info on how well they work on Kobo before tackling this one.
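To show why table support matters here, this is a minimal layout sketch. It assumes the code has already been split into blocks, each block being a list of stacked rows and each row a list of signs, and that `images` maps a sign to its `<img>` tag. The markup is an assumption for illustration, not what the extension or the WIP branch actually emits.

```python
from html import escape

CELL_STYLE = "text-align:center;vertical-align:middle"


def render_block(rows: list[list[str]], images: dict[str, str]) -> str:
    """Signs in a row sit side by side; rows are stacked with line breaks."""
    lines = ["".join(images.get(sign, escape(sign)) for sign in row) for row in rows]
    return "<br/>".join(lines)


def render_hiero(blocks: list[list[list[str]]], images: dict[str, str]) -> str:
    """Render each block as one table cell in a single-row table."""
    cells = "".join(
        f'<td style="{CELL_STYLE}">{render_block(rows, images)}</td>' for rows in blocks
    )
    return f"<table><tr>{cells}</tr></table>"
```

For `Q3:X1-V28`, the input would be `[[["Q3"], ["X1"]], [["V28"]]]`, giving two cells with the first one holding two stacked signs.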

lasconic commented 3 years ago

Attached is a dictionary containing the French words with hiero from https://github.com/BoboTiG/ebook-reader-dict/issues/703#issuecomment-778651324

dicthtml-fr.zip

BoboTiG commented 3 years ago

That's clean!

I think the cell width should be adapted to the width of the picture it contains. See https://fr.wiktionary.org/wiki/Rams%C3%A8s for example:

In the 1st column, the 2nd picture takes the whole width and is distorted.
The 3rd column is too wide.

But we can live as-is :+1:
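One way to size a cell from its picture is to read the width straight from the GIF header, where the logical screen width and height are stored as two little-endian 16-bit integers right after the 6-byte signature. The sketch below assumes `raw` holds the GIF bytes before base64 encoding; `sized_cell` is a hypothetical helper, and whether the Kobo renderer honours the inline width is untested.

```python
import struct


def gif_size(raw: bytes) -> tuple[int, int]:
    """Return (width, height) read from a GIF header."""
    if raw[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    width, height = struct.unpack("<HH", raw[6:10])
    return width, height


def sized_cell(img_tag: str, raw: bytes) -> str:
    """Wrap an <img> tag in a cell whose width matches the picture."""
    width, _ = gif_size(raw)
    return f'<td style="width:{width}px">{img_tag}</td>'
```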

BoboTiG commented 3 years ago

https://fr.wiktionary.org/wiki/Sekhmet is not displayed very well either.

lasconic commented 3 years ago

Yes, I feel like I'm pushing the limits of the HTML renderer on the Kobo... Here is Sekhmet in Chrome (rendered bigger to be the right size on Kobo). Somehow the styling in the Kobo browser is not the same. Do we know which renderer it is? Probably WebKit, but which version? Maybe it's not the browser but a default CSS applied to tables. Any idea if we can see this CSS somewhere?

Capture 2021-08-23 à 19 54 22

and Ramsès

Capture 2021-08-23 à 19 57 21

BoboTiG commented 3 years ago

I traced it as far as https://github.com/kobolabs/qt-everywhere-opensource-src-4.6.2/blob/master/src/3rdparty/webkit/VERSION to find the WebKit version, but the hash is not helpful (69dd29fbeb12d076741dce70ac6bc155101ccd6f, I could not find it). According to the changelog, it is an old one from 2009-11-30. That mirror only has history up to 2012.

I am not completely sure about this information, but I got the 4.6.2 version of Qt Embedded from the latest Kobo firmware (https://kbdownload1-a.akamaihd.net/firmwares/kobo7/Feb2021/kobo-update-4.26.16704.zip), so it should be right.

lasconic commented 3 years ago

OK, so if they use WebKit for dictionary rendering, it's the one included in Qt 4.6.2.

I investigated the style... I believe I found the problem for Ramsès, not yet for Sekhmet

New French dictionary: dicthtml-fr.zip

BoboTiG commented 3 years ago

About the default CSS, I cannot say it is used in the dictionary area though:

* {padding: 0; margin: 0; }
body { font: %1px %2; }
table, thead, tbody, tr, td, th { font-size: inherit; font-family: inherit; }

(still looking for more data)

lasconic commented 3 years ago

Interesting page for testing: https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:WikiHiero/Exemples

BoboTiG commented 3 years ago

The new version is way better :muscle: The rendering is great!

BoboTiG commented 3 years ago

https://fr.wiktionary.org/wiki/Aton needs more space in column 2. Maybe it is a vertical alignment issue like for Sekhmet. https://fr.wiktionary.org/wiki/Ptah and https://fr.wiktionary.org/wiki/gomme also.
