BoboTiG / ebook-reader-dict

Finally decent dictionaries based on the Wiktionary for your beloved eBook reader. Daily updates & 14 languages supported so far.
http://www.tiger-222.fr/?d=2020/04/17/22/14/21-un-dictionnaire-alternatif-et-complet-pour-votre-liseuse
MIT License

PyGlossary conversion errors (missing images) #1183

Closed Moonbase59 closed 2 years ago

Moonbase59 commented 2 years ago

Note from @BoboTiG: issue tightly coupled to #1182, interesting details can be found there too.


I just downloaded, parsed and rendered the EN Wiktionary, and it apparently has some problems with erroneous and/or missing GIFs:

output.txt

All of the .gif files in data/en/res appear to be very poorly rendered formulae (?).

BoboTiG commented 2 years ago

@ilius I have seen those errors since the beginning. As the files are created by PyGlossary, is it expected to have such errors?

ilius commented 2 years ago

What version of PyGlossary are you using? Try again with the latest tag or the main branch.

BoboTiG commented 2 years ago

We are running the latest version. FTR those errors were always present, here is an example 2 months ago: https://github.com/BoboTiG/ebook-reader-dict/runs/4343135917?check_suite_focus=true

BoboTiG commented 2 years ago

I just did not take them into account, too lazy :) They may be important or minor; if they are minor, maybe the exception could be silenced.

ilius commented 2 years ago

Please make sure ~/.cache/pyglossary/ exists, or try again with latest tag.

BoboTiG commented 2 years ago

I see https://github.com/ilius/pyglossary/commit/ea1ddf6d58529f212b3a6bcf96394fe09490d145 👍 :)

If the next version does not fix errors, I'll have a look and report any potential improvement/bug to the PyGlossary repo.

ilius commented 2 years ago

I see ilius/pyglossary@ea1ddf6 👍 :)

That's not a bug fix. 4.4.1 should work too.

Moonbase59 commented 2 years ago

@ilius: True that pyglossary generates the GIFs? If yes, ever thought of generating 8-bit grayscale+alpha PNGs instead? They aren’t much bigger but might provide cleaner output.

And would you know if that’s supported by readers?

BoboTiG commented 2 years ago

Creating ~/.cache/pyglossary/dict-de-de.df_res (specific to that call: python -m wikidict de --convert) does not silence the errors. I'll dig deeper when I find time.

ilius commented 2 years ago

U-huh, got it! https://github.com/ilius/pyglossary/commit/ecf386b80aa24d34a8dc4f31c13b2eeb79260cd3 That was one of the weirdest bugs I ever encountered.

ilius commented 2 years ago

I can add an option to convert gif to png if you want.

Moonbase59 commented 2 years ago

GIF→PNG wouldn’t help much, I think. One of the problems is that the GIF already has a white background, which looks odd in readers using a background color (like GoldenDict). Who/what creates the images in the first place?

As long as we’re generating an HTML dict, it might be even better to generate an SVG (for formulae; with a size), but I’m rather unsure about SVG rendering support in dicts. Then again, a reader would typically use its HTML renderer for that, so we might be lucky.

BoboTiG commented 2 years ago

Actually, on Kobo there is no background color. Here is an example: cercle unité. I checked with the dark mode enabled, and still no background displayed.

I think we are talking about 2 kinds of GIFs, ones generated by the current project (<math>, <chem>, and some hieroglyphs): https://github.com/BoboTiG/ebook-reader-dict/blob/794a7236d46fd91f57cd52c8fe428c635f695ae1/wikidict/utils.py#L488-L503

And ones created by PyGlossary.

The former uses embedded GIFs, as in <img src="data:image/gif;base64,..."/>. The latter takes that information and turns it into real GIF files.

It might be worth looking at how PyGlossary creates those files; maybe there is something to tweak?
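To make the two kinds of GIFs concrete, here is a minimal sketch of how embedded `data:image/gif;base64,...` sources could be turned into real files. It is not PyGlossary's actual code: the function name `extract_gifs`, the `res_dir` argument and the CRC32-based file naming are assumptions made only for illustration.

```python
import base64
import re
import zlib
from pathlib import Path

# Matches src="data:image/gif;base64,<payload>" as produced for embedded GIFs.
DATA_URI = re.compile(r'src="data:image/gif;base64,([A-Za-z0-9+/=]+)"')


def extract_gifs(html: str, res_dir: Path) -> str:
    """Write every embedded GIF into res_dir and point the <img> at the file."""
    res_dir.mkdir(parents=True, exist_ok=True)

    def _save(match: re.Match) -> str:
        raw = base64.b64decode(match.group(1))
        # Hypothetical naming scheme: 8 hex digits of the CRC32 of the bytes.
        name = f"{zlib.crc32(raw):08x}.gif"
        (res_dir / name).write_bytes(raw)
        return f'src="{name}"'

    return DATA_URI.sub(_save, html)
```

Identical images map to the same file name in this sketch, so a formula repeated across entries would be written to one file only.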

Moonbase59 commented 2 years ago

Ah, interesting. Screenshot from an actual reader? It looks way better than on my GoldenDict (which does use a yellowish background, and thus we get "white blocks"). Too bad my Tolinos support none of the formats we currently generate. Must give KOReader a spin, I guess.

Where do the project-generated GIFs come from? I really wonder if SVG could be done (for scaling on any device/device resolution).

BoboTiG commented 2 years ago

Here is the same word using the dark theme:

screen_001

And yes, it is the real screenshot on the Kobo Libra H2O.

BoboTiG commented 2 years ago

SVG would be a killer feature, indeed. Not sure about the support though.

Moonbase59 commented 2 years ago

Looks like PIL only handles raster-type images. We might be able to use 'LA', though (8-bit grayscale+alpha).
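As a rough illustration of the 'LA' idea, the sketch below maps 8-bit gray pixels drawn on a white background to (luminance, alpha) pairs, so that white becomes transparent instead of showing up as a white block. It works on a plain list of pixel values and assumes black ink on white; `gray_to_la` is a hypothetical helper, not part of PIL or of this project. With Pillow, the two resulting bands could presumably be combined into an image with `Image.merge("LA", ...)`.

```python
from typing import List, Tuple


def gray_to_la(pixels: List[int]) -> List[Tuple[int, int]]:
    """Turn 8-bit gray pixels drawn on white into (luminance, alpha) pairs.

    White (255) becomes fully transparent, black (0) fully opaque, and
    anti-aliased grays become partially transparent black.
    """
    return [(0, 255 - value) for value in pixels]


row = [255, 192, 0, 64, 255]
print(gray_to_la(row))
# → [(0, 0), (0, 63), (0, 255), (0, 191), (0, 0)]
```

Because the ink is stored as black with varying opacity, the result blends with whatever background color the reader uses.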

Moonbase59 commented 2 years ago

I haven’t yet installed FR, can I use a fast command to get only "cercle unité" in a dict, for comparison?

BoboTiG commented 2 years ago

I haven’t yet installed FR, can I use a fast command to get only "cercle unité" in a dict, for comparison?

Of course:

mkdir test_wik
python -m wikidict fr --gen-dict='cercle unité' --output=test_wik

The resulting dict can then be found inside the test_wik folder.

You can adapt the command to use a German word (or English one like graph).

Moonbase59 commented 2 years ago

Just love this project for providing a well thought-out foundation! What steps go before? Download/parse/render?

BoboTiG commented 2 years ago

This is a single step, we introduced it to help debugging such issues ;)

lasconic commented 2 years ago

And since a couple hours ago

python -m wikidict fr --gen-dict='cercle unité' --output=test_wik --format=stardict

To get a stardict file, instead of kobo.

Moonbase59 commented 2 years ago

But it will use an already downloaded dump, right? 'cause I’m just downloading FR :-)

lasconic commented 2 years ago

No, it gets the wiki code directly from the web for this article only (or the articles if you pass a comma separated list of words, just like get-word btw)

Moonbase59 commented 2 years ago

Wow! Ok, let me abort and try. D/Ling FR can be done later then.

Moonbase59 commented 2 years ago

Oops:

matthias@e6510:~/Projekte/ebook-reader-dict$ python3 -m wikidict fr --gen-dict='cercle unité' --output=test_wik --format=stardict
>>> Generated dict-fr-fr.df (4,595 bytes)

Traceback (most recent call last):
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/glossary.py", line 905, in _read
    reader.open(filename)
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/plugins/ebook_kobo_dictfile.py", line 71, in open
    TextGlossaryReader.open(self, filename)
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/text_reader.py", line 84, in open
    self._open(filename)
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/text_reader.py", line 80, in _open
    self.loadInfo()
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/text_reader.py", line 131, in loadInfo
    self._pendingEntries.append(self.newEntry(word, defi))
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/text_reader.py", line 113, in newEntry
    return self._glos.newEntry(
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/glossary.py", line 742, in newEntry
    return Entry(
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/entry.py", line 285, in __init__
    raise TypeError(f"invalid defi type {type(defi)}")
TypeError: invalid defi type <class 'tuple'>
Reading file 'test_wik/dict-fr-fr.df' failed.
>>> Generated dict-fr-fr.zip (22 bytes)
matthias@e6510:~/Projekte/ebook-reader-dict$

BoboTiG commented 2 years ago

I hit the issue too, I am currently looking into it :watch:

Moonbase59 commented 2 years ago

The .df looks ok, but the zip is empty.

Looks like a pyglossary bug. I’m using 4.4.1 and tried to convert the .df manually.

ilius commented 2 years ago

https://github.com/ilius/pyglossary/commit/e864fa4cd29bcba024dc10e6b93eda259c228449 Please try again with latest master.

Moonbase59 commented 2 years ago

Okayyy. Next dumb question: how would I install the latest master over my pip3-installed pyglossary, so that it is globally available (in the path) and can also do --ui=gtk?

ilius commented 2 years ago

sudo python3 setup.py install or python3 setup.py install --user

BoboTiG commented 2 years ago

FTR I added a test case to reproduce the current error:

# from up-to-date master branch
$ python -m pytest tests/test_5_gen_dict.py -k cercle

BoboTiG commented 2 years ago

It seems that DictFile is causing issues with PyGlossary:

@ cercle unité
: \sɛʁ.kl‿y.ni.te\  <i>m.</i>
<html><p>Des mots <i>cercle</i>, figure géométrique, et <i>unité</i>.</p><br />
<ol><li><i>(Mathématiques)</i></li><ol style="list-style-type:lower-alpha"><li>On appelle cercle unité de <img style="height:100%;max-height:0.8em;width:auto;vertical-align:bottom" src=""/>, l’ensemble des nombres complexes de module égal à 1 : <img style="height:100%;max-height:0.8em;width:auto;vertical-align:bottom" src=""/>. <br> Il apparait alors clairement que <img style="height:100%;max-height:0.8em;width:auto;vertical-align:bottom" src=""/>.</li><li>De même, on appelle cercle unité de <img style="height:100%;max-height:0.8em;width:auto;vertical-align:bottom" src=""/>, l’ensemble <img style="height:100%;max-height:0.8em;width:auto;vertical-align:bottom" src=""/>.</li></ol></ol>

Moonbase59 commented 2 years ago

Ok, did a:

pip3 uninstall pyglossary
git clone https://github.com/ilius/pyglossary.git
cd pyglossary
python3 setup.py install --user

Ran command above:

python3 -m wikidict fr --gen-dict='cercle unité' --output=test_wik --format=stardict

Output:

matthias@e6510:~/Projekte/ebook-reader-dict$ python3 -m wikidict fr --gen-dict='cercle unité' --output=test_wik --format=stardict
>>> Generated dict-fr-fr.df (4,595 bytes)
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matthias/Projekte/ebook-reader-dict/wikidict/__main__.py", line 122, in <module>
    sys.exit(main())
  File "/home/matthias/Projekte/ebook-reader-dict/wikidict/__main__.py", line 101, in main
    return gen_dict.main(
  File "/home/matthias/Projekte/ebook-reader-dict/wikidict/gen_dict.py", line 25, in main
    run_formatter(StarDictFormat, *args)
  File "/home/matthias/Projekte/ebook-reader-dict/wikidict/convert.py", line 399, in run_formatter
    formater.process()
  File "/home/matthias/Projekte/ebook-reader-dict/wikidict/convert.py", line 356, in process
    self._convert()
  File "/home/matthias/Projekte/ebook-reader-dict/wikidict/convert.py", line 327, in _convert
    Glossary.init()
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/glossary.py", line 1153, in init
    cls.loadPluginsFromJson(pluginsJsonPath)
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/plugin_manager.py", line 53, in loadPluginsFromJson
    with open(jsonPath) as _file:
FileNotFoundError: [Errno 2] No such file or directory: '/home/matthias/.local/lib/python3.8/site-packages/plugins-meta/index.json'
Traceback locals:
    cls = <class 'pyglossary.glossary.Glossary'>
    jsonPath = '/home/matthias/.local/lib/python3.8/site-packages/plugins-met...
    len(jsonPath) = 73
    json = <module 'json' from '/usr/lib/python3.8/json/__init__.py'>
    dirname = <function dirname at 0x7f479ae67820>
    join = <function join at 0x7f479ae67550>

Also:

matthias@e6510:~/Projekte/ebook-reader-dict$ pyglossary --ui=gtk
[CRITICAL] Traceback (most recent call last):
  File "/home/matthias/.local/bin/pyglossary", line 6, in <module>
    main()
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/ui/main.py", line 575, in main
    Glossary.init()
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/glossary.py", line 1153, in init
    cls.loadPluginsFromJson(pluginsJsonPath)
  File "/home/matthias/.local/lib/python3.8/site-packages/pyglossary/plugin_manager.py", line 53, in loadPluginsFromJson
    with open(jsonPath) as _file:
FileNotFoundError: [Errno 2] No such file or directory: '/home/matthias/.local/lib/python3.8/site-packages/plugins-meta/index.json'

ilius commented 2 years ago

Please pull and try again.

ilius commented 2 years ago

Please use pip install . -U instead

BoboTiG commented 2 years ago

That's better :+1:

$ python -m wikidict de --convert                                                    
>>> Loading data/de/data-20220120.json ...
>>> Loaded 133,008 words from data/de/data-20220120.json
>>> Generated dict-de-de.df (34,932,810 bytes)
>>> Generated dicthtml-de-de.zip (10,860,965 bytes)
No module named 'pyglossary.plugin_lib.py310'
error in DataEntry.save: [Errno 2] No such file or directory: '/home/tiger-222/.cache/pyglossary/dict-de-de.df_res/39280735.gif'
error in DataEntry.save: [Errno 2] No such file or directory: '/home/tiger-222/.cache/pyglossary/dict-de-de.df_res/39280735.gif'
error in DataEntry.save: [Errno 2] No such file or directory: '/home/tiger-222/.cache/pyglossary/dict-de-de.df_res/490fdc4a.gif'
error in DataEntry.save: [Errno 2] No such file or directory: '/home/tiger-222/.cache/pyglossary/dict-de-de.df_res/76993ec3.gif'
error in DataEntry.save: [Errno 2] No such file or directory: '/home/tiger-222/.cache/pyglossary/dict-de-de.df_res/76993ec3.gif'
error in DataEntry.save: [Errno 2] No such file or directory: '/home/tiger-222/.cache/pyglossary/dict-de-de.df_res/ba1f03ff.gif'
error in DataEntry.save: [Errno 2] No such file or directory: '/home/tiger-222/.cache/pyglossary/dict-de-de.df_res/f6d31a88.gif'
>>> Generated dict-de-de.zip (10,590,910 bytes)

BoboTiG commented 2 years ago

And even better :muscle:

$ python -m wikidict fr --gen-dict='cercle unité' --output=test_wik --format=stardict
>>> Generated dict-fr-fr.df (4,595 bytes)
No module named 'pyglossary.plugin_lib.py310'
>>> Generated dict-fr-fr.zip (4,246 bytes)

BoboTiG commented 2 years ago

error in DataEntry.save: [Errno 2] No such file or directory: '/home/tiger-222/.cache/pyglossary/dict-de-de.df_res/39280735.gif'

BTW @ilius is it expected that the GIF is not found?
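For readers unfamiliar with this message: `[Errno 2]` is what `open()` raises when a parent directory of the target path does not exist. The thread does not state what the actual cause was in PyGlossary, so the following is only a generic defensive sketch, with `save_resource` as a hypothetical helper name, of creating the directory before writing.

```python
import os


def save_resource(path: str, data: bytes) -> None:
    """Write data to path, creating any missing parent directories first."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes repeated calls for the same directory harmless.
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(data)
```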

ilius commented 2 years ago

BTW @ilius is it expected that the GIF is not found?

https://github.com/ilius/pyglossary/commit/11b2c3a2ede3a8efde6ce2d7ccc1a424b1ba3bec Please try again. Should not see that error again.

ilius commented 2 years ago

Is there a way to set the number of workers in --render?

BoboTiG commented 2 years ago

BTW @ilius is it expected that the GIF is not found?

ilius/pyglossary@11b2c3a Please try again. Should not see that error again.

Works perfectly, thanks!

BoboTiG commented 2 years ago

Is there a way to set the number of workers in --render?

Not yet. I'm on it (cf #1199)!

BoboTiG commented 2 years ago

Is there a way to set the number of workers in --render?

@ilius , you are good to go: --render --workers=N :heavy_check_mark:

Moonbase59 commented 2 years ago

Did a quick one using fresh pulls: Fast, no errors on:

$ python3 -m wikidict fr --gen-dict='cercle unité' --output=test_wik --format=stardict
>>> Generated dict-fr-fr.df (4,595 bytes)
>>> Generated dict-fr-fr.zip (4,246 bytes)

Using ilius’ GoldenDict theme, we can see why raster images are bad, especially w/o transparency:

cercle unité - GoldenDict_001

Will be trying the --workers now. Yesterday got a load of 14 (!) on a quad-core laptop (8 threads).

ilius commented 2 years ago

@ilius , you are good to go: --render --workers=N ✔

Thanks. I'm still not sure how multiprocessing.Pool() works. For example, when I pass --workers=2, it results in Pool(processes=2) correctly. But I can see 10 processes (PIDs) (9 children), only 2 of them running at the same time (the rest are sleeping).
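One way to check how many processes a pool really uses for its tasks is to have each task report its PID. This is a minimal sketch, independent of wikidict's own code; `worker_pids` and its parameters are names invented for the example, and passing `start_method=None` falls back to the platform default.

```python
import multiprocessing
import os
from typing import Optional, Set


def worker_pids(workers: int, tasks: int, start_method: Optional[str] = None) -> Set[int]:
    """Run `tasks` trivial jobs on a pool and return the PIDs that served them."""
    ctx = multiprocessing.get_context(start_method)
    with ctx.Pool(processes=workers) as pool:
        # os.getpid runs inside the worker, so each result is a worker PID.
        results = [pool.apply_async(os.getpid) for _ in range(tasks)]
        return {result.get() for result in results}
```

Called as `worker_pids(2, 20)` from under an `if __name__ == "__main__":` guard, the returned set should hold at most two PIDs. Any additional processes seen in `top` would then come from somewhere else, which this thread does not identify.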

Moonbase59 commented 2 years ago

Using workers=4 here, it jumps between 2 and 4 active. Most of the time, all 4 are active.

It’s a memory hog, of course—eating up all 8 GB RAM on my laptop, plus 1 GB of swap.

Moonbase59 commented 2 years ago

Possible that there’s something wrong with that still?

I could previously (no workers) generate a complete dict, although it would use almost all resources on my laptop. Using workers=4 now, it fills the RAM (122 MB left of 8 GB, 0 bytes left on swap), and swaps itself to death (load average above 40!), had to pull the plug.

Moonbase59 commented 2 years ago

I see workers+1 python3 processes in top, each reserving 2.4 GB RAM. Trying workers=2 now.
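A back-of-the-envelope check of those numbers, assuming each of the workers+1 processes really needs its full 2.4 GB (an upper bound, since `top` may report reserved rather than resident memory and forked processes can share pages). The helper names are made up for this sketch.

```python
def estimated_peak_gb(workers: int, per_process_gb: float) -> float:
    """Upper bound: every worker plus the parent holds a full copy."""
    return (workers + 1) * per_process_gb


def max_workers(ram_gb: float, per_process_gb: float) -> int:
    """Largest worker count whose estimated peak still fits in RAM."""
    return max(int(ram_gb // per_process_gb) - 1, 0)


for n in (2, 4):
    print(f"{n} workers: about {estimated_peak_gb(n, 2.4):.1f} GB")
print("max workers in 8 GB:", max_workers(8.0, 2.4))
```

Under that assumption, four workers would need roughly 12 GB, more than the 8 GB of RAM plus 1 GB of swap mentioned earlier, while two workers would stay below 8 GB.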

ilius commented 2 years ago

Can we close this issue?