BoboTiG / ebook-reader-dict

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.
http://www.tiger-222.fr/?d=2020/04/17/22/14/21-un-dictionnaire-alternatif-et-complet-pour-votre-liseuse
MIT License

New locale: DE #1161

Closed Moonbase59 closed 2 years ago

Moonbase59 commented 2 years ago

My goal is to have (and share) a good German Wiktionary-based dictionary that displays well on small e-reader screens and is a little more informative (i.e., has word form, gender, hyphenation, IPA pronunciation, meaning, abbreviations, synonyms and examples). My main target format would be StarDict, with possible spinoff formats for Kobo (dicthtml?), PocketBook (?) and Tolino (quickdic).

Too bad pyglossary doesn’t support R. Döffinger’s quickdic format, because Tolino devices use that, and we do have a rather large Tolino user base in Germany. Not everybody wants to jailbreak their device…

I currently use DE Wiktionary dumps and a rather brute-force Rexx script to generate a Tabfile, which I then convert to StarDict and dicthtml formats. (See attached screenshots for how it looks in GoldenDict on Linux.)

This is of course a flakey way to do it, and I’d prefer to collaborate with a more sound foundation like yours and integrate it there, also because yours gets auto-updated.

Unfortunately, the HOWTO Add a New Locale section in the wiki here isn’t too detailed, and I’d probably need quite a bit of help to get started. I’m especially unsure about the first two steps and the "Remove all data from the old lang."

So my questions are:

  1. Would you be interested in a German dictionary that should look approximately like the screenshots show?
  2. Is it possible to do, without investing too much time? (There’s a lot of other things I have to spend my time on, but I’d be willing to invest a substantial amount of time to get it started and polished a little.)
  3. Is there any assistance possible in getting me set up to get the first steps done? I reckon that’d be to set up a working environment on my Linux Mint 20.3 machine, do a fork, and start adding a language "de".
  4. Since I know almost nothing about Wiktionary’s internal structures, I fear the templates most. But having had a glance at your code, I think there is some expertise here…

Screenshots: This is how I envision it looking. Users on MobileRead and the German E-Reader Forum have been quite enthusiastic about the first version. Screenshots show the StarDict version used by GoldenDict on a Linux desktop.

[Screenshots: "wiktionary - GoldenDict_001", "Wiktionarys - GoldenDict_001", "Auswahl_194"]

Links to what exists already:

BoboTiG commented 2 years ago

Hello @Moonbase59,

Thanks for your interest. Of course we can manage to add DE, it would be great!

I could add the locale, and all that is needed to support it. Then you will have homework: adding all the "templates". By templates, we mean the code that formats Wiktionary templates. Let me start, and we will see later how to handle templates.

The project already creates dictionary files for StarDict, Kobo (dicthtml), and DictFile (.df).

Please note that the output you want is not currently supported, but we aim to improve that with #1149.

Moonbase59 commented 2 years ago

I could add the locale, and all that is needed to support it. Then you will have homework: adding all the "templates". By templates, we mean the code that formats Wiktionary templates. Let me start, and we will see later how to handle templates.

The project already creates dictionary files for StarDict, Kobo (dicthtml), and DictFile (.df).

Thanks, this sounds like a helpful start, appreciate it!

Please note that the output you want is not currently supported, but we aim to improve that with #1149.

Well, we have to start somewhere. Improving can be done as a second step.

What is the default StarDict format the project generates? 2.4.2 or 3? Using HTML, i.e. sametypesequence=h?

ilius commented 2 years ago

What is the default StarDict format the project generates? 2.4.2 or 3? Using HTML, i.e. sametypesequence=h?

StarDict 3.0.0. PyGlossary automatically detects HTML tags and switches to sametypesequence=h.
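
To check which of these values a generated dictionary actually carries, one can look at its .ifo file, which is plain text with one key=value pair per line after a header line. A minimal sketch, assuming that layout; the `read_ifo` helper and the sample content are illustrative only and are not part of this project or of PyGlossary:

```python
from typing import Dict


def read_ifo(text: str) -> Dict[str, str]:
    """Parse the key=value lines of a StarDict .ifo file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # The header line contains no "=", so it is skipped here.
        if "=" in line:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical .ifo content, for illustration only.
sample = """StarDict's dict ifo file
version=3.0.0
bookname=Example DE-DE dictionary
wordcount=12345
idxfilesize=67890
sametypesequence=h
"""

info = read_ifo(sample)
print("version:", info.get("version"))
print("sametypesequence:", info.get("sametypesequence"))
```

With a real dictionary, the text would come from reading the .ifo file next to the .idx and .dict files.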

Moonbase59 commented 2 years ago

Thanks for the info @ilius. Good to see you here. While we are at it: See any chance pyglossary could support the .quickdic format at some point in time?

Reason being, R. Döffinger doesn’t seem to be interested in developing the generation tools much further, and the dicts generated by his tools have a lot of Wiktionary formatting junk now. (See my DictionaryPC and Dictionary issues.)

Unfortunately, the Tolino alliance decided to use that format for the Tolino e-readers, of which we do have a rather substantial user base here in Germany. Especially since Kobo abandoned the German market and leaves it to Tolino (both Kobo and Tolino hardware now from Rakuten, and they shared markets).

Would be so great to have uniform dictionaries from this project eventually, even for the darn oldstyle V6 quickdic format Tolinos use!

ilius commented 2 years ago

See any chance pyglossary could support the .quickdic format at some point in time?

There is a lot of Java code, and very nested as well, to translate to Python.

And it won't work well for most input glossaries, because QuickDic seems to be a structured or semi-structured format: definitions can only be one line, with no formatting or rich text (in semantic dictionary terminology, it seems every "sense" is an entry).

Most glossaries, by contrast, have multi-line definitions, most likely in HTML, and even in plaintext all senses (forms of a word) are combined in one entry.

So it doesn't seem worth the effort to be honest.

It would be best to write a Java tool (using QuickDic as a library) that converts FreeDict or XDXF to QuickDic. But I'm not a Java developer.

We also discussed it in https://github.com/ilius/pyglossary/issues/314

Moonbase59 commented 2 years ago

Yeah… I do see the problems, having actually looked at the code. It also seems to support from 1…n indexes (ok, good for translation dicts), but since I’m also not a Java dev, I can’t make much of the code.

pyglossary just seemed to be the ideal place to handle it, because it’s widely used and actively developed (thanks for that!). Of course all this came up again, since this project also seems an ideal place for uniform and current e-reader dictionaries.

Too bad all the Tolino users are left out, again, by what was probably a bad design decision.

ilius commented 2 years ago

Is this Tolino? https://mytolino.com/tolino-app/ Why would an ebook reader app use QuickDic format?

Moonbase59 commented 2 years ago

Yes, that’s the app. But their main products are of course the hardware readers (closed Android and Tolino built-in as launcher, reading app and hardware control). I personally have a Tolino Vision 5 and a Tolino Tab 8" tablet (2014, the last one using open Android).

On the tablet, I use Librera and KOReader, both with StarDict. All other Tolino users aren’t that lucky: jailbreaking is quite hard and not everybody wants to do that. You can install your own hyphenation library and (only) quickdic v6 dictionaries, though. At least all of their hardware readers use quickdic; their support even directs you to https://github.com/rdoeffinger/Dictionary/releases/tag/v0.3-oldformat to get "new dictionaries". Sigh.

Germans quite often choose a Tolino over others, because you’re not as locked-in as with Amazon. You can select most major German book sellers (even more than one; "the alliance") and buy from them, Adobe and LCP DRM is supported, and you can use the "Onleihe", a German library network where you can borrow e-books (library access via my local library costs me €15/year, that includes paper and e-books). The hardware is (almost) identical to Kobo, but they’re sold here a lot cheaper. Maybe that’s why they are such a big success over here, despite the old and clunky software. And that’s why I’m sad that we can’t easily have current quickdic dictionaries.

I suspect they chose quickdic because R. Döffinger creates translation dicts for almost any language pair by using Wiktionary’s "other language" data, and quickdic can have 1 index (single language), 2 indexes or even 3 (for translating to and fro), so it requires only one dict for bidirectional lookup.

ilius commented 2 years ago

https://de.wikipedia.org/wiki/Tolino From 2013 to early 2017, the technical marketing partner was Deutsche Telekom. Since the beginning of 2017, the devices have been further developed by the Japanese company Rakuten.

Interesting!

Does it have a popup dictionary that appears when selecting a phrase in a book? And does it show several definitions of a word, like in QuickDic?

Moonbase59 commented 2 years ago

Sorry @BoboTiG, we’re digressing a little… take it as a discussion about future expansion. ;-)

@ilius: You can mark a word (not a phrase although some are in the dict), then select "Look up" or "Translate". The actual screenshots from my Tolino Vision 5 (using the current v6 dicts) show that nothing has been updated in the dict generation software for a while—lots of Wiktionary templates go unhandled. It produces 13 pages (!) for the word "Wiktionary", rather unusable. Translations are better, but just so-so.

Tolino dictionary screenshots.zip

Compare that to what KOReader on my Tolino Tab 8" displays using my Rexx-script-generated StarDict dictionary (I had to enter the word "Wiktionary" because it wasn’t in my e-book):

KOReader StarDict Screenshots.zip

Now you can see why Tolino users are desperate for good dictionaries.

Btw, the QuickDic site and their dictionaries have nothing to do with Mr. Döffinger’s .quickdic format. It was apparently an unintended name collision, and neither of them changed its name.

ilius commented 2 years ago

@Moonbase59 Thanks for the info and screenshots.

I never saw that site. Still strange though.

BTW, QuickDic's license says it was a Google product between 2010 and 2011. Google was using another name back then?

Also, you may want to suggest supporting zim format to QuickDic's author. There is a C++ library libzim that can be used in Java (with little effort I assume). I think it's the best format for Wiktionary. They release zim files for Wiktionary, Wikipedia and (I just realized) a bunch of / every other Wikimedia website! (monthly for Wiktionary and Wikipedia I think, not sure about the rest). https://library.kiwix.org/?lang=deu

Moonbase59 commented 2 years ago

Ah, now I see what you mean. Yes, there is still the QuickDic Android app. They use v7 quickdic, I think. And yes, as far as I know Reimar Döffinger’s version is a fork of an older, discontinued project.

Tolino readers don’t use that app, just the v6 dictionaries (and their own code/API/whatever to read them in the Tolino apps). So we’re stuck with v6 quickdic for Tolino devices. Probably no one will convince the Tolino alliance to invest any efforts in developing new firmware for hundreds of thousands of devices out there. (Guesstimate, they don’t publish sales figures.)

BoboTiG commented 2 years ago

Preliminary support is done. A first version will be made available at https://github.com/BoboTiG/ebook-reader-dict/releases/tag/de in a couple of minutes.

lasconic commented 2 years ago

I have a couple of questions

ilius commented 2 years ago

How does the StarDict look for a German user? https://github.com/BoboTiG/ebook-reader-dict/releases/tag/de

(I'm not German, but) On GoldenDict it looks like Wiktionary! Links in the Contents (TOC) don't work though.

GoldenDict-Wiktionary-StarDict-German

ilius commented 2 years ago

BTW, why don't you compress .df files in releases?

35M     dict-de-de.df
6.6M    dict-de-de.df.bz2
9.1M    dict-de-de.df.gz

You can even directly convert from/to .df.gz or .df.bz2 with PyGlossary.
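
The size comparison above can be reproduced for any file with Python's standard library. A minimal sketch; the `compressed_sizes` helper is illustrative, and the sample bytes are a made-up stand-in for a real .df file, so the absolute numbers say nothing about the actual release files:

```python
import bz2
import gzip
from typing import Dict


def compressed_sizes(data: bytes) -> Dict[str, int]:
    """Return the raw, gzip and bzip2 sizes (in bytes) of the given data."""
    return {
        "raw": len(data),
        "gz": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


# Made-up, highly repetitive text standing in for dictionary content.
sample = ("Haus: Gebäude zum Wohnen\n" * 2000).encode("utf-8")

sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>4}: {size} bytes")
```

For a real check, `data` would be the bytes read from dict-de-de.df.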

Moonbase59 commented 2 years ago

@BoboTiG: Thanks for this fast start! I’ll be checking over the next few days; just rebuilding some crashed machinery here…

Moonbase59 commented 2 years ago

I have a couple of questions

* Is there any way to open a v6 QuickDic file without a Tolino device? Or, assuming one would write a v6 output in this project, how can one test it?

Hmm, I never tried the Tolino Android app, maybe that still works with v6 quickdic?

Maybe the original QuickDic Android app also still reads v6 dictionaries?

Unfortunately, due to the Java overhead inside, the v6 files are substantially larger than the v7 files (which Tolinos can’t read).

* How does the StarDict look for a German user? https://github.com/BoboTiG/ebook-reader-dict/releases/tag/de

StarDicts I usually test on


General: I feel that dicts generated from this project have too much vertical whitespace—not good for people using smaller e-readers. Many still use 6" screens or even smaller. From user feedback, that was the reason my StarDict looks so compressed, with almost no vertical whitespace.

I think the main point is usability here—present concise information "at a glance". Most users of mine quite liked this "short form" which still shows more info than this project’s does. Only the examples can be quite lengthy at times, so I put these at the end.

Dictionary popups are often rather small, too—Here’s an example of how my StarDict looks on a user’s Onyx Leaf. Imagine what we produce here in that small window!

ilius commented 2 years ago

I have collected a list of apps for StarDict format here.

lasconic commented 2 years ago

My question regarding the German StarDict was more about the content (obvious missing words, obvious missing templates etc.). I asked both questions because I don't speak German and I don't have a Tolino device :)

Regarding vertical spacing, I agree. The spacing is not ideal, even on Kobo. We could revisit our HTML a bit and maybe remove a couple of line breaks. @BoboTiG, any thoughts? Worth another issue?

Regarding v6 quickdic, the Java code seems pretty self-contained? https://github.com/rdoeffinger/DictionaryPC/blob/master/src/com/hughes/android/dictionary/engine/DictionaryV6Writer.java

BoboTiG commented 2 years ago

Let's open 2 issues: one for the vertical spacing, to experiment with new things, and one for the DictFile compression :+1:

Moonbase59 commented 2 years ago

My question regarding the German StarDict was more about the content (obvious missing words, obvious missing templates etc.). I asked both questions because I don't speak German and I don't have a Tolino device :)

I understand. Unfortunately, the dict generation doesn’t work from readable text files, but from chunks of the dumped Wiktionary data (when you run it, from data/inputs/wikiSplit/de/DE.data.gz and others). So no way to have a readable input wordlist…

And I don’t speak Java, so I’m unable to understand/change the DictionaryPC code. :-(

Moonbase59 commented 2 years ago

@BoboTiG: Would this be the right time for me to start working on a fork? In order to prepare for PRs?

Man, I haven’t done all this GitHub stuff for a while now… ;-)

BoboTiG commented 2 years ago

@Moonbase59 yep :)

Moonbase59 commented 2 years ago

@BoboTiG: Okay, forked and first small PR done (https://github.com/BoboTiG/ebook-reader-dict/pull/1176).

Next steps—and questions:

Thanks for helping me along!

lasconic commented 2 years ago

How to generate working dicts locally, to test them?

Running

 python -m wikidict de

should generate the German dictionary in DictFile (.df), StarDict and Kobo formats. It's a four-step process, and you can run each step separately if you want (see __main__.py):

 python -m wikidict de --download
 python -m wikidict de --parse 
 python -m wikidict de --render 
 python -m wikidict de --convert 
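
The four steps can also be chained from a small script, which is handy when only one of them needs to be re-run during testing. A minimal sketch, assuming only the four flags listed above; the `build_commands` and `run_steps` helpers are hypothetical and not part of the project:

```python
import subprocess
import sys
from typing import List

# The four steps, in the order given above.
STEPS = ["--download", "--parse", "--render", "--convert"]


def build_commands(locale: str, python: str = sys.executable) -> List[List[str]]:
    """Build one `python -m wikidict <locale> <step>` command per step."""
    return [[python, "-m", "wikidict", locale, step] for step in STEPS]


def run_steps(locale: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands(locale):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the chain as soon as one step fails.
            subprocess.run(command, check=True)


run_steps("de")  # dry run: only prints the four commands
```

Calling `run_steps("de", dry_run=False)` from the project root would actually execute the steps.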

Can I use git add ., i.e. is your .gitignore robust enough to exclude everything unwanted?

Most of the time yes, but probably not if you store all sorts of files in the root of the project.

Must I/Can I build new releases within my fork? Or is that done centrally on your side?

Releases are done automatically by a GitHub workflow.

Moonbase59 commented 2 years ago

Thanks! It’s all Python 3, right? (I have no python, would use python3 then.)

Next question: Gender. In Germany, we have a legal "third gender" since 2018, to be used for transgender persons. It is usually abbreviated "d" (for "divers") but it seems it is not (yet?) used in the German Wiktionary. At least, I couldn’t find any entries.

For nouns, they currently only use variants/combinations of "f" (feminine), "m" (masculine), "n" (neuter) and "u" (utrum=common gender; used in Swedish: nouns are either neuter or utrum).

It follows that we (in general) should support the "u" in gender templates, but currently don’t have anything for Germany’s third "d" gender. Unfortunately I can’t readily find the comments in the PR where we talked about it, but I’d suggest going for a char class of [fmnu] in the regex and implement anything more whenever Wiktionary decides what to do about the "d" gender.

Objections?
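
A minimal sketch of the suggested character class, for illustration only. The pattern, the `find_genders` helper and the sample wikitext are assumptions, not the regex actually used in wikidict/lang/de:

```python
import re
from typing import List

# Hypothetical pattern: a template whose name consists only of the gender
# letters f, m, n and u, such as {{n}} or {{mf}}.
GENDER = re.compile(r"\{\{([fmnu]+)\}\}")


def find_genders(wikicode: str) -> List[str]:
    """Return all gender markers found in a piece of wikitext."""
    return GENDER.findall(wikicode)


print(find_genders("{{Wortart|Substantiv|Deutsch}}, {{n}}"))  # ['n']
print(find_genders("{{mf}}"))  # ['mf']
print(find_genders("{{d}}"))   # [] since "d" is not in the class yet
```

Supporting a future "d" marker would then only mean extending the class to [dfmnu].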

Moonbase59 commented 2 years ago

matthias@e6510:~/Projekte/ebook-reader-dict$ python3 -m wikidict de
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matthias/Projekte/ebook-reader-dict/wikidict/__main__.py", line 42, in <module>
    from docopt import docopt
ModuleNotFoundError: No module named 'docopt'

Install docopt using pip3? With or without sudo?

EDIT: Installed docopt, next missing is cachetools… Is there a dependency list somewhere, or something like "auto-install deps"?

Ah, requirements.txt and requirements-tests.txt! Is there a command I can use to install the requirements from these files?

Ah, ok, got it:

pip3 install -r requirements.txt
pip3 install -r requirements-tests.txt

BUT now I get an error about an incompatible urllib3 version:

matthias@e6510:~/Projekte/ebook-reader-dict$ pip3 install -r requirements-tests.txt
Requirement already satisfied: setuptools>=36.2.1 in /usr/lib/python3/dist-packages (from -r requirements.txt (line 1)) (45.2.0)
Requirement already satisfied: beautifulsoup4==4.10.0 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 2)) (4.10.0)
Requirement already satisfied: cachetools==5.0.0 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 3)) (5.0.0)
Requirement already satisfied: docopt==0.6.2 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 4)) (0.6.2)
Requirement already satisfied: marisa-trie==0.7.7 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 5)) (0.7.7)
Requirement already satisfied: mistune==2.0.2 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 6)) (2.0.2)
Requirement already satisfied: pillow==9.0.0 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 7)) (9.0.0)
Requirement already satisfied: pyglossary==4.4.1 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 8)) (4.4.1)
Requirement already satisfied: requests==2.27.1 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 9)) (2.27.1)
Requirement already satisfied: sympy==1.9 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 10)) (1.9)
Requirement already satisfied: wikitextparser==0.48.0 in /home/matthias/.local/lib/python3.8/site-packages (from -r requirements.txt (line 11)) (0.48.0)
Collecting black==21.12b0
  Downloading black-21.12b0-py3-none-any.whl (156 kB)
     |████████████████████████████████| 156 kB 3.6 MB/s 
Collecting flake8==4.0.1
  Downloading flake8-4.0.1-py2.py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 1.6 MB/s 
Collecting mypy==0.931
  Downloading mypy-0.931-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (16.3 MB)
     |████████████████████████████████| 16.3 MB 9.9 MB/s 
Collecting pdoc3==0.10.0
  Downloading pdoc3-0.10.0-py3-none-any.whl (135 kB)
     |████████████████████████████████| 135 kB 11.5 MB/s 
Collecting pydocstyle==6.1.1
  Downloading pydocstyle-6.1.1-py3-none-any.whl (37 kB)
Collecting pytest==6.2.5
  Downloading pytest-6.2.5-py3-none-any.whl (280 kB)
     |████████████████████████████████| 280 kB 6.9 MB/s 
Collecting pytest-cov==3.0.0
  Downloading pytest_cov-3.0.0-py3-none-any.whl (20 kB)
Collecting pytest-dependency==0.5.1
  Downloading pytest-dependency-0.5.1.tar.gz (27 kB)
Collecting responses==0.17.0
  Downloading responses-0.17.0-py2.py3-none-any.whl (38 kB)
Collecting types-cachetools==4.2.9
  Downloading types_cachetools-4.2.9-py3-none-any.whl (4.7 kB)
Collecting types-requests==2.27.7
  Downloading types_requests-2.27.7-py3-none-any.whl (11 kB)
Requirement already satisfied: soupsieve>1.2 in /usr/lib/python3/dist-packages (from beautifulsoup4==4.10.0->-r requirements.txt (line 2)) (1.9.5)
Requirement already satisfied: charset-normalizer~=2.0.0; python_version >= "3" in /home/matthias/.local/lib/python3.8/site-packages (from requests==2.27.1->-r requirements.txt (line 9)) (2.0.11)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests==2.27.1->-r requirements.txt (line 9)) (2019.11.28)
Requirement already satisfied: idna<4,>=2.5; python_version >= "3" in /usr/lib/python3/dist-packages (from requests==2.27.1->-r requirements.txt (line 9)) (2.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/lib/python3/dist-packages (from requests==2.27.1->-r requirements.txt (line 9)) (1.25.8)
Requirement already satisfied: mpmath>=0.19 in /home/matthias/.local/lib/python3.8/site-packages (from sympy==1.9->-r requirements.txt (line 10)) (1.2.1)
Requirement already satisfied: regex in /home/matthias/.local/lib/python3.8/site-packages (from wikitextparser==0.48.0->-r requirements.txt (line 11)) (2022.1.18)
Requirement already satisfied: wcwidth in /home/matthias/.local/lib/python3.8/site-packages (from wikitextparser==0.48.0->-r requirements.txt (line 11)) (0.2.5)
Collecting tomli<2.0.0,>=0.2.6
  Downloading tomli-1.2.3-py3-none-any.whl (12 kB)
Collecting typing-extensions>=3.10.0.0
  Downloading typing_extensions-4.0.1-py3-none-any.whl (22 kB)
Collecting click>=7.1.2
  Downloading click-8.0.3-py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 3.5 MB/s 
Collecting platformdirs>=2
  Downloading platformdirs-2.4.1-py3-none-any.whl (14 kB)
Collecting mypy-extensions>=0.4.3
  Downloading mypy_extensions-0.4.3-py2.py3-none-any.whl (4.5 kB)
Collecting pathspec<1,>=0.9.0
  Downloading pathspec-0.9.0-py2.py3-none-any.whl (31 kB)
Collecting pyflakes<2.5.0,>=2.4.0
  Downloading pyflakes-2.4.0-py2.py3-none-any.whl (69 kB)
     |████████████████████████████████| 69 kB 3.2 MB/s 
Collecting mccabe<0.7.0,>=0.6.0
  Downloading mccabe-0.6.1-py2.py3-none-any.whl (8.6 kB)
Collecting pycodestyle<2.9.0,>=2.8.0
  Downloading pycodestyle-2.8.0-py2.py3-none-any.whl (42 kB)
     |████████████████████████████████| 42 kB 418 kB/s 
Requirement already satisfied: mako in /usr/lib/python3/dist-packages (from pdoc3==0.10.0->-r requirements-tests.txt (line 5)) (1.1.0)
Requirement already satisfied: markdown>=3.0 in /usr/lib/python3/dist-packages (from pdoc3==0.10.0->-r requirements-tests.txt (line 5)) (3.1.1)
Collecting snowballstemmer
  Downloading snowballstemmer-2.2.0-py2.py3-none-any.whl (93 kB)
     |████████████████████████████████| 93 kB 534 kB/s 
Collecting py>=1.8.2
  Downloading py-1.11.0-py2.py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 3.1 MB/s 
Requirement already satisfied: toml in /home/matthias/.local/lib/python3.8/site-packages (from pytest==6.2.5->-r requirements-tests.txt (line 7)) (0.10.2)
Requirement already satisfied: packaging in /usr/lib/python3/dist-packages (from pytest==6.2.5->-r requirements-tests.txt (line 7)) (20.3)
Collecting pluggy<2.0,>=0.12
  Downloading pluggy-1.0.0-py2.py3-none-any.whl (13 kB)
Collecting iniconfig
  Downloading iniconfig-1.1.1-py2.py3-none-any.whl (5.0 kB)
Collecting attrs>=19.2.0
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
     |████████████████████████████████| 60 kB 4.0 MB/s 
Collecting coverage[toml]>=5.2.1
  Downloading coverage-6.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (211 kB)
     |████████████████████████████████| 211 kB 7.0 MB/s 
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from responses==0.17.0->-r requirements-tests.txt (line 10)) (1.14.0)
Collecting types-urllib3<1.27
  Downloading types_urllib3-1.26.8-py3-none-any.whl (13 kB)
Building wheels for collected packages: pytest-dependency
  Building wheel for pytest-dependency (setup.py) ... done
  Created wheel for pytest-dependency: filename=pytest_dependency-0.5.1-py3-none-any.whl size=8199 sha256=317cb9fae73b5024a1610eb886ed542198686e8ca1e3f698e402b7fcf5d60606
  Stored in directory: /home/matthias/.cache/pip/wheels/e0/ce/20/537162ba9c5a3a7e2d64cf312c738c728a3c51e9b052ba16a0
Successfully built pytest-dependency
ERROR: responses 0.17.0 has requirement urllib3>=1.25.10, but you'll have urllib3 1.25.8 which is incompatible.
Installing collected packages: tomli, typing-extensions, click, platformdirs, mypy-extensions, pathspec, black, pyflakes, mccabe, pycodestyle, flake8, mypy, pdoc3, snowballstemmer, pydocstyle, py, pluggy, iniconfig, attrs, pytest, coverage, pytest-cov, pytest-dependency, responses, types-cachetools, types-urllib3, types-requests
  Attempting uninstall: pycodestyle
    Found existing installation: pycodestyle 2.7.0
    Uninstalling pycodestyle-2.7.0:
      Successfully uninstalled pycodestyle-2.7.0
Successfully installed attrs-21.4.0 black-21.12b0 click-8.0.3 coverage-6.3 flake8-4.0.1 iniconfig-1.1.1 mccabe-0.6.1 mypy-0.931 mypy-extensions-0.4.3 pathspec-0.9.0 pdoc3-0.10.0 platformdirs-2.4.1 pluggy-1.0.0 py-1.11.0 pycodestyle-2.8.0 pydocstyle-6.1.1 pyflakes-2.4.0 pytest-6.2.5 pytest-cov-3.0.0 pytest-dependency-0.5.1 responses-0.17.0 snowballstemmer-2.2.0 tomli-1.2.3 types-cachetools-4.2.9 types-requests-2.27.7 types-urllib3-1.26.8 typing-extensions-4.0.1
matthias@e6510:~/Projekte/ebook-reader-dict$ 

Phew. Got it somehow by updating urllib3 using

pip3 install --upgrade urllib3

(Just left all this here for others, in case they get the same errors.)
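
The conflict in the log is easy to misread, because 1.25.8 looks newer than 1.25.10 when compared as text. A minimal sketch of a numeric check, using only the standard library; the helper names are hypothetical, and `version_tuple` only handles plain dotted numbers, not pre-release suffixes:

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.25.10' into (1, 25, 10) so versions compare numerically."""
    return tuple(int(number) for number in re.findall(r"\d+", version))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The situation from the log: responses 0.17.0 requires urllib3>=1.25.10.
print(meets_minimum("1.25.8", "1.25.10"))   # False
print("1.25.8" >= "1.25.10")                # True: plain string comparison misleads
print("urllib3 installed:", installed_version("urllib3"))
```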

BoboTiG commented 2 years ago

It follows that we (in general) should support the "u" in gender templates, but currently don’t have anything for Germany’s third "d" gender. Unfortunately I can’t readily find the comments in the PR where we talked about it, but I’d suggest going for a char class of [fmnu] in the regex and implement anything more whenever Wiktionary decides what to do about the "d" gender.

Have a look here: https://github.com/BoboTiG/ebook-reader-dict/blob/c770a1916a1488f57016837ebbd803358f011df7/wikidict/lang/de/__init__.py#L8

BoboTiG commented 2 years ago

@Moonbase59 do you think it is useful to keep this issue open?

Moonbase59 commented 2 years ago

This one? Depends on your workflow, I’ll agree with whatever you suggest. Probably not, since I’d be adding tons of other stuff here instead of opening separate issues…

Then again, it’s kind of documentation on how to add a new locale (although you already did all the main work).

I would have asked about how to progress from here next, i.e. continue with

python3 -m wikidict de --find-templates

?

BoboTiG commented 2 years ago

Yes, and have a look at the sections.txt file. It contains all sections from all words (there will be a lot of noise, and you are unlikely to want to handle all of them). Pick the interesting/pertinent ones and fill in https://github.com/BoboTiG/ebook-reader-dict/blob/0bd2296e6ba146fbbe3a2af5dba96cdb651c5cec/wikidict/lang/de/__init__.py#L19-L22.

Then, run the same command again, and look at templates.txt. There, you will see all the templates that are not handled (a lot of noise too, and many false positives).

It is a good habit to add a test when adding support for a new section or template.

lasconic commented 2 years ago

I don't speak German at all. But here is a list of templates that probably need to be handled: https://de.wiktionary.org/wiki/Kategorie:Wiktionary:Markierung

Moonbase59 commented 2 years ago

Thanks for that! I haven’t really gotten started, because we find & fix so many other things… ;-)

I’m still also hoping I can start with some things I already seem to handle ok in my Rexx script. Just did a new run today, and its output still looks pretty decent.

For verification/comparison (d/l links stay the same): Kobo and StarDict version.

lasconic commented 2 years ago

The German Wiktionary has a lot of stuff like this: [screenshot: Capture 2022-02-04 à 21 15 44]. What would be the expected result in an e-book reader dictionary (since we don't support links)?

Moonbase59 commented 2 years ago

You mean like in the entry for "hämat-"? Well, depends on how much we are into etymology. In this case, it says a word stems from Greek αἷμα (blood) and the romanisation/pronunciation of that, "haima".

I’d probably either ignore it, or put it in the etymology section (without the →grc pointer) and leave "Blut" in, so users could long-press "Blut" in the description and jump to the entry for it.

Interesting point actually, the lady with the Kobo I talked to before is an editor in a small publishing house, and she’s very much with me regarding etymology entries. Said "Why do you put the etymology in front of all other things someone wants to know about a word, like its meaning? Nobody wants to read 3 pages of how a word came into existence and who mispronounced it in the 16th century—that’s only of interest for linguists, anyway. (Some etymology entries are actually rather long.) If you want to keep it, please at least put it at the end, after examples." (She currently uses my German version but had a peek at what we currently generate.)

Heretical question: Why do we display it so prominently? The German Wiktionary often has only quite sophisticated (or lengthy) explanations here that really interest no one, like

Schiffbau

Determinativkompositum (Zusammensetzung) aus den Substantiven Schiff und Bau

Well, who’d have guessed that "Schiffbau" (shipbuilding) is composed of "Schiff" and "Bau" and that linguists call that a "Determinativkompositum" (determinative compound)… No knowledge gained, time wasted.

Or, take the etymology entry for "grün" (green):

Grün geht auf das althochdeutsche gruoni zurück. Dieses und das mittelhochdeutsche grüene, das altsächsische grōni, das mittelniederdeutsche grȫne, das mittelniederländische groene, das niederländische groen → nl, das altenglische grēne → ang, das englische green → en, das altnordische grœn → non sowie das schwedische grön → sv gehen auf ein zur Zeit des Neuhochdeutschen untergegangenes Verb zurück, das im Althochdeutschen gruoen, im Mittelhochdeutschen grüejen (wachsen, sprießen), im Mittelniederdeutschen grōjen, im Mittelniederländischen groeyen und grōyen, im Niederländischen groeien → nl (wachsen), im Altenglischen grōwan → ang (wachsen, sprießen), im Englischen grow → en (wachsen) und im Altnordischen grōa → non (wachsen, grünen) lautete. Sowohl das Adjektiv als auch das Verb lassen sich auf die indoeuropäische Form ghrō- und somit auf die indoeuropäische Wurzel gher- und *gherə- (hervorstechen (bei Trieben von Pflanzen, Stacheln, Borsten, Kanten), wachsen, grünen). Verwandte Formen sind Grat, Gräte, Granne und Gras.[1] Angesichts dieser Herkunft lassen sich für grün die Bedeutungen ‚sprießend‘ und ‚hervorwachsend‘ erschließen. Jedoch wurde die Bedeutung schon sehr früh auf die Farbe der sprießenden Pflanzen beschränkt und damit stand das Adjektiv eigentlich für ‚von einer Farbe wie sprießende Pflanzen‘. Im Deutschen wird grün allerdings nicht nur als Farbadjektiv benutzt, sondern bildet auch das Gegenteil von trocken und verwelkt und von reif und rot. Da Grün zudem mit dem Frühling in Verbindung gebracht wird, wurde die Farbe zum Sinnbild von Freude, Frohsinn und Hoffnung.[2]

Yawn… (Okay, I’m no linguist and agree that it might be helpful for some.)

On the poor user’s small e-reader screen this will take several pages until we get to what she looks for, namely meaning, or maybe synonyms or examples.

Actually, since she’s an editor and I lecture sometimes, we found that we both are rather interested in synonyms. (Not so much in antonyms.) My (paper) Roget’s is typically within reach… ;-)

BoboTiG commented 2 years ago

English has sizeable etymologies too.

Actually I thought about moving the etymology below definitions. Maybe we even talked about it with @lasconic, but I did not find any discussion.

I do not have a strong opinion, and it could be easy to change that for the Kobo.

lasconic commented 2 years ago

I don't remember this discussion but we might want to see if https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details is supported on Kobo. Kind of related to https://github.com/BoboTiG/ebook-reader-dict/issues/1172 ?

Moonbase59 commented 2 years ago

Using <details> has been discussed a lot on e-reader forums because it would be rather nice to have, but support for it in e-reader rendering software seems flaky at best.

Maybe best to (currently) not use it and just really put etymology after the definitions for now? (I will eventually nag you to include more stuff anyway, hee hee. But one step after the other.)
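
As a rough illustration of the "definitions first, etymology last" ordering, here is a minimal sketch. The function name `render_entry`, its parameters and the HTML layout are hypothetical and are not taken from the project's code; the sketch only shows the ordering idea.

```python
from html import escape


def render_entry(word, definitions, etymology=None):
    """Return a small HTML snippet with definitions first and etymology last."""
    # Hypothetical layout: headword, numbered definitions, then optional etymology.
    parts = [f"<b>{escape(word)}</b>"]
    items = "".join(f"<li>{escape(d)}</li>" for d in definitions)
    parts.append(f"<ol>{items}</ol>")
    if etymology:
        # The etymology goes at the very end so it never pushes the meaning off-screen.
        parts.append(f"<p><i>{escape(etymology)}</i></p>")
    return "".join(parts)


snippet = render_entry(
    "Schiffbau",
    ["shipbuilding"],
    etymology="Determinativkompositum aus den Substantiven Schiff und Bau",
)
print(snippet)
```

With this ordering, a long etymology such as the one for "grün" would only be reached after the reader has already seen the meaning.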

lasconic commented 2 years ago

I implemented a few German templates and filed issues for the ones I can't implement even with Google Translate. You can find them with the locale:Deutsch label: https://github.com/BoboTiG/ebook-reader-dict/issues?q=is%3Aissue+is%3Aopen+label%3Alocale%3ADeutsch.

I caught the unimplemented templates with

print(f"[{word}] --- {template[0]}")

at the end of the de locale's last_template_handler function and ran a script to count how many times each template is used.

So here is a list of German templates needing our attention (with the number of times each appears in the sections we are interested in):

https://gist.github.com/lasconic/7eac492986876dc87b9381e788d0aab3
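
The counting script itself is not shown in this thread. A minimal sketch of what such a step might look like, assuming the debug output consists of lines shaped like the print statement above ("[word] --- template"); the function name `count_templates` and the sample lines are made up for illustration:

```python
import re
from collections import Counter

# Matches lines shaped like the debug print: "[word] --- template"
LOG_LINE = re.compile(r"^\[(?P<word>.+)\] --- (?P<template>.+)$")


def count_templates(lines):
    """Count how often each unhandled template name appears in the log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("template")] += 1
    return counts


# Made-up sample log lines, for illustration only.
sample_log = [
    "[Haus] --- foo",
    "[Baum] --- bar",
    "[Schiff] --- foo",
    "unrelated output",
]

# Print the most frequently used templates first.
for template, uses in count_templates(sample_log).most_common():
    print(f"{uses:>5}  {template}")
```

Sorting by frequency makes it easy to pick the most impactful templates first.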

Moonbase59 commented 2 years ago

Thanks for your help! I’ll try to work on the templates soon. Is there some definite place in the Wiktionaries where they elaborate on the templates used? I never seem to be able to find that…

lasconic commented 2 years ago

You mean all the templates used in the whole Wiktionary for a given locale? I don't know. For de, I found

In any case, we don't need to implement all templates but just the ones used in the sections we render.

Moonbase59 commented 2 years ago

Thanks! Do we have to exclude those we don’t (currently) use?

lasconic commented 2 years ago

> Thanks! Do we have to exclude those we don’t (currently) use?

There is no need to implement the ones we don't use for now. We all have limited time, so let's focus on the most impactful ones.


I just merged a PR that adds support for more templates. The gist is updated: https://gist.github.com/lasconic/7eac492986876dc87b9381e788d0aab3

Templates to support: 116 --> 66

Number of times the templates are used: 1302 --> 818

tjaderxyz commented 2 years ago

I have been using this dictionary on KOReader and it's been serving me well, thanks for making it available!

The main problem I have so far is the lack of inflected forms. For example, it has an entry for the verb tragen, but not for the conjugated form trage, and it has an entry for the adjective rot, but not for the declined form rote.

In both cases there's an entry for the inflected form in Wiktionary, saying it's an inflected form. In my opinion the ideal result would be for the inflected form to lead directly to the main form's definition, but just showing the inflected form definition would already be better than the current behaviour.

I took a look at the code but could figure out neither why the inflected forms are being dropped nor how to make one word lead to another word's definition. Then again, I'm not too familiar with how the generation works overall.

lasconic commented 2 years ago

Thank you for your feedback. Since you use KOReader, you use the StarDict format. Right now I'm not sure whether "variants" are carried over when we convert from the Kobo format to StarDict, but in any case, variants are not implemented for German yet. If they were implemented the same way as for French and Spanish, it would work the way you describe. You can see how I did it for Spanish here: https://github.com/BoboTiG/ebook-reader-dict/pull/1227

Reading https://github.com/ilius/pyglossary/blob/master/pyglossary/plugins/ebook_kobo_dictfile.py it seems variants are read from the Kobo file, so it could be that they work in StarDict.
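
To make the "variants" idea concrete, here is a minimal sketch of how inflected forms could be grouped under their main form so that looking up either one returns the main form's definition. This is not the project's actual implementation (see the Spanish PR linked above for that). The data layout, the `form_of` key and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical, simplified entries: inflected forms point at their lemma via "form_of".
entries = {
    "tragen": {"definition": "to carry; to wear", "form_of": None},
    "trage": {"definition": None, "form_of": "tragen"},
    "rot": {"definition": "red", "form_of": None},
    "rote": {"definition": None, "form_of": "rot"},
}


def collect_variants(entries):
    """Group inflected forms under the lemma they belong to."""
    variants = defaultdict(list)
    for word, entry in entries.items():
        lemma = entry["form_of"]
        if lemma is not None and lemma in entries:
            variants[lemma].append(word)
    return dict(variants)


def build_lookup(entries, variants):
    """Map every lemma and every variant to the lemma's definition."""
    lookup = {}
    for word, entry in entries.items():
        if entry["form_of"] is None:
            lookup[word] = entry["definition"]
    for lemma, forms in variants.items():
        if lemma not in lookup:
            continue  # the target is itself an inflected form; skipped in this sketch
        for form in forms:
            lookup.setdefault(form, lookup[lemma])
    return lookup


variants = collect_variants(entries)
lookup = build_lookup(entries, variants)
print(lookup["trage"])
print(lookup["rote"])
```

In this sketch, "trage" resolves to the definition stored for "tragen" and "rote" to the one stored for "rot", which is the behaviour requested above.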

tjaderxyz commented 2 years ago

I tested the Spanish dictionary in KOReader and it works as expected: both rojo and roja bring up the definition for rojo.

Thanks for pointing to the Spanish implementation commit, I'll study it when I have some time and try to implement something similar for German (if no one gets to it before me).

tjaderxyz commented 2 years ago

I have made a preliminary implementation in #1256; from my testing, the resulting dictionary is satisfactory.

BoboTiG commented 2 years ago

I'm closing this one as the primary subject is now well handled :) Thanks everyone :champagne: