BoboTiG / ebook-reader-dict

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.
http://www.tiger-222.fr/?d=2020/04/17/22/14/21-un-dictionnaire-alternatif-et-complet-pour-votre-liseuse
MIT License

Feature request: bilingual dictionaries #973

Open chopinesque opened 3 years ago

chopinesque commented 3 years ago

Would it be possible to create subdictionaries based on EN wiktionary for other languages? For example, German-English (here is a German word: https://en.wiktionary.org/wiki/Nacht)

lasconic commented 3 years ago

Changing this line could work: https://github.com/BoboTiG/ebook-reader-dict/blob/master/wikidict/lang/en/__init__.py#L15 But I'm not sure why you would want to do so. For an EN/DE Kobo dictionary, you might want to check http://download.wikdict.com/dictionaries/kobo/ If that's not what you are looking for, please explain in more detail.

chopinesque commented 3 years ago

German was just an example. The idea is to produce bilingual dictionaries from the main entries of the EN Wiktionary (and not from the Translations section of the English words). For Ancient Greek, for example, the main entries give much larger coverage than the entries in the Translations section.

Would changing the line you mention suffice? I read the add new local section, but I am a little confused about how exactly to run it on a local Wiktionary dump.

lasconic commented 3 years ago

I replaced the line in question with

head_sections = ("==German==", "german")

And ran (sorry, my German is very limited)

python -m wikidict en --gen-dict=Nacht,Kartoffel,schwarz --output=Nacht

And I got the attached file in the Nacht directory. You can try it on your Kobo and see if the 3 words can be found and look good. dicthtml-en.zip
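To illustrate why swapping the heading is enough, here is a minimal sketch of how a tuple like head_sections could be used to pick one language's section out of a page. This is not the project's actual parser: the function name language_section, the sample page, and the matching rule (case-insensitive, spaces ignored) are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the value edited in wikidict/lang/en/__init__.py.
head_sections = ("==German==", "german")


def language_section(wikitext: str, heading: str) -> str:
    """Return the body of the level-2 section titled `heading`, or "" if absent."""
    # Split on level-2 headings only; "===Noun===" does not match this pattern.
    parts = re.split(r"^(==[^=\n]+==)[ \t]*$", wikitext, flags=re.MULTILINE)
    # parts is [preamble, heading1, body1, heading2, body2, ...]
    wanted = heading.replace(" ", "").lower()
    for head, body in zip(parts[1::2], parts[2::2]):
        if head.replace(" ", "").lower() == wanted:
            return body.strip()
    return ""


# Invented sample page, not real Wiktionary content.
page = """==English==
===Noun===
# something

==German==
===Noun===
# night
"""

german = language_section(page, head_sections[0])
```

With this sample, only the German block is kept and the English one is ignored, which matches the behaviour described above.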

chopinesque commented 3 years ago

Thank you! Sadly, I use tsv or Stardict (no Kobo).

lasconic commented 3 years ago

Which language would you be most interested in?

chopinesque commented 3 years ago

Greek and Ancient Greek.

I can see that the part-of-speech templates have an "el" (el-adj, el-verb...) or "grc" prefix for Greek and Ancient Greek respectively. Does the script figure out the templates by itself, or does one need to add/fine-tune them?

lasconic commented 3 years ago

I believe parts of speech are not extracted at all right now. @BoboTiG can confirm. We just use them to choose which definitions we keep or not.

chopinesque commented 3 years ago

Yes, that is what I meant, these templates are needed to decide which part should be extracted and which not :)

lasconic commented 3 years ago

It seems to work without fine-tuning then. I changed the line to:

head_sections = ("==Ancient Greek==", "ancientgreek")

and ran

python -m wikidict en --get-word="Γραῖα"

I got the following, compare with https://en.wiktionary.org/wiki/%CE%93%CF%81%CE%B1%E1%BF%96%CE%B1

Γραῖα   

A name meaning "grey", from Proto-Indo-European *ǵerh₂- (“to grow old”).

  1. Graea, Boeotia; Greece

BoboTiG commented 3 years ago

Indeed, we are only using the parts that matter to the language: the project was not designed for cross-language stuff.

You could play with it and see how it works. Make a copy of the wikidict/lang/en folder to wikidict/lang/en_grc or something like that, and tune the template handling and section names.

chopinesque commented 3 years ago

Well, cross-language could be another possibility then, but thank you so much for all the work so far -:) Having checked the relevant page, I am a bit at a loss as to how to run the script on a Wiktionary dump.

The Γραῖα example appears to keep the Etymology; I guess this is not normally included.

lasconic commented 3 years ago

Etymology is always included in the other languages.

To run it on a dump, check out the code, install the requirements, change the line for the language and run

python -m wikidict en 

After some time, you will get a directory with a .df file. You can convert it to Stardict with pyglossary:

pyglossary --no-progress-bar --no-color data/en/dict-en.df dict-data.ifo

chopinesque commented 3 years ago

But how is the path of the dump defined? (Yes, I found this project via pyglossary -:) )

lasconic commented 3 years ago

The dump will be downloaded in data/en

chopinesque commented 3 years ago

So the script downloads the dump automatically?

lasconic commented 3 years ago

Yes

lasconic commented 3 years ago

I just ran the first steps, and there are only 16,431 entries in Ancient Greek.

lasconic commented 3 years ago

grc.zip

chopinesque commented 3 years ago

Wow, it looks quite good after a quick look. Many thanks. One issue:

Sense 3.1 has a quotations drop-down which is converted to <i>Q</i> <b>Od.</b> https://en.wiktionary.org/wiki/%CE%BB%CE%B1%CE%BC%CE%B2%CE%AC%CE%BD%CF%89

lasconic commented 3 years ago

The quotation block is supposed to be removed entirely.

chopinesque commented 3 years ago

I guess then there is some difference in syntax so that the current regex for that block does not match it.
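As a rough illustration of what removing that block could look like, here is a minimal sketch that drops quotation lines before any template is rendered. It is not the regex the project uses. It assumes quotations appear as definition sub-lines marked with "#*" (plus continuation lines such as "#*:"), which is the usual en.wiktionary convention; the name strip_quotations and the sample wikitext are invented for illustration.

```python
import re

# Assumption: a quotation line starts with one or more "#" followed by "*".
# Continuation lines ("#*:") also start this way, so they are removed too.
QUOTE_LINE = re.compile(r"^#+\*.*\n?", flags=re.MULTILINE)


def strip_quotations(section: str) -> str:
    """Remove quotation sub-lines, keeping definitions and sub-definitions."""
    return QUOTE_LINE.sub("", section)


# Invented sample, loosely shaped like a verb entry with nested senses.
sample = (
    "# to take\n"
    "#* {{Q|grc|Hom.|Od.|1.1}}\n"
    "#*: some translation\n"
    "## to seize\n"
    "##* {{Q|grc|Hom.|Il.}}\n"
    "# to receive\n"
)

cleaned = strip_quotations(sample)
```

Filtering on the line marker rather than on the template name means a quotation template the script does not know about would still be dropped, at the cost of relying on the page following the "#*" convention.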