BoboTiG / ebook-reader-dict

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.
http://www.tiger-222.fr/?d=2020/04/17/22/14/21-un-dictionnaire-alternatif-et-complet-pour-votre-liseuse
MIT License

Pronunciation output: "colon space" before, "\" and other issues. #1174

Closed Moonbase59 closed 2 years ago

Moonbase59 commented 2 years ago

I always wonder why we have the "colon space" artifact before the pronunciation.

Wouldn’t it be better to show which phonetic alphabet is shown instead? As in:

IPA: [trɑːnsˈkrɪpʃn̩] X-SAMPA: [trA:ns"krIpSn_=]

(X-SAMPA is often used in Text-to-Speech systems, dictionaries mostly use the IPA.)

I have no idea how many entries in the Wiktionaries are using SAMPA or X-SAMPA, probably only a few. Might still be helpful to show which, don’t you think? Or only take the IPA, but then remove the ": " artifact.


IPA has no backslash, as far as I know. But we still generate things like

: \ˈwɪkʃən(ə)ɹi\, \ˈwɪkʃənɹɪ\

which I believe are leftover artifacts from somewhere having quotes escaped.


Traditionally, IPA pronunciation is also enclosed in square brackets (as shown above), but I don’t know the reason for it. Should we adapt that?

EDIT: Found it: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet#Brackets_and_transcription_delimiters

EDIT 2: The word "Wiktionary" (EN) is given as

Pronunciation

    (UK) IPA(key): /ˈwɪkʃən(ə)ɹi/, (Received Pronunciation) IPA(key): /ˈwɪkʃənɹɪ/

in the EN Wiktionary.

Currently, we show it as:

: \ˈwɪkʃən(ə)ɹi\, \ˈwɪkʃənɹɪ\

We should be sure to take the whole definitions (including slashes, brackets, stress marks) into our output, so more like:

IPA: /ˈwɪkʃən(ə)ɹi/, /ˈwɪkʃənɹɪ/

or (without the "IPA: "):

/ˈwɪkʃən(ə)ɹi/, /ˈwɪkʃənɹɪ/

Call me a nitpicker—I just love quality! :-)
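
A minimal sketch of the "take the whole definition" idea, assuming the pronunciation line is available as rendered text like the one quoted above. The names `extract_transcriptions` and `format_pronunciations` and the `label` parameter are hypothetical and are not part of wikidict; the regular expression only covers the /…/ and […] delimiters mentioned in this thread.

```python
import re

# Matches a /.../ or [...] transcription and keeps the delimiters.
TRANSCRIPTION = re.compile(r"/[^/]+/|\[[^\]]+\]")


def extract_transcriptions(rendered: str) -> list[str]:
    """Return every delimited transcription found in a rendered line."""
    return TRANSCRIPTION.findall(rendered)


def format_pronunciations(rendered: str, label: str = "") -> str:
    """Join the transcriptions with ", ", optionally prefixed by a label."""
    joined = ", ".join(extract_transcriptions(rendered))
    return f"{label}{joined}" if joined else ""


line = "(UK) IPA(key): /ˈwɪkʃən(ə)ɹi/, (Received Pronunciation) IPA(key): /ˈwɪkʃənɹɪ/"
print(format_pronunciations(line, label="IPA: "))
```

Passing an empty `label` gives the second variant (transcriptions only); stress marks and the optional "(ə)" survive because nothing inside the delimiters is touched.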

BoboTiG commented 2 years ago

> I always wonder why we have the "colon space" artifact before the pronunciation.

Leading colon+space are a glitch, they should not be there.

> Traditionally, IPA pronunciation is also enclosed in square brackets (as shown above), but I don’t know the reason for it. Should we adapt that?

Backslashes are used in the French Wiktionary, and we used it as a basis for all other dicts. Each locale has its own way of displaying IPAs, so let's go with the brackets for all of them :+1:

Let's also tackle multiple IPAs on the English Wiktionary, thanks for the report :)

Moonbase59 commented 2 years ago

Stop… Interesting to know the French use backslashes. And after reading the EN Wiki explanation mentioned above, I’d instead opt for taking what’s there (in the Wiktionary). Including whatever "boundary characters" they use.

Btw, reading it in French (which I don’t speak), it looks like they also use /…/ and […]? What do your printed dictionaries look like? Interestingly, the French Wiktionary indeed uses backslashes, see the entry for "test".

Rationale: Our dicts should be as professional and usable as possible. Agree? So, if different countries use different symbols, it might possibly be better to use these, sacrificing just a little uniformity.

Since we’re currently producing only reference dictionaries, not translation dictionaries, it might be wise to stick with what the users of each country are used to (and what’s correct for them). A foreigner has to learn what’s correct for the selected language, right? (As he would have to learn the language.) And local users will feel "at home".

Maybe we can get @chopinesque’s feedback on this, since (s)he is a pro user?

chopinesque commented 2 years ago

Agreed, localization has to do with adapting things to the user's locale. If we have things adapted to their locale and then we somehow "normalize" them to fit our standardized approach, they may not feel perfectly "at home".

For example, the French tend to use non-breaking spaces before a number of characters, including the colon (:). This is something we would never do in English. So if we had an Anglocentric normalization approach, all these non-breaking spaces would go.

That said, if we are presenting multilingual data then there may have to be a marriage between locale-specific idiosyncrasies and convenience. At the end of the day, the person(s) making all the effort have to decide whether any extra work required is worth the trouble, or whether they have time for that extra work.

BoboTiG commented 2 years ago

I am +1 on using what is defined by the locale.

lasconic commented 2 years ago

I can't reproduce the ": " in the pronunciation. Can anyone explain how to see it?

Moonbase59 commented 2 years ago

I simply downloaded the EN StarDict and looked up the word "Wiktionary" (using GoldenDict on Linux). We have it there. Probably a leftover artifact from removing the "IPA" before the pronunciation, I think.

lasconic commented 2 years ago

Mmm weird,

I don't see it with

python -m wikidict en --get-word "Wiktionary" --raw

Moonbase59 commented 2 years ago

Interesting. Your command looks good here, too.

But if you have a peek into data/en/dict-en-en.df, it looks like this:

@ Wiktionary
: \ˈwɪkʃən(ə)ɹi\, \ˈwɪkʃənɹɪ\ 
<html><p>Blend of <i>wiki</i>&nbsp;+&nbsp;<i>dictionary</i>.</p></br>
<ol><li>A collaborative project run by the Wikimedia Foundation to produce a free and complete dictionary in every language; the dictionaries, collectively, produced by that project.</li><li>A particular version of this dictionary project, written in a certain language, such as the English-language Wiktionary (often known simply as the English Wiktionary).</li></ol>

(for all entries having pronunciation)

It seems the Kobo dicthtml also does not have the "colon space". Taken from data/en/tmp/wi.raw.html (beautified):

<w>
  <p><a name="Wiktionary" /><b>Wiktionary</b> \ˈwɪkʃən(ə)ɹi\, \ˈwɪkʃənɹɪ\<br /><br />
  <p>Blend of <i>wiki</i>&nbsp;+&nbsp;<i>dictionary</i>.</p></br>
  <ol>
    <li>A collaborative project run by the Wikimedia Foundation to produce a free and complete dictionary in every language; the dictionaries, collectively, produced by that project.</li>
    <li>A particular version of this dictionary project, written in a certain language, such as the English-language Wiktionary (often known simply as the English Wiktionary).</li>
  </ol>
  </p>
</w>

… which brings me to the next bug: </br>?! Probably a typo, meant to be <br/>?

BoboTiG commented 2 years ago

It seems to be an artefact from when PyGlossary creates the StarDict :thinking:

I guess this has been the case since we introduced StarDict support, but as I never used it, it may have gone under our radar.

Moonbase59 commented 2 years ago

So we’re a perfect match: I almost exclusively use StarDict! :grin:

Moonbase59 commented 2 years ago

The bad </br> is here: https://github.com/BoboTiG/ebook-reader-dict/blob/9a02781b5f0840520aad2c9def08ba87137bac1c/wikidict/convert.py#L75

BoboTiG commented 2 years ago

Oh good catch! If that is the issue, mind opening a PR? :)

lasconic commented 2 years ago

If this bad br went unnoticed for so long, maybe it should be removed?

BoboTiG commented 2 years ago

We need to check, I guess it depends on the flexibility of the HTML parser.

If it is useless on Kobo too, then let's remove it, yeah.
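
For reference, a small sketch of both options discussed here (replace the stray `</br>` with a valid `<br/>`, or drop it). The helper name `normalize_br` and its `replacement` parameter are hypothetical; this is not the code at the linked line of convert.py, only an illustration of the fix on an already generated HTML string.

```python
import re

# "</br>" is a closing tag for a void element, which is not valid HTML.
STRAY_BR_CLOSE = re.compile(r"</br\s*>", re.IGNORECASE)


def normalize_br(html: str, replacement: str = "<br/>") -> str:
    """Replace every stray </br>; pass replacement="" to remove it instead."""
    return STRAY_BR_CLOSE.sub(replacement, html)


snippet = "<p>Blend of <i>wiki</i>&nbsp;+&nbsp;<i>dictionary</i>.</p></br>"
print(normalize_br(snippet))                  # keeps a line break
print(normalize_br(snippet, replacement=""))  # drops it entirely
```

Which variant is right depends on how the target readers (Kobo, StarDict clients) treat the tag, which is the check mentioned above.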

lasconic commented 2 years ago

Ok. I'll file another issue for the BR.

For the colon, we output it correctly in the df file. The colon is necessary there! See https://pgaskin.net/dictutil/dictgen/ So it's a bug in pyglossary. It should remove the ": " when parsing the file.
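
To illustrate what "remove the colon when parsing" means, here is a rough sketch that reads one entry of the df excerpt shown earlier. It assumes only the two markers visible in that excerpt ("@ " for the headword, ": " for the header line) and is not PyGlossary's actual reader; `parse_df_entry` is a hypothetical name.

```python
def parse_df_entry(block: str) -> dict[str, str]:
    """Split one dictgen-style entry into headword, header and body."""
    headword, header, body = "", "", []
    for line in block.splitlines():
        if line.startswith("@ ") and not headword:
            headword = line[2:].strip()
        elif line.startswith(": ") and not header and not body:
            # The ": " prefix is file markup, not part of the pronunciation.
            header = line[2:].strip()
        else:
            body.append(line)
    return {"headword": headword, "header": header, "body": "\n".join(body).strip()}


entry = "\n".join([
    "@ Wiktionary",
    r": \ˈwɪkʃənɹɪ\ ",
    "<html><p>Blend of <i>wiki</i> + <i>dictionary</i>.</p>",
])
parsed = parse_df_entry(entry)
print(parsed["headword"], parsed["header"])
```

A converter that copies the header line verbatim instead of stripping the marker would produce exactly the leading ": " seen in the StarDict output.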

lasconic commented 2 years ago

I filed an issue https://github.com/ilius/pyglossary/issues/358

Moonbase59 commented 2 years ago

> For the colon, we output it correctly in the df file. The colon is necessary there! See https://pgaskin.net/dictutil/dictgen/

Good catch! So that’s what puts the pronunciation next to the title… (looks odd to me).

lasconic commented 2 years ago

> So that’s what puts the pronunciation next to the title… (looks odd to me).

The default dictionary on Kobo does that.

ilius commented 2 years ago

> I filed an issue https://github.com/ilius/pyglossary/issues/358

Please let me know if it's fixed.