Sigil-Ebook / Sigil

Sigil is a multi-platform EPUB ebook editor
GNU General Public License v3.0

List of spellcheck dictionaries (with fix) #404

Closed BeckyDTP closed 5 years ago

BeckyDTP commented 5 years ago

Hunspell dictionary files are named with underscores (xx_XX), but GetLanguage uses names with hyphens (xx-XX).

Before

The proposed fix may not be perfect but works. The second condition is for dictionaries such as Polish, which do not have the standard two-letter version (pl), but only the extended one (pl_PL).

patch.txt

After

kevinhendricks commented 5 years ago

There are also specialist dictionaries for biology, astronomy, and other fields, so the raw dictionary name need not have a language code embedded.

kevinhendricks commented 5 years ago

Also please generate human readable unified diffs (use -u) if at all possible. Thanks

BeckyDTP commented 5 years ago

Sure. patch-u.txt

BeckyDTP commented 5 years ago

The second condition definitely needs to change.

I checked LibreOffice dictionaries. Result (6 not perfect): hunspell-names-v1.pdf

Better (though still not perfect):

name = lang->GetLanguageName(Utility::Substring(0, 2, fix_dict)) + " - " + Utility::Substring(3, fix_dict.length(), fix_dict);

Result (one not perfect – gug) hunspell-names-v2.pdf

Idea:

  1. Only for dictionary names with an underscore:
     a. Take the letters before the underscore and check them with GetLanguage.
     b. Add " - ".
     c. Add the rest after the underscore.

I hope that then three-letter abbreviations (eg. gug) will be displayed without the full name, because there is no underscore in the name.
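The idea above could be sketched roughly as follows. This is only an illustration in standard C++, not Sigil's Qt code: `lookupLanguageName` and its small table are hypothetical stand-ins for the real language lookup, and two details are assumptions rather than part of the proposal: names without an underscore are returned unchanged, and the raw name is also used when the prefix is unknown.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the language lookup: maps a code to a display
// name and returns an empty string when the code is unknown.
std::string lookupLanguageName(const std::string &code) {
    static const std::map<std::string, std::string> names = {
        {"de", "German"}, {"es", "Spanish"}, {"pl", "Polish"}};
    const auto it = names.find(code);
    return it == names.end() ? std::string() : it->second;
}

// Build a display name only for dictionary names that contain an underscore.
// Anything else (for example "gug") keeps its raw file name in this sketch.
std::string underscoreDisplayName(const std::string &dict) {
    const std::string::size_type pos = dict.find('_');
    if (pos == std::string::npos) {
        return dict;  // no underscore: leave the name alone
    }
    const std::string language = lookupLanguageName(dict.substr(0, pos));
    if (language.empty()) {
        return dict;  // assumption: unknown prefix falls back to the raw name
    }
    // Language name, then " - ", then everything after the first underscore.
    return language + " - " + dict.substr(pos + 1);
}
```

With this sketch, a name such as pl_PL is split at the first underscore, so only the part before it is looked up and the remainder is shown as-is.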

kevinhendricks commented 5 years ago

gug is not an official iso639, 2 character language name.

mapreri commented 5 years ago

yes, it's an official iso639-3. 2-char language names are iso639-1.

kevinhendricks commented 5 years ago

But as I said, it is not an official iso639-1 (2 character) language name. Yes, there are 3 letter and additional iso specs, but we do not use them. Nor, it seems, does any other hunspell dictionary.

BeckyDTP commented 5 years ago

After thinking, I think there are two options:

  1. close this issue without making changes

or

  2. simplify and use only QString name = dict for all dictionaries. Then it will be fair: the list will show the names that users themselves gave to the dictionaries. Currently, Spanish and French are treated better than other dictionaries.

There are too many naming options, so instead of multiplying the conditions, the list would simply show the names of the dictionary files.

kevinhendricks commented 5 years ago

I think I actually like the last version of the patch with the fallback to just using the dict name. If dictionary makers ignore iso639-1 when naming their dictionary, then defaulting to the dict name makes sense.

I will consider applying this for a future release when things calm down a bit.

kevinhendricks commented 5 years ago

Okay, I modified your patch a bit and added the proper gug 3 letter code (from the iso639-3 list). I pushed it to master. I have not tested it at all.

Please do a pull from today's master and check your list again.

Thanks

BeckyDTP commented 5 years ago

I tried this method yesterday. It's not perfect, because for these files it repeats the name of the main language:

ca -> Catalan
ca-valencia -> Catalan
de_DE -> German
de_AT_frami -> German
de_CH_frami -> German
de_DE_frami -> German
es -> Spanish
es_ANY -> Spanish
sr -> Serbian
sr-Latn -> Serbian

kevinhendricks commented 5 years ago

So you feel it would be better to just use the raw dictionary name?

Perhaps we do your replacement of "_" with "-", then split the resulting string on "-".

If the result is one part, simply pass it to the language code lookup; if that returns nothing, go with the raw dict name.

If the result has two or more parts, put the first two back together and look that up. If nothing is found, try with just the first part, and if you get something, append the remaining unused parts to the name returned. If there is still nothing, go with the raw dict name.
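The lookup order described above might look something like this. It is a sketch in standard C++ rather than the Qt code that went into Sigil: `lookupName` and its sample table are hypothetical, the leftover parts are appended with a plain "-" (the comment does not name a separator), and appending leftovers after a successful two-part match is one reading of the description.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical lookup table; an empty result means "code not known".
std::string lookupName(const std::string &code) {
    static const std::map<std::string, std::string> names = {
        {"ca", "Catalan"}, {"cs", "Czech"}, {"de", "German"},
        {"en-US", "English (United States)"}};
    const auto it = names.find(code);
    return it == names.end() ? std::string() : it->second;
}

// Append parts[from], parts[from + 1], ... to name, each preceded by "-".
std::string appendRest(std::string name, const std::vector<std::string> &parts,
                       std::size_t from) {
    for (std::size_t i = from; i < parts.size(); ++i) {
        name += "-" + parts[i];
    }
    return name;
}

std::string dictionaryDisplayName(const std::string &dict) {
    // Replace "_" with "-", then split the result on "-".
    std::string normalized = dict;
    for (char &c : normalized) {
        if (c == '_') c = '-';
    }
    std::vector<std::string> parts;
    std::istringstream stream(normalized);
    for (std::string part; std::getline(stream, part, '-');) {
        parts.push_back(part);
    }
    if (parts.empty()) return dict;

    // Two or more parts: try the first two joined together.
    if (parts.size() >= 2) {
        const std::string name = lookupName(parts[0] + "-" + parts[1]);
        if (!name.empty()) return appendRest(name, parts, 2);
    }
    // Otherwise try the first part alone and append whatever was not used.
    const std::string name = lookupName(parts[0]);
    if (!name.empty()) return appendRest(name, parts, 1);

    // Nothing matched: show the raw dictionary name.
    return dict;
}
```

Under the sample table, de_AT_frami falls through the two-part lookup, matches on de, and keeps AT and frami as a suffix, so regional variants no longer collapse into the same label.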

BeckyDTP commented 5 years ago

If the result has two or more parts, put the first two back together and look that up. If nothing is found, try with just the first part, and if you get something, append the remaining unused parts to the name returned. If there is still nothing, go with the raw dict name.

It can work. That's exactly what I meant, but it was hard for me to put into words. I think that this will support the vast majority of existing dictionaries, and for the others the name of the dictionary file will be displayed.

kevinhendricks commented 5 years ago

I will code that up tomorrow and push it to master and let you know so you can test it.

kevinhendricks commented 5 years ago

Had a few moments this weekend and pushed what we talked about above. Please give it a good test and let me know what changes we still need.

BeckyDTP commented 5 years ago

Possibly add spaces around the hyphen (" - ") in name.append; it looks nicer in the list (Czech-CZ -> Czech - CZ).

It works for me. Of 64 dictionaries, only kmr_Latn (Northern Kurdish) is displayed by its file name, but there is no need to add the entire list of 3-letter language codes, since 99.9% of such dictionaries do not even exist.

kevinhendricks commented 5 years ago

Will make that change and close the issue tonight, after work.
