en-wl / wordlist

SCOWL (and friends).
http://wordlist.aspell.net

Doesn’t recognize apostrophe (U+2019); possibly solved by László Németh’s affix file changes #77

Closed NilsEnevoldsen closed 9 years ago

NilsEnevoldsen commented 10 years ago

March 9th, 2010: László Németh “fixes” OpenOffice issue 107843 by releasing new English dictionaries as an extension. As with previous versions, they were based on Kevin Atkinson’s word lists. The release notes for this extension included:

2010-03-09 (nemeth AT OOo)

  • UTF-8 encoded dictionary:
    • fix em-dash problem of OOo 3.2 by BREAK
    • suggesting words with typographical apostrophes
    • recognizing words with Unicode f ligatures
  • add phonetic suggestion (Copyright (C) 2000 Björn Jacke, see the end of the file)

I believe this subsequently becomes the bundled dictionary extension in OpenOffice.org.

June 5th, 2013: Marco A.G.Pinto creates a “locally hosted copy” of this extension. It becomes the bundled dictionary extension in Apache OpenOffice.

June 2nd, 2014: Marco A.G.Pinto releases a version of the extension that replaces the old en_US and en_CA dictionaries with 2014 versions. The wordlists are (apparently) still created with speller/make-hunspell-dict, but I notice that the affix files look quite different. It appears as though speller/en.aff hasn’t changed recently, so the old affix files were probably modified by some other source – perhaps László Németh or perhaps a bunch of OpenOffice.org contributors. The new affix files do not contain the fixes to OpenOffice Issue 107843. The affix files in the extension appear to be virtually identical to the ones from http://wordlist.aspell.net/dicts/, except that they are UTF-8 instead of ISO-8859-1 (see issue #69?).

To make a long story short, there appeared to be some good changes in the “old” affix file, and I’m not sure whence they came or whether it would be possible to merge them into SCOWL. I happen to be particularly interested in the presence of the rule ICONV ’ ', as it appears that it may fix Chromium issue 165079.
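
For context, here is a minimal sketch of how such a rule is written. It assumes Hunspell's affix syntax, in which ICONV entries form a table introduced by a count line, and it assumes the file is declared as UTF-8 so that U+2019 can appear in it. The fragment is a hypothetical illustration written to a temporary file, not the contents of the SCOWL or OpenOffice affix files.

```shell
# Hypothetical affix fragment, written to a temporary file for illustration.
aff=$(mktemp)
cat > "$aff" <<'EOF'
SET UTF-8
ICONV 1
ICONV ’ '
EOF

# The table has one count line plus one mapping line, so two lines start with ICONV.
grep -c '^ICONV' "$aff"
```

The mapping line asks Hunspell to convert the typographic apostrophe in its input to the ASCII apostrophe before lookup, so a dictionary that stores only the ASCII form could still match words typed with U+2019.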

kevina commented 10 years ago

Pull requests to improve the affix file are more than welcome. The affix file in that extension is compiled by running some sort of tool on the original file. You might want to get in contact with László Németh and see if you can get the original affix file before it was compiled; you might then be able to determine how it compares with the one used by SCOWL.

Note that the affix file must be compatible with Aspell for me to use it in SCOWL. This is important because I use it to munch the word list. The tool provided with Hunspell doesn't get nearly as good a result as Aspell can. In fact, when I wrote the munch-list command in Aspell I designed it to be optimal (although I don't remember what metric I used for this).

Issue #69 is not necessarily related. It would be easy to make a UTF-8 Hunspell dictionary if that is what upstream users (Chromium, Firefox, Libre/OpenOffice) want. Please file a separate issue for this.

NilsEnevoldsen commented 10 years ago

In order to add ICONV ’ ' to the affix file, I’d also need to add SET UTF-8, correct? And that would mean that the dictionaries would need to be converted to unicode?

kevina commented 10 years ago

On 08/20/2014 11:47 PM, Nils Enevoldsen wrote:

> In order to add ICONV ’ ' to the affix file, I’d also need to add SET UTF-8, correct?

I think. I don't know for sure, it might just work if you use the iso-8859-1 encoding.

> And that would mean that the dictionaries would need to be converted to unicode?

Yes. This should be delayed until the final processing step. I think changing line 49 in make-hunspell-dict https://github.com/kevina/wordlist/blob/master/scowl/speller/make-hunspell-dict#L49 to:

cat $1.2 | sort | iconv -f iso-8859-1 -t utf-8 >> $1.dic

will do it.

You will need to look at the large dictionaries in order to check that everything is okay as the normal ones strip accent marks (for example café becomes cafe) and are hence ASCII.
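
A minimal sketch of that conversion and check on a hypothetical two-entry sample, assuming an iconv that accepts the iso-8859-1 and utf-8 encoding names. The file names are made up, and the sketch writes with > where the suggested line appends with >>.

```shell
# Hypothetical sample word list stored as ISO-8859-1.
# \351 is the octal escape for the Latin-1 byte 0xE9 (é).
tmp=$(mktemp -d)
printf 'zebra\ncaf\351\n' > "$tmp/sample.2"

# Same shape as the suggested pipeline: sort, then convert to UTF-8.
LC_ALL=C sort "$tmp/sample.2" | iconv -f iso-8859-1 -t utf-8 > "$tmp/sample.dic"

# In UTF-8, é takes two bytes (0xC3 0xA9), so the output is one byte longer.
before=$(wc -c < "$tmp/sample.2" | tr -d ' ')
after=$(wc -c < "$tmp/sample.dic" | tr -d ' ')
echo "bytes before: $before, after: $after"

# Count lines containing anything outside printable ASCII; these are the
# entries worth inspecting by eye after the conversion.
nonascii=$(LC_ALL=C grep -c '[^ -~]' "$tmp/sample.dic")
echo "lines with non-ASCII bytes: $nonascii"
```

On a dictionary whose accents have been stripped, the byte counts would match and the last count would be zero, which is why the larger dictionaries are the useful test case.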

kevina commented 10 years ago

On 08/21/2014 12:02 AM, Kevin Atkinson wrote:

> On 08/20/2014 11:47 PM, Nils Enevoldsen wrote:
>
>> In order to add ICONV ’ ' to the affix file, I’d also need to add SET UTF-8, correct?
>
> I think. I don't know for sure, it might just work if you use the iso-8859-1 encoding.

I take that back. I didn't realize the character in question is outside the iso-8859-1 range. I forgot Unicode had a special character for the apostrophe and was thinking of U+00B4. Sorry.
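
The difference between the two characters can be checked from a shell. This is a sketch that assumes an iconv which exits with an error on characters the target encoding cannot represent, as glibc's does; other implementations may substitute a placeholder instead.

```shell
# U+2019 (right single quotation mark) is E2 80 99 in UTF-8, given here in octal.
if printf '\342\200\231' | iconv -f utf-8 -t iso-8859-1 >/dev/null 2>&1; then
  curly=representable
else
  curly=not-representable
fi

# U+00B4 (acute accent) is C2 B4 in UTF-8 and maps to the single Latin-1 byte 0xB4.
if printf '\302\264' | iconv -f utf-8 -t iso-8859-1 >/dev/null 2>&1; then
  acute=representable
else
  acute=not-representable
fi

echo "U+2019 in iso-8859-1: $curly"
echo "U+00B4 in iso-8859-1: $acute"
```

Under that assumption the first conversion fails and the second succeeds, which is why an ICONV rule mentioning U+2019 cannot be written in an ISO-8859-1 affix file.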

mwichary commented 9 years ago

I made and verified the necessary change.

This is a PR: https://github.com/kevina/wordlist/pull/103