en-wl / wordlist

SCOWL (and friends).
http://wordlist.aspell.net

Aspell flags "doesn’t" #45

Closed kevina closed 10 years ago

kevina commented 10 years ago

Reported by yecril71pl on 2011-01-14 17:29 UTC Orig. from https://sourceforge.net/p/wordlist/issues/45/ One possible replacement being "doesn't". However, I do not want to write "doesn't". I ended up writing "doesn‘t" several times, flagged in Firefox but perfectly good for Aspell…

kevina commented 10 years ago

Commented by kevina on 2011-01-14 19:08 UTC I think this is a Firefox or Hunspell specific issue. doesn't is in the dictionary.

kevina commented 10 years ago

Updated by kevina on 2011-01-14 19:08 UTC

kevina commented 10 years ago

Commented by kevina on 2011-01-14 23:39 UTC This bug report is really confusing. On my browser all three forms of the single quote look almost identical. Here is the problem as I see it: Both Aspell and Firefox/Hunspell only accept:

doesn't (' = U+0027) ASCII
doesn’t (’ = U+2019) Unicode, correct
doesn‘t (‘ = U+2018) Unicode, wrong
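Since the three quotes are hard to tell apart on screen, it can help to print their code points. The following is a minimal Python sketch, not part of Aspell or Hunspell; the `forms` table and the `describe_quote` helper are names made up for this illustration.

```python
import unicodedata

# The three spellings from the list above, written with explicit escapes
# so the quote character cannot be confused by the font.
forms = {
    "ASCII": "doesn\u0027t",
    "Unicode, correct": "doesn\u2019t",
    "Unicode, wrong": "doesn\u2018t",
}

def describe_quote(word):
    """Return (code point, Unicode name) of the quote at index 5 of the word."""
    ch = word[5]
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

for label, word in forms.items():
    print(label, *describe_quote(word))
```

Pasting a word copied from an editor into `describe_quote` is a quick way to state explicitly which quote is in use.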

You don't want to use doesn't (ASCII); you want to use doesn’t (Unicode), correct?

There is not a simple or clean solution to this problem, and the solution for Aspell and Firefox/Hunspell will likely be different.

(1) Shall we let both in the dictionary? Well then, whenever someone misspells a word with a ' in it, both forms will appear, and they might look identical to the user (depending on the font). The user will think this is a bug and randomly select one of them. The end result is a mix of ', some ASCII and some Unicode.

Or, (2) Shall we convert the Unicode ' to ASCII? Well, the spell checker will accept it, but then if you misspell "doesn't" it will only offer the ASCII one. Now we can convert the ASCII back to Unicode, but for most users ASCII is the preferred form.

(2) is a better solution, but only if the back conversion (ASCII back to Unicode) is an option selected by the user or somehow automatically detected (harder).
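Solution (2) can be sketched roughly as follows. This is a toy Python illustration under stated assumptions, not how Aspell or Hunspell implement it: `WORDS` is a made-up three-word dictionary, the suggestion rule (same first three letters) is deliberately naive, and `prefer_unicode` stands in for the user-selected back conversion.

```python
ASCII_APOSTROPHE = "\u0027"
UNICODE_APOSTROPHE = "\u2019"

# Hypothetical stand-in for a real dictionary; it stores ASCII forms only.
WORDS = {"doesn't", "don't", "does"}

def to_ascii(word):
    """Map the Unicode apostrophe (U+2019) to the ASCII one (U+0027)."""
    return word.replace(UNICODE_APOSTROPHE, ASCII_APOSTROPHE)

def to_unicode(word):
    """Back conversion: map the ASCII apostrophe to U+2019."""
    return word.replace(ASCII_APOSTROPHE, UNICODE_APOSTROPHE)

def check(word):
    """Accept a word if its ASCII-normalized form is in the dictionary."""
    return to_ascii(word) in WORDS

def suggest(word, prefer_unicode=False):
    """Toy suggestions: dictionary words sharing the first three letters."""
    normalized = to_ascii(word)
    candidates = sorted(w for w in WORDS if w[:3] == normalized[:3])
    if prefer_unicode:
        # Only applied when the user opts in to the back conversion.
        candidates = [to_unicode(w) for w in candidates]
    return candidates
```

With this shape, U+2019 input is accepted, U+2018 input is still rejected, and suggestions stay ASCII unless the back conversion is switched on.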

Aspell also has the problem in that it doesn't think the ’ (Unicode) is a valid word character. I will eventually fix this, but it's not a high priority because it will only be a partial solution.

I will let the Hunspell author speak for it.

kevina commented 10 years ago

Updated by kevina on 2011-01-14 23:39 UTC

kevina commented 10 years ago

Commented by yecril71pl on 2011-01-15 00:13 UTC Aspell flags "doesn’t" within KWrite, and also when called explicitly with LANG=en_US.utf8; however, in the latter case it does not come up with the suggestion "doesn't". OTOH, it rejects "doesn‘t" as well (which is good).

kevina commented 10 years ago

Commented by kevina on 2011-01-15 00:20 UTC I guess it depends on who does the tokenization. Aspell's internal tokenizer will treat "doesn’t" (Unicode) as two words.
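The effect of the tokenizer can be illustrated with a small Python sketch. It does not reproduce Aspell's tokenizer; it only assumes a rule in which word characters are ASCII letters plus the ASCII apostrophe, and then a variant that also admits U+2019. The pattern names and `tokenize` are hypothetical.

```python
import re

# Assumption: only ASCII letters and U+0027 count as word characters.
ASCII_ONLY = re.compile(r"[A-Za-z']+")

# Same rule, with U+2019 also treated as a word character.
WITH_U2019 = re.compile(r"[A-Za-z'\u2019]+")

def tokenize(text, pattern):
    """Split text into words according to the given word-character pattern."""
    return pattern.findall(text)
```

Under the first rule, "doesn’t" written with U+2019 splits into "doesn" and "t", so neither piece matches the dictionary entry; under the second it stays one token.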

Also, please explicitly state which ' you are using, otherwise it is very confusing to read. I have to nearly double my font size in order to tell the difference.

kevina commented 10 years ago

Commented by kevina on 2011-01-15 00:30 UTC And it appears that newer versions of Hunspell (at least as used with Firefox) partly implement solution (2) (Chrome rejects doesn’t (Unicode), however). That is, it will accept doesn’t (Unicode), but if you misspell it, it will only suggest the ASCII version.

kevina commented 10 years ago

Commented by kevina on 2011-01-15 05:46 UTC

kevina commented 10 years ago

Commented by kevina on 2011-01-15 06:05 UTC Assuming you are using Aspell 0.60, try replacing the existing iso-8859-1.cmap with the attached one. Use "aspell --config data-dir" to find out the location. (Of course, back up the original.) With this, Aspell should accept the Unicode '.

If you want suggestions to always contain the Unicode version, then somewhere add the config option "norm-form hack" (see the Aspell manual) or on the command line use "--norm-form=hack". If you spell check in other languages this option will likely break things because, as the name implies, this is a hack and will only work with languages using the iso-8859-1 charset internally. Things might also go wrong if you try to check a non-Unicode document with the option enabled (it will always try to map U+0027 to U+2019, which, if it doesn't exist in the target charset, will then get mapped to '?').
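The '?' failure mode can be reproduced by analogy with Python's codecs. This sketch is not Aspell's conversion code: `hack_normalize` and `write_out` are hypothetical helpers, and the replacement behaviour shown is that of Python's `errors="replace"` handler, which happens to substitute '?' in the same way.

```python
def hack_normalize(text):
    """Always map the ASCII apostrophe (U+0027) to U+2019."""
    return text.replace("\u0027", "\u2019")

def write_out(text, charset):
    """Normalize, then encode to the target charset.

    Characters missing from the charset are replaced with '?'.
    """
    return hack_normalize(text).encode(charset, errors="replace")
```

Encoding to UTF-8 keeps the U+2019 apostrophe, while encoding to iso-8859-1, which has no U+2019, turns "doesn't" into "doesn?t".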

I will try to get the first change in the next version of Aspell 0.60 (which I hope to release sometime before the end of the month). The "hack" norm form probably won't make it in unless I can figure out how to make it less of a hack (unlikely).

kevina commented 10 years ago

Commented by kevina on 2011-01-18 08:25 UTC I just committed the first part of the last change to Aspell 0.60. I am considering this bug closed.

This bug really belongs with the Aspell project anyway, as it is not a dictionary issue.

Being able to suggest with the Unicode ' is a separate issue and should be filed as a Feature Request for Aspell.

kevina commented 10 years ago

Updated by kevina on 2011-01-18 08:25 UTC

kevina commented 10 years ago

Commented by kevina on 2011-01-18 20:24 UTC See https://sourceforge.net/tracker/index.php?func=detail&aid=1732918&group_id=245&atid=350245

Post additional comments there.