Rainie3535 / sigil

Automatically exported from code.google.com/p/sigil
GNU General Public License v3.0

Spell check should support other languages when tokenizing words #1178

Closed: GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This issue can be demonstrated with French, where an apostrophe (') should be 
treated as a word boundary. That does not happen, because the tokenizer applies 
English boundary rules, under which ' is part of a word. See 
https://bugzilla.mozilla.org/show_bug.cgi?id=415254 and 
http://www.mobileread.com/forums/showpost.php?p=1916976&postcount=11 for more 
details. Also, see 
http://www.mobileread.com/forums/showpost.php?p=1917535&postcount=20 for sample 
text.
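
To illustrate the idea, here is a minimal Python sketch of a tokenizer that picks its apostrophe rule per language. It is not Sigil's code: `tokenize`, `APOSTROPHE_IS_BOUNDARY`, and the language codes are hypothetical names chosen for illustration, and real French handling may need more than this.

```python
import re

# Assumption: languages where ' separates words (elision, e.g. "l'homme").
# Only French is listed because it is the case reported in this issue.
APOSTROPHE_IS_BOUNDARY = {"fr"}

# [^\W\d_] matches any Unicode letter (a word character that is
# neither a digit nor an underscore).
LETTERS = r"[^\W\d_]+"


def tokenize(text, lang):
    """Return the words in text, using a per-language apostrophe rule."""
    if lang in APOSTROPHE_IS_BOUNDARY:
        # The apostrophe ends a word: "l'homme" -> "l", "homme".
        pattern = LETTERS
    else:
        # The apostrophe may join letter runs: "don't" stays one word.
        pattern = LETTERS + r"(?:'" + LETTERS + r")*"
    return re.findall(pattern, text)


french = tokenize("l'homme n'est pas", "fr")
english = tokenize("we don't", "en")
```

With these rules, the French sample is split at each apostrophe while the English contraction stays whole, which is the behaviour requested above.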

Original issue reported on code.google.com by john@nachtimwald.com on 15 Jan 2012 at 10:47

GoogleCodeExporter commented 9 years ago
Shouldn't "-" be considered a word boundary, too? Many texts don't use the 
entity "—" and simply put a "-" between the words without any spaces. Both the 
word to the left and the word to the right of the "-" are then underlined in 
red, which makes proofreading difficult.

Example:
"And since it burned, we don't have much from it—just these few papers"
The "it—just" is underlined.

Original comment by frank.ki...@gmail.com on 17 Jan 2012 at 10:25

GoogleCodeExporter commented 9 years ago
Another character that should *always* work as a word boundary is U+2014, 
the Unicode code point of the em dash character.
I guess many more boundaries are missing, but the variations of the 
hyphen character are probably the most frequent examples in English...
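
A rough Python sketch of what is being asked for here, under my own assumptions: the en dash and em dash always split words, while splitting on a plain "-" is optional because hyphenated compounds may be wanted as one word. The names `split_words` and `ALWAYS_BOUNDARY` are hypothetical and not taken from Sigil.

```python
import re

# Assumed set of dashes that always separate words:
# U+2013 (en dash) and U+2014 (em dash).
ALWAYS_BOUNDARY = "\u2013\u2014"


def split_words(text, split_hyphen=False):
    """Split text on whitespace and dashes; '-' only when requested.

    Other punctuation is left attached to the words; this sketch
    covers the dash handling only.
    """
    chars = ALWAYS_BOUNDARY + ("-" if split_hyphen else "")
    pattern = "[\\s" + re.escape(chars) + "]+"
    return [word for word in re.split(pattern, text) if word]


em_dash_words = split_words("from it\u2014just these")
compound_kept = split_words("proof-reading")
compound_split = split_words("proof-reading", split_hyphen=True)
```

Keeping the hyphen rule behind a flag leaves room for it to differ by language or by user preference.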

Original comment by frank.ki...@gmail.com on 17 Jan 2012 at 12:48

GoogleCodeExporter commented 9 years ago
I'm not listing every current word boundary character here because they are in 
the code. ' is an example of how a character can be a boundary in one language 
but not in another.

Original comment by john@nachtimwald.com on 17 Jan 2012 at 12:56

GoogleCodeExporter commented 9 years ago
Similar to this, a smart single quote is causing issues. Example:

Aegon’s

is underlined, and it is still not recognized after I add it to the 
dictionary. A normal single quote works OK.
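
One possible explanation, which I have not confirmed against the code, is that the word is stored and looked up with different apostrophe characters. A small Python sketch of normalizing U+2019 to a plain ' on both paths follows; `normalize_apostrophes`, `add_word`, and `is_known` are hypothetical helpers, not Sigil functions.

```python
def normalize_apostrophes(word):
    """Map the typographic apostrophe U+2019 to the ASCII apostrophe."""
    return word.replace("\u2019", "'")


def add_word(word, dictionary):
    """Store the normalized form so both spellings match later."""
    dictionary.add(normalize_apostrophes(word))


def is_known(word, dictionary):
    """Look up the normalized form of the word."""
    return normalize_apostrophes(word) in dictionary


user_dictionary = set()
add_word("Aegon\u2019s", user_dictionary)
smart_ok = is_known("Aegon\u2019s", user_dictionary)
plain_ok = is_known("Aegon's", user_dictionary)
```

Normalizing on both the add path and the lookup path means it does not matter which form the user typed first.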

Original comment by jesse.ma...@gmail.com on 24 Apr 2012 at 10:53

GoogleCodeExporter commented 9 years ago

Original comment by daveheil...@gmail.com on 18 Sep 2012 at 7:34

GoogleCodeExporter commented 9 years ago

Original comment by daveheil...@gmail.com on 18 Sep 2012 at 7:35

GoogleCodeExporter commented 9 years ago

Original comment by daveheil...@gmail.com on 1 Oct 2012 at 6:53

GoogleCodeExporter commented 9 years ago

Original comment by daveheil...@gmail.com on 16 Feb 2013 at 2:43

GoogleCodeExporter commented 9 years ago

Original comment by john@nachtimwald.com on 24 Jul 2013 at 10:06

GoogleCodeExporter commented 9 years ago

Original comment by john@nachtimwald.com on 3 Nov 2013 at 2:26