aymkam / lucene-gosen

Automatically exported from code.google.com/p/lucene-gosen
GNU Lesser General Public License v2.1

Analysis of surrogate pair characters fails. #15

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Analysis of surrogate pair characters fails.

The returned token is split into separate characters (the high and the low surrogate).
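
To illustrate the symptom, here is a minimal sketch (not lucene-gosen code; the class and method names are hypothetical). Stepping through a Java string one `char` at a time yields the high and low surrogate as separate units, while stepping by code point keeps the pair together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it does not exist in lucene-gosen.
class SurrogateSplitDemo {

    // One string per UTF-16 code unit: a surrogate pair comes out as two pieces.
    static List<String> byChar(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(String.valueOf(text.charAt(i)));
        }
        return out;
    }

    // One string per Unicode code point: a surrogate pair stays in one piece.
    static List<String> byCodePoint(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // advances by 2 for a supplementary character
        }
        return out;
    }
}
```

For a string holding a single supplementary character such as U+20BB7, `byChar` returns two elements and `byCodePoint` returns one.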

Original issue reported on code.google.com by johtani on 5 Jul 2011 at 9:47

GoogleCodeExporter commented 8 years ago
We need to check whether the problem occurs in any of the following phases:

1. DictionaryBuilder and Preprocessor
2. Viterbi 
3. TrieBuilder and TrieSearcher
4. StreamFilter (e.g. CompositeTokenFilter...)

Original comment by johtani on 8 Jul 2011 at 2:39

GoogleCodeExporter commented 8 years ago
This sounds bad. Can we come up with some tests? With tests in place, it should
be easy to fix the issue.

We should never split a high/low surrogate pair.
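
One way a test could state this invariant is sketched below, assuming token boundaries are available as `char` offsets into the original text. The class and method names are hypothetical, not part of lucene-gosen. A boundary is invalid if it falls between a high surrogate and the low surrogate that follows it.

```java
// Hypothetical helper; it does not exist in lucene-gosen.
class SurrogateBoundaryCheck {

    // Returns true if a token boundary at this char offset would cut a surrogate pair in half.
    static boolean splitsSurrogatePair(CharSequence text, int offset) {
        if (offset <= 0 || offset >= text.length()) {
            return false; // the start and end of the text can never split a pair
        }
        return Character.isHighSurrogate(text.charAt(offset - 1))
                && Character.isLowSurrogate(text.charAt(offset));
    }

    public static void main(String[] args) {
        // U+20BB7 (encoded as a surrogate pair) followed by two BMP characters.
        String text = "\uD842\uDFB7\u91CE\u5BB6";
        for (int offset = 0; offset <= text.length(); offset++) {
            System.out.println(offset + " -> " + splitsSurrogatePair(text, offset));
        }
    }
}
```

A tokenizer test could then assert that `splitsSurrogatePair` is false for the start and end offset of every returned token.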

Original comment by rcm...@gmail.com on 25 Oct 2011 at 1:57

GoogleCodeExporter commented 8 years ago
For now, I have written one test (src/test/net/java/sen/SurrogatesPairTest.java).
The expected token cost in the test may differ from the value actually output.

Original comment by johtani on 31 Oct 2011 at 10:52
