Closed asfimport closed 10 years ago
Adrien Grand (@jpountz) (migrated from JIRA)
Patch:
Robert Muir (@rmuir) (migrated from JIRA)
I think this should be a tokenizer.
Adrien Grand (@jpountz) (migrated from JIRA)
Good point, I'll update the patch to create a ThaiTokenizer so that we can just completely deprecate this filter.
Robert Muir (@rmuir) (migrated from JIRA)
Tokenizing from a BreakIterator can get a little tricky.
We had some support for this (it should be re-reviewed) in the initial Kuromoji integration (SegmentingTokenizerBase.java and its test), but we ended up adding a streaming Viterbi search, so we didn't need it anymore:
Robert Muir (@rmuir) (migrated from JIRA)
I cut this over to ThaiTokenizer with that base class restored from Kuromoji. The tokenizer itself is simpler now. I think we can use the same approach with SmartChinese.
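As a rough illustration of what tokenizing from a BreakIterator involves, here is a minimal standalone sketch that uses only the JDK's `java.text.BreakIterator`. It is not the patch and not the restored base class; the class name `BreakIteratorWords` and its `Word` holder are hypothetical, made up for this example. It walks the word boundaries and records each non-whitespace span with offsets into the original text, which is the bookkeeping a tokenizer has to get right.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
public class BreakIteratorWords {

    /** A segment of text plus its start/end offsets in the original input. */
    public static final class Word {
        public final String text;
        public final int start;
        public final int end;

        Word(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits text at word boundaries, dropping whitespace-only spans. */
    public static List<Word> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Word> out = new ArrayList<>();
        int start = it.first();
        // Each pair of consecutive boundaries delimits one candidate span.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String span = text.substring(start, end);
            if (!span.trim().isEmpty()) {
                // Offsets always refer to the original input.
                out.add(new Word(span, start, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Word w : words("hello, world", Locale.ENGLISH)) {
            System.out.println(w.text + " [" + w.start + "," + w.end + ")");
        }
    }
}
```

Because the spans come straight from the input, the offsets never need adjusting afterwards. Note that punctuation still comes out as its own span here, and that how well Thai text is segmented depends on the locale data available to the JDK in use.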
Robert Muir (@rmuir) (migrated from JIRA)
Updated patch: I also cut over SmartChinese to use this same approach while we are here.
Ryan Ernst (@rjernst) (migrated from JIRA)
+1, patch lgtm
Is fixing Smart Chinese to not emit punctuation as simple as hardcoding the list of punctuation characters and skipping them in something like incrementWord()?
Robert Muir (@rmuir) (migrated from JIRA)
It's even simpler than that, but I wanted to do that in a follow-up issue. 4.8 is a good time to fix it, as it's easy with this tokenizer!
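The thread does not show what the simpler fix looks like, so the following is only one possible reading, written as a standalone sketch rather than as Lucene code. Instead of hardcoding a punctuation list, it skips any candidate word that contains no letter or digit. The class `PunctuationSkippingWords` and its `isWord` check are hypothetical names invented for this example; the loop in `advance()` stands in for whatever a tokenizer's word-advancing method would do.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; not the actual follow-up change.
public class PunctuationSkippingWords implements Iterator<String> {
    private final Iterator<String> candidates;
    private String next;

    public PunctuationSkippingWords(List<String> candidates) {
        this.candidates = candidates.iterator();
        advance();
    }

    /** Assumption: a "word" is any segment with at least one letter or digit. */
    static boolean isWord(String segment) {
        return segment.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    /** Moves to the next segment that qualifies as a word, skipping the rest. */
    private void advance() {
        next = null;
        while (candidates.hasNext()) {
            String candidate = candidates.next();
            if (isWord(candidate)) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }

    public static void main(String[] args) {
        Iterator<String> it =
            new PunctuationSkippingWords(Arrays.asList("我", "，", "你", "。"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

A category-based check like this avoids maintaining a list of punctuation characters, at the cost of also dropping symbol-only segments that an application might want to keep.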
Simon Willnauer (@s1monw) (migrated from JIRA)
I really like the base class! The patch LGTM, +1 to commit.
ASF subversion and git services (migrated from JIRA)
Commit 1579846 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1579846
LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter
ASF subversion and git services (migrated from JIRA)
Commit 1579853 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1579853
LUCENE-4984: actually pass down the AttributeFactory to superclass
ASF subversion and git services (migrated from JIRA)
Commit 1579855 from @rmuir in branch 'dev/branches/branch_4x' https://svn.apache.org/r1579855
LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter
Uwe Schindler (@uschindler) (migrated from JIRA)
Close issue after release of 4.8.0
ThaiWordFilter is an offender in TestRandomChains because it creates positions and updates offsets.
Migrated from LUCENE-4984 by Adrien Grand (@jpountz), resolved Mar 21 2014
Attachments: LUCENE-4984.patch (versions: 3)
Linked issues: 5706, 5795