apache / lucene

Fix ThaiWordFilter [LUCENE-4984] #6048

Closed by asfimport 10 years ago

asfimport commented 11 years ago

ThaiWordFilter is an offender in TestRandomChains because it creates positions and updates offsets.


Migrated from LUCENE-4984 by Adrien Grand (@jpountz), resolved Mar 21 2014. Attachments: LUCENE-4984.patch (versions: 3).

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Patch:

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think this should be a tokenizer.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Good point, I'll update the patch to create a ThaiTokenizer so that we can just completely deprecate this filter.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Tokenizing from a BreakIterator can get a little tricky.

We had some support for this (it should be re-reviewed) in the initial Kuromoji integration (SegmentingTokenizerBase.java and its test), but we ended up adding a streaming Viterbi search, so we didn't need it anymore:

http://svn.apache.org/viewvc?view=revision&revision=1230748
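
For readers unfamiliar with the idea, here is a minimal sketch of word segmentation driven by the JDK's `java.text.BreakIterator`, recording start and end offsets for each word. It is not the code from the patch or from SegmentingTokenizerBase: the class `BreakIteratorSegmenter`, its `Token` type, and the rule "keep only spans containing a letter or digit" are assumptions made for illustration. It also works on a complete in-memory string, which sidesteps the part that is presumably tricky in a real tokenizer: input arrives from a streaming Reader, so text has to be buffered in chunks before a BreakIterator can be applied.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucene.
final class BreakIteratorSegmenter {

    /** A word plus its start (inclusive) and end (exclusive) offsets in the input. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + " [" + startOffset + "," + endOffset + ")";
        }
    }

    /** Splits text at word boundaries and keeps spans that contain a letter or digit. */
    static List<Token> segment(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<Token> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            // The iterator also reports whitespace and punctuation spans; skip those.
            if (containsLetterOrDigit(text, start, end)) {
                tokens.add(new Token(text.substring(start, end), start, end));
            }
        }
        return tokens;
    }

    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        for (Token t : segment("Hello, search world", Locale.ROOT)) {
            System.out.println(t);
        }
    }
}
```

Because each token is produced directly from the original text, its offsets come straight from the boundaries the iterator reports, and nothing downstream has to rewrite them. Whether Thai text is segmented into words this way depends on the locale data shipped with the JDK in use.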

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I cut this over to ThaiTokenizer with that base class restored from Kuromoji. The tokenizer itself is simpler now. I think we can use the same approach with SmartChinese.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch: I also cut over SmartChinese to use this same approach while we are here.

asfimport commented 10 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

+1, patch lgtm

Is fixing Smart Chinese to not emit punctuation as simple as hardcoding the list of punctuation characters and skipping them in something like incrementWord()?
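
To make the question concrete, here is a rough sketch of one way a segmenter could skip punctuation-only words. It is an assumption for illustration, not what the patch or any follow-up issue does: instead of a hardcoded character list it checks Unicode general categories via `Character.getType`, and the names `PunctuationSkipper`, `isAllPunctuation`, and `nextNonPunctuation` are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper for illustration; not part of Lucene.
final class PunctuationSkipper {

    /** Returns true if the term is non-empty and every code point is Unicode punctuation. */
    static boolean isAllPunctuation(CharSequence term) {
        if (term.length() == 0) {
            return false;
        }
        for (int i = 0; i < term.length(); ) {
            int cp = Character.codePointAt(term, i);
            if (!isPunctuation(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Checks the Unicode general category (the P* categories) of one code point. */
    static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** Advances past punctuation-only words; returns the next real word, or null at the end. */
    static String nextNonPunctuation(Iterator<String> words) {
        while (words.hasNext()) {
            String word = words.next();
            if (!isAllPunctuation(word)) {
                return word;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Iterator<String> words = Arrays.asList("\uFF0C", "search", "!").iterator();
        System.out.println(nextNonPunctuation(words));
    }
}
```

A category check like this covers fullwidth and CJK punctuation without maintaining a list, at the cost of also treating characters such as underscores (connector punctuation) as skippable.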

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It's even simpler than that, but I wanted to do that in a follow-up issue. 4.8 is a good time to fix it, as it's easy with this tokenizer!

asfimport commented 10 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I really like the base class! The patch LGTM, +1 to commit.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1579846 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1579846

LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1579853 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1579853

LUCENE-4984: actually pass down the AttributeFactory to superclass

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1579855 from @rmuir in branch 'dev/branches/branch_4x' https://svn.apache.org/r1579855

LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Closing the issue after the release of 4.8.0.