
ThaiAnalyzer fails to tokenize words. [LUCENE-4253] #5323

Open asfimport opened 12 years ago

asfimport commented 12 years ago

The method protected TokenStreamComponents createComponents(String, Reader)

returns a component that is unable to tokenize Thai words. The current return statement is:

return new TokenStreamComponents(source, new StopFilter(matchVersion, result, stopwords));

My experiment was to change the return statement to:

return new TokenStreamComponents(source, result);

That gives me the correct result.
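
For context, the method in question has roughly this shape in the Lucene 3.6/4.x line; this is a reconstruction from the snippets above, and the exact filter chain varies by release:

```java
// Approximate shape of ThaiAnalyzer.createComponents (reconstructed for
// illustration; the exact filter chain differs between Lucene releases).
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  final Tokenizer source = new StandardTokenizer(matchVersion, reader);
  TokenStream result = new StandardFilter(matchVersion, source);
  result = new ThaiWordFilter(matchVersion, result);
  // The line at issue: the StopFilter applies the bundled Thai stopword
  // list by default. The reporter's change returns 'result' directly,
  // i.e. new TokenStreamComponents(source, result), skipping stopword removal.
  return new TokenStreamComponents(source, new StopFilter(matchVersion, result, stopwords));
}
```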


Migrated from LUCENE-4253 by Nattapong Sirilappanich, updated Jul 30 2012

Environment:
Windows 7 SP1
Java 1.7.0-b147

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

One note: "Java 1.7.0-b147"

Don't use that old Java version, it is broken with Lucene and creates corrupt indexes!!! Upgrade at least (as an absolute minimum) to Java 7u1; see http://blog.thetaphi.de/2011/07/real-story-behind-java-7-ga-bugs.html

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I am confused about what the problem is here, but it seems like you don't want to remove stopwords.

If you don't want to remove stopwords, just create a ThaiAnalyzer, passing CharArraySet.EMPTY_SET as the stopwords parameter.
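
For example, a minimal sketch assuming the 3.6-era constructor that takes a stopword set:

```java
// A ThaiAnalyzer that tokenizes Thai text but removes no stopwords.
// Assumes the constructor ThaiAnalyzer(Version, Set<?>); note that
// CharArraySet moved to org.apache.lucene.analysis.util in 4.x.
Analyzer analyzer = new ThaiAnalyzer(Version.LUCENE_36, CharArraySet.EMPTY_SET);
```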

asfimport commented 12 years ago

Nattapong Sirilappanich (migrated from JIRA)

Hi Robert,

Based on your suggestion, I found the actual problem. The problem is that "stopwords.txt" in the package "org.apache.lucene.analysis.th" contains a lot of words that are stop words only for a specific type of usage; that type of usage is stated inside the file itself. And based on the javadoc, since Lucene 3.6 these words are used by default.

In my opinion, this set of words should not be used by default.
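
To see exactly which words are applied by default (a sketch; ThaiAnalyzer follows the StopwordAnalyzerBase pattern, so this accessor should exist in 3.6+):

```java
// Dump the stopword set ThaiAnalyzer applies by default, i.e. the contents
// of stopwords.txt in org.apache.lucene.analysis.th (the default since 3.6).
System.out.println(ThaiAnalyzer.getDefaultStopSet());
```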

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Nattapong: do you want to help us clean it up?

I don't like how long the list is.

Perhaps we can use a smaller list (e.g. the list in http://researchrepository.murdoch.edu.au/7764/2/02Whole.pdf)?

asfimport commented 12 years ago

Nattapong Sirilappanich (migrated from JIRA)

Hi Robert,

Stop words are only useful when combined with correct tokenization. The problem, as stated in the thesis, is that the tokenization process can never give a 100% correct result with any technology available to date.

I'd give the approach in the thesis a try, but it would be risky if it doesn't deliver what it promises. My preference for now is to use no stop words at all, to avoid potential problems.

An example of the problem is the word "คงอยู่" (a two-syllable Thai word meaning persisting or surviving). It is segmented into "คง" (meaning may, might, or probably) and "อยู่" (meaning stay, live, or reside). With the existing stop word list, there is no way to find this word. With the new stop words in the thesis, the term "คง" is the only way to find it, which does not make sense: how can the word meaning "might" return a result for the word meaning "survive"?
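
One way to reproduce this, sketched against the 3.6-era TokenStream consumption pattern (the field name and version here are placeholders):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.th.ThaiAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ThaiTokenDump {
  public static void main(String[] args) throws Exception {
    // Uses the default stopword list; compare with an analyzer built with
    // CharArraySet.EMPTY_SET to see which segments get dropped.
    Analyzer analyzer = new ThaiAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("field", new StringReader("คงอยู่"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
```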

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Right, but having less than 100% segmentation isn't unique to Thai (it happens in many other languages too).

It's always a trade-off: if those measurements are correct and 30% of typical Thai text is stopwords, then keeping all stopwords is a pretty significant performance (and often relevance) degradation.

In general these lists are useful; someone can also choose to use them with the CommonGrams filter for maybe an even better trade-off, as sketched below. That's why I think it's good to keep them (of course as short and minimal as possible).
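
A sketch of that alternative, assuming the CommonGramsFilter from the analyzers-common module (hypothetical wiring; input and stopwords stand in for the surrounding analyzer's token chain and stopword set):

```java
// Instead of deleting stopwords, CommonGramsFilter also emits
// stopword+neighbor bigrams, so phrases containing stopwords stay searchable.
TokenStream filtered = new CommonGramsFilter(matchVersion, input, stopwords);
```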

If someone doesn't mind the downsides, you can always pass CharArraySet.EMPTY_SET as the stopwords parameter, as I mentioned before.

asfimport commented 12 years ago

Nattapong Sirilappanich (migrated from JIRA)

I see your point. However, it is harder than it looks. Correct me if I'm wrong.

As stated in the thesis itself: This makes retrieval and proper recognition of the documents which contain the phrase "SOME THAI PHRASE" almost impossible.

This is because Thai text may construct a word from several of the stop words in that list. Without a better tokenizer, such a word will disappear from the index.

I haven't had a chance to view the thesis that researched those stop words. In my own opinion, the only set of words that would not cause truncation is the set of conjunction words.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This makes retrieval and proper recognition of the documents which contain the phrase "SOME THAI PHRASE" almost impossible.

Right, I just mean the problem is general. 'a' is an English stopword, but it can screw up things like 'L.A.' (Los Angeles) and other "terms" because of how they are tokenized. This is just a common trade-off.

It's just that with the current Thai tokenizer and the overly aggressive list, it's much more pronounced.

I haven't had a chance to view the thesis that researched those stop words. In my own opinion, the only set of words that would not cause truncation is the set of conjunction words.

Yes, this paper seems difficult to get a hold of.

But I think it's definitely a good idea to try to reduce the current list so it is not so large. It should be less aggressive.

asfimport commented 12 years ago

Nattapong Sirilappanich (migrated from JIRA)

But I think it's definitely a good idea to try to reduce the current list so it is not so large. It should be less aggressive.

I think trying something that was researched is a good idea. Please proceed with the new list. Thanks for spending time discussing a Thai-specific problem here.