Replace Spanish suffixes by Portuguese suffixes in the Portuguese snowball stemmer

bongohrtech / lucenenet

Mirror of Apache Lucene.Net

Apache License 2.0


Replace Spanish suffixes by Portuguese suffixes in the Portuguese snowball stemmer #135

Open bongohrtech opened 10 years ago

bongohrtech commented 10 years ago

In PortugueseStemmer.cs [1], there are a few suffixes which I believe were copied by mistake from SpanishStemmer [2]:

"log\u00EDas" should be "logias" (line 137)
"log\u00EDa" should be "logia" (line 113)
"uciones" should be "uções" (line 139)
"uci\u00F3n" should be "ução" (line 120)
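
As a rough illustration only, the four replacements above can be written as a lookup table using the same `\u` escape style as the quoted strings. This is a sketch, not the stemmer's actual code: the class name `PortugueseSuffixFix`, the `REPLACEMENTS` map and the `fix` method are hypothetical names introduced here, and the real stemmer stores its suffixes differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps each Spanish-looking suffix reported in this
// issue to the Portuguese suffix proposed as its replacement.
final class PortugueseSuffixFix {
    static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        // U+00ED = i with acute accent (Spanish spelling only)
        REPLACEMENTS.put("log\u00EDas", "logias");
        REPLACEMENTS.put("log\u00EDa", "logia");
        // U+00E7 = c with cedilla, U+00F5 = o with tilde
        REPLACEMENTS.put("uciones", "u\u00E7\u00F5es");
        // U+00F3 = o with acute accent, U+00E3 = a with tilde
        REPLACEMENTS.put("uci\u00F3n", "u\u00E7\u00E3o");
    }

    // Returns the proposed Portuguese suffix, or the input unchanged
    // when it is not one of the four reported suffixes.
    static String fix(String suffix) {
        return REPLACEMENTS.getOrDefault(suffix, suffix);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Any suffix not listed in this issue passes through `fix` unchanged, so the table covers only the four reported cases.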

For more details, see the original report in the NLTK project:
https://github.com/nltk/nltk/issues/754

[1] https://github.com/apache/lucene.net/blob/master/src/contrib/Snowball/SF/Snowball/Ext/PortugueseStemmer.cs

[2] https://github.com/apache/lucene.net/blob/master/src/contrib/Snowball/SF/Snowball/Ext/SpanishStemmer.cs

JIRA link - [LUCENENET-547] created by he7d3r

bongohrtech commented 7 years ago

This is also the case with Apache Lucene (Java):

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/analysis/common/src/java/org/tartarus/snowball/ext/PortugueseStemmer.java#L84

I believe the right thing to do for Lucene.NET is to leave it as-is: analyzers are expected to behave the same in .NET and Java, and as a by-product that keeps indexes readable by both. It is easy enough to create your own analyzer by copying the code and fixing what needs to be fixed. It might also make sense to notify the Apache Lucene project so they can fix it in future releases.

by itamar

bongohrtech commented 7 years ago

This seems to be a reasonable request, since it's expected for Portuguese to work this way, and contributing the fix directly to the Snowball project (https://github.com/snowballstem/snowball) would literally take years to trickle down to Lucene and then Lucene.Net.

Actually, I have already attempted this, and it might work fine. However, this request doesn't include instructions on how to rework the ZIP file that is used by the tests to verify it works:

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/analysis/common/src/test/org/apache/lucene/analysis/snowball/TestSnowballVocabData.zip

Of course, without also altering the ZIP file, the tests for the Portuguese stemmer fail. Any chance you can add instructions on how to alter it to this request?

by nightowl888