Open jiru opened 4 years ago
Note that with your current setting, "run" gets stemmed, so "running" and "runs" are shown.
https://tatoeba.org/eng/sentences/search?from=eng&trans_to=und&query=run+home&sort=relevance&orphans=
(Note that "ran" is not found; irregular English verb forms in general are not matched by the stemming on tatoeba.org.)
These get stemmed, too.
fix https://tatoeba.org/eng/sentences/search?from=eng&trans_to=und&query=fix&sort=relevance&orphans=
I think it's not so much that "run" gets stemmed despite only having three characters, but that longer words like "running" and "runs" get reduced to "run" by the stemmer, so searching for just "run" also turns up results containing those other words.
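To illustrate the point above, here is a minimal sketch of stem-based matching. `toy_stem` is a hypothetical suffix stripper written only for this example; it is not Snowball and not what Manticore runs. It only shows why a three-letter query can match longer words while an irregular form like "ran" is missed.

```python
def toy_stem(word: str) -> str:
    """Hypothetical toy stemmer (NOT Snowball): strips a few common suffixes."""
    for suffix in ("ing", "es", "ed", "s"):
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "runn" -> "run".
    if len(word) > 3 and word[-1] == word[-2]:
        word = word[:-1]
    return word


def search(query: str, sentences: list[str]) -> list[str]:
    """Return sentences containing a token whose stem equals the query's stem."""
    query_stem = toy_stem(query.lower())
    return [
        s for s in sentences
        if query_stem in {toy_stem(token) for token in s.lower().split()}
    ]


SENTENCES = [
    "She is running home",
    "He runs every day",
    "I ran home",
]

# "run" itself is left unchanged; it matches because "running" and "runs"
# are reduced to the same stem. "ran" keeps its own stem and is not matched.
hits = search("run", SENTENCES)
```

Here `hits` contains the first two sentences only, which mirrors the behaviour described for the "run home" search.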
Since the present tense and imperative forms of that Arabic verb all have four characters, they should get stemmed as well. It's just that the Arabic stemmer does not seem to be very thorough. On Snowball's demo page, the input
gets turned into the output
The only difference is that أعلم in the second line loses the diacritic on the first letter from the right.
@Yorwba Thanks for double checking; I didn’t know about the demo page. Still, lowering the limit should allow stemming of small words, such as ups → up.
We currently configure Manticore’s indexes with
min_stemming_len = 4
but that value is rather arbitrary and prevents finding words such as علم’s present tense forms (يعلم تعلم نعلم أعلم) or imperative form (اعلم). The documentation says that "gps" would be wrongly stemmed into "gp", but that is not true with the current English stemmer of Snowball:
Not sure if we want to just go for
min_stemming_len = 1
and hope for the best, or do some more analysis of the impact of reducing the value. Original thread.
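One way to approach the impact analysis could look like the sketch below. It assumes that a word is stemmed only when its length is at least `min_stemming_len`, and that length is counted in characters; both are assumptions about how the gate behaves, not taken from Manticore's code. `naive_stem`, `gated_stem` and `changed_by_lowering` are hypothetical names; for a real analysis `naive_stem` would have to be replaced by the actual Snowball stemmer for the language in question.

```python
from typing import Callable


def naive_stem(word: str) -> str:
    """Hypothetical placeholder stemmer: drops one trailing "s". NOT Snowball."""
    return word[:-1] if len(word) > 1 and word.endswith("s") else word


def gated_stem(word: str, min_len: int, stem: Callable[[str], str]) -> str:
    """Stem only words with at least min_len characters (assumed semantics)."""
    return stem(word) if len(word) >= min_len else word


def changed_by_lowering(
    vocab: list[str], old_min: int, new_min: int, stem: Callable[[str], str]
) -> dict[str, tuple[str, str]]:
    """Map each word whose indexed form changes to its (old, new) forms."""
    changes = {}
    for word in vocab:
        old_form = gated_stem(word, old_min, stem)
        new_form = gated_stem(word, new_min, stem)
        if old_form != new_form:
            changes[word] = (old_form, new_form)
    return changes


# Compare the current threshold (4) with the proposed one (1).
report = changed_by_lowering(["ups", "up", "runs", "gps"], 4, 1, naive_stem)
```

With this placeholder stemmer, `report` lists "ups" and "gps" as changed, while "runs" is unaffected because it already meets the current threshold. The "gps" entry is an artifact of `naive_stem`; the thread reports that Snowball's English stemmer leaves "gps" alone, which is why running the comparison with the real stemmer over the actual vocabulary would be the informative version of this check.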