Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.

https://tatoeba.org

GNU Affero General Public License v3.0


Words of 3 chars or less are not stemmed #2378

Open jiru opened 4 years ago

jiru commented 4 years ago

We currently configure Manticore’s indexes with min_stemming_len = 4, but that value is rather arbitrary and prevents finding words such as the present tense forms of علم (يعلم تعلم نعلم أعلم) or its imperative form (اعلم).

The documentation says that "gps" would be wrongly stemmed into "gp" but it’s not true with the current English stemmer of Snowball:

sphinxQL> select id from eng_main_index where MATCH('gps'); show meta;
Empty set (0.002 sec)

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 0     |
| total_found   | 0     |
| time          | 0.000 |
| keyword[0]    | gps   |
| docs[0]       | 0     |
| hits[0]       | 0     |
+---------------+-------+

Not sure if we want to just go for min_stemming_len = 1 and hope for the best, or do some more analysis of the impact of reducing the value.

Original thread.

ckjpn commented 4 years ago

Note that with your current setting, "run" gets stemmed, so "running" and "runs" are shown.

https://tatoeba.org/eng/sentences/search?from=eng&trans_to=und&query=run+home&sort=relevance&orphans=

(Note that "ran" is not found. Irregular English verb forms in general are not matched by the stemming on tatoeba.org.)

These get stemmed, too.

fix https://tatoeba.org/eng/sentences/search?from=eng&trans_to=und&query=fix&sort=relevance&orphans=

pig https://tatoeba.org/eng/sentences/search?from=eng&trans_filter=limit&query=pig&sort=relevance&trans_to=und

Yorwba commented 4 years ago

I think it's not so much that "run" gets stemmed despite only having three characters, but that longer words like "running", "runs" get reduced to "run" by the stemmer, so searching for just "run" also turns up results containing those other words.
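To illustrate this, here is a minimal sketch in Python. It assumes a hypothetical `toy_stem` function that strips only two English suffixes; it is not the Snowball stemmer that Manticore uses, and the sample sentences are made up. The point is only that the index stores stems, so a query for "run" matches any sentence whose words reduce to "run", even though "run" itself is left unchanged.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Hypothetical toy stemmer (NOT Snowball): strips a couple of suffixes."""
    for suffix in ("ning", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(sentences):
    """Map each stem to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            index[toy_stem(word)].add(sentence_id)
    return index


def search(index, query):
    """Stem the query the same way as the indexed words, then look it up."""
    return sorted(index.get(toy_stem(query.lower()), set()))


# Made-up sample sentences, for illustration only.
SENTENCES = ["I run home", "She is running home", "He runs home", "They ran home"]
INDEX = build_index(SENTENCES)

for hit in search(INDEX, "run"):
    print(SENTENCES[hit])
```

In this sketch the irregular form "ran" keeps its own stem, so the fourth sentence is not returned for the query "run".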

Since the present tense and imperative forms of that Arabic verb all have four characters, they should get stemmed as well. It's just that the Arabic stemmer does not seem to be very thorough. On Snowball's demo page, the input

علم

يعلم تعلم نعلم أعلم

اعلم

gets turned into the output

علم

يعلم تعلم نعلم اعلم

اعلم

The only difference is that أعلم in the second line loses the diacritic on the first letter from the right.

jiru commented 4 years ago

@Yorwba Thanks for double-checking; I didn’t know about the demo page. Still, lowering the limit should allow stemming of short words, such as ups → up.
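A rough sketch of the effect of the threshold, in Python. The function name `stem_with_min_len` and the single suffix rule are assumptions for illustration: the rule is loosely modelled on how the English Snowball stemmer treats a trailing "s" (dropped only when a vowel occurs earlier than the letter right before it) and is not the real stemmer. The length check reflects how min_stemming_len is described in this thread: words shorter than the threshold are left unstemmed.

```python
VOWELS = set("aeiouy")


def stem_with_min_len(word: str, min_stemming_len: int) -> str:
    """Toy model of a stemming length threshold (hypothetical, not Manticore code)."""
    # Words shorter than the threshold are passed through untouched.
    if len(word) < min_stemming_len:
        return word
    # Simplified trailing-"s" rule: strip the "s" only if a vowel appears
    # somewhere before the letter that immediately precedes it.
    if word.endswith("s") and any(ch in VOWELS for ch in word[:-2]):
        return word[:-1]
    return word


for threshold in (4, 1):
    for word in ("ups", "gps"):
        print(threshold, word, stem_with_min_len(word, threshold))
```

Under these assumptions, "ups" stays as "ups" with a threshold of 4 and becomes "up" with a threshold of 1, while "gps" is unchanged in both cases because it contains no vowel.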