buda-base / lucene-bo

Lucene analyzer for Tibetan
Apache License 2.0

phonetics improvements #50

Open eroux opened 1 month ago

eroux commented 1 month ago

[Screenshot from 2024-10-29 16-03-42]

roopeux commented 4 weeks ago

"Dundul Dorje" not matching "bdud 'dul rdo rje" can hopefully be fixed, which would make the correct results rank on top. "Dundul" falsely matching "mthun 'khrul" is something we probably have to accept. Most users will understand these cases, and they should not be a major UX obstacle.

What also affects this ranking is that I demoted the _en fields to 25% of the weight of the other languages, because _en mostly matches the phonetics anyway, and we handle phonetics in a better way. The _en data also looks less consistent.
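
To illustrate the demotion, here is a minimal sketch in plain Java, not the actual search configuration. The class name, the field names (`prefLabel_en`, `prefLabel_bo_x_ewts`) and the raw scores are hypothetical; only the 25% factor for `_en` fields comes from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: per-field weights where "_en" fields count for 25%.
class FieldBoostSketch {
    static final double EN_BOOST = 0.25;      // from the comment: _en demoted to 25%
    static final double DEFAULT_BOOST = 1.0;  // assumption: other languages unchanged

    // Returns the weight applied to a field, based only on its suffix.
    public static double boostFor(String field) {
        return field.endsWith("_en") ? EN_BOOST : DEFAULT_BOOST;
    }

    // Scales a raw per-field match score by the field's weight.
    public static double weightedScore(String field, double rawScore) {
        return rawScore * boostFor(field);
    }

    // Returns the name of the field with the highest weighted score.
    public static String bestField(Map<String, Double> rawScores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : rawScores.entrySet()) {
            double s = weightedScore(e.getKey(), e.getValue());
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up raw scores: the _en field matches more strongly before weighting.
        Map<String, Double> raw = new LinkedHashMap<>();
        raw.put("prefLabel_en", 8.0);
        raw.put("prefLabel_bo_x_ewts", 3.0);
        System.out.println(bestField(raw));
    }
}
```

With these made-up numbers the `_en` match drops from 8.0 to 2.0 after weighting, so the Tibetan field (3.0) ranks first even though its raw score was lower.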

eroux commented 1 week ago

Also, "Nyingthik Yabzhi" should find "snying thig ya bzhi".