buda-base / lucene-bo

Lucene analyzer for Tibetan
Apache License 2.0

stacking normalization #29

Closed eroux closed 2 years ago

eroux commented 3 years ago

Currently, padma and pad+ma are treated as different (they convert to པདམ and པདྨ, respectively); we should find a way to harmonize them. This is not particularly easy in the general case and probably requires some tweaks to the sloppy EWTS conversion, but it would be reasonable to hardcode at least this example.
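
To make the "hardcode at least this example" idea concrete, here is a minimal sketch in plain Java. It is not the lucene-bo implementation: the class name `StackingNormalizer`, the exception table, and the choice to normalize toward the stacked form པདྨ are all assumptions made for illustration. It also assumes syllables are delimited only by the tsheg (U+0F0B), which is a simplification.

```java
import java.util.Map;

// Hypothetical helper, not part of lucene-bo: rewrites hardcoded unstacked
// spellings to their stacked equivalents, one syllable at a time.
final class StackingNormalizer {
    private static final char TSHEG = '\u0F0B'; // Tibetan syllable delimiter

    // Assumed exception table: unstacked spelling -> stacked spelling.
    // Only the example from this issue is listed.
    private static final Map<String, String> HARDCODED = Map.of(
        "\u0F54\u0F51\u0F58", "\u0F54\u0F51\u0FA8" // པདམ -> པདྨ
    );

    private StackingNormalizer() {}

    // Returns the stacked spelling if the whole syllable is a known
    // exception, otherwise returns the syllable unchanged.
    static String normalizeSyllable(String syllable) {
        return HARDCODED.getOrDefault(syllable, syllable);
    }

    // Splits the text on tsheg, normalizes each syllable, and puts the
    // tsheg characters back where they were.
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int start = 0;
        for (int i = 0; i <= text.length(); i++) {
            if (i == text.length() || text.charAt(i) == TSHEG) {
                out.append(normalizeSyllable(text.substring(start, i)));
                if (i < text.length()) {
                    out.append(TSHEG);
                }
                start = i + 1;
            }
        }
        return out.toString();
    }
}
```

Matching whole syllables rather than doing a raw substring replacement avoids rewriting longer syllables that merely begin with པདམ. Applying the same normalization at both index and query time would make the two spellings match; the general case would still need changes on the EWTS conversion side, as noted above.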