buda-base / lucene-bo

Lucene analyzer for Tibetan
Apache License 2.0

sanskrit normalization #20

Open eroux opened 5 years ago

eroux commented 5 years ago

For indexing purposes, it might be useful to apply some simple normalization to Sanskrit, mainly normalizing r + geminate to r + single consonant. There are tons of examples in the canonical collections, for instance:
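A minimal sketch of the r + geminate rule, not the analyzer's actual code. It assumes the input is Tibetan Unicode in which stacked consonants use the subjoined letters U+0F90..U+0FBC, and it only collapses an identical doubled consonant directly under a head ra (U+0F62). The function name `normalize_r_geminate` is hypothetical.

```python
import re

RA = "\u0F62"  # TIBETAN LETTER RA

# Assumption: a geminate under ra is encoded as ra followed by the same
# subjoined consonant twice (range U+0F90..U+0FBC).
_R_GEMINATE = re.compile(RA + "([\u0F90-\u0FBC])\\1")


def normalize_r_geminate(text: str) -> str:
    """Collapse ra + doubled subjoined consonant into ra + single consonant."""
    return _R_GEMINATE.sub(lambda m: RA + m.group(1), text)


# d+ha r+m+ma (the "rmma" spelling) and its normalized form
doubled = "\u0F51\u0FB7\u0F62\u0FA8\u0FA8"
simple = "\u0F51\u0FB7\u0F62\u0FA8"
print(normalize_r_geminate(doubled) == simple)
```

Stacks such as d + dh, where the two consonants differ, are left untouched by this sketch and would need their own rule.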

eroux commented 5 years ago

There could also be the graphic variants:

eroux commented 5 years ago

as well as either ignoring or normalizing:

eroux commented 3 years ago

Perhaps also some sandhis, like ny+ts -> ny+dz...

And also normalizing ts -> c, tsh -> ch, dz -> j? This requires some care, as it shouldn't be done for Standard Tibetan.
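One way the ts / tsh / dz mapping could be kept away from Standard Tibetan is to gate it behind a flag, as in the sketch below. It reads "ts -> c" as mapping the Tibetan letters tsa, tsha, dza to ca, cha, ja, in both base and subjoined forms, which is an assumption about the intent. The `is_sanskrit` flag and the function name are hypothetical; how a token gets recognized as Sanskrit is left to the caller.

```python
# tsa/tsha/dza -> ca/cha/ja, base letters then subjoined letters
_AFFRICATE_MAP = str.maketrans({
    "\u0F59": "\u0F45",  # tsa  -> ca
    "\u0F5A": "\u0F46",  # tsha -> cha
    "\u0F5B": "\u0F47",  # dza  -> ja
    "\u0FA9": "\u0F95",  # subjoined tsa  -> subjoined ca
    "\u0FAA": "\u0F96",  # subjoined tsha -> subjoined cha
    "\u0FAB": "\u0F97",  # subjoined dza  -> subjoined ja
})


def normalize_affricates(token: str, is_sanskrit: bool) -> str:
    """Apply the mapping only to tokens the caller has flagged as Sanskrit."""
    return token.translate(_AFFRICATE_MAP) if is_sanskrit else token


print(normalize_affricates("\u0F59", True) == "\u0F45")    # mapped
print(normalize_affricates("\u0F59", False) == "\u0F59")   # left as-is
```

Since the same mapping would have to be applied at query time as well, the flag would need to be computed consistently on both sides.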