
buda-base / lucene-bo

Lucene analyzer for Tibetan

Apache License 2.0


sanskrit normalization #20

Open eroux opened 5 years ago

eroux commented 5 years ago

For indexing purposes, it might be useful to apply some simple normalization to Sanskrit, mainly normalizing r + geminate consonant to r + simple consonant. There are many examples in canonical collections, for instance:

རྨྨ -> རྨ
རྦྦ -> རྦ
རྒྒ -> རྒ
etc.
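
A minimal sketch of this degemination in Python (the analyzer itself is a Lucene/Java project, so this only illustrates the rule). It assumes a geminate is the same subjoined consonant (U+0F90 to U+0FB9) written twice directly under the letter RA (U+0F62); the function name `degeminate_after_ra` is hypothetical.

```python
import re

# Assumption: the geminate is one subjoined consonant (U+0F90..U+0FB9)
# repeated immediately after the base letter RA (U+0F62).
_R_GEMINATE = re.compile("\u0F62([\u0F90-\u0FB9])\\1")


def degeminate_after_ra(text: str) -> str:
    """Collapse RA + doubled subjoined consonant to RA + single consonant."""
    return _R_GEMINATE.sub("\u0F62\\1", text)
```

Stacks that do not start with RA are left unchanged, so the rule stays limited to the r + geminate case described above.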

eroux commented 5 years ago

Graphic variants could also be normalized:

0FB0 --> 0F71
0FBB --> 0FB1
0FBC --> 0FB2
0FBA --> 0FAD
0F6A --> 0F62
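
The mapping above could be expressed as a plain character translation table. This is a sketch in Python rather than the analyzer's Java; `GRAPHIC_VARIANTS` and `normalize_graphic_variants` are hypothetical names.

```python
# Code point pairs taken from the list above (variant -> normalized form).
GRAPHIC_VARIANTS = str.maketrans({
    "\u0FB0": "\u0F71",  # subjoined -a       -> vowel sign aa
    "\u0FBB": "\u0FB1",  # fixed-form sub. ya -> subjoined ya
    "\u0FBC": "\u0FB2",  # fixed-form sub. ra -> subjoined ra
    "\u0FBA": "\u0FAD",  # fixed-form sub. wa -> subjoined wa
    "\u0F6A": "\u0F62",  # fixed-form ra      -> ra
})


def normalize_graphic_variants(text: str) -> str:
    """Replace each graphic variant with its normalized code point."""
    return text.translate(GRAPHIC_VARIANTS)
```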

eroux commented 5 years ago

The following signs could also be either ignored or normalized:

0F7E
0F82
0F83
0F86
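
A sketch of the "ignoring" option only, in Python: the four signs are simply deleted before indexing. What they would be normalized to is not specified here, so that option is not shown; `strip_ignored_signs` is a hypothetical name.

```python
# Signs listed above; the third argument of str.maketrans marks them for deletion.
IGNORED_SIGNS = str.maketrans("", "", "\u0F7E\u0F82\u0F83\u0F86")


def strip_ignored_signs(text: str) -> str:
    """Drop U+0F7E, U+0F82, U+0F83 and U+0F86 from the text."""
    return text.translate(IGNORED_SIGNS)
```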

eroux commented 3 years ago

Perhaps also some sandhis, like ny+ts -> ny+dz...

And also normalize ts -> c, tsh -> ch, dz -> j? This would require some care, as it shouldn't be done for Standard Tibetan.
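
A rough Python sketch of the ts/tsh/dz mapping, applied to both base and subjoined letters. How Sanskrit passages would be told apart from Standard Tibetan is left open: the `is_sanskrit` flag is a hypothetical parameter that the caller would have to supply, and `normalize_affricates` is a hypothetical name.

```python
# ts -> c, tsh -> ch, dz -> j, for base letters and their subjoined forms.
AFFRICATES = str.maketrans({
    "\u0F59": "\u0F45",  # tsa  -> ca
    "\u0F5A": "\u0F46",  # tsha -> cha
    "\u0F5B": "\u0F47",  # dza  -> ja
    "\u0FA9": "\u0F95",  # subjoined tsa  -> subjoined ca
    "\u0FAA": "\u0F96",  # subjoined tsha -> subjoined cha
    "\u0FAB": "\u0F97",  # subjoined dza  -> subjoined ja
})


def normalize_affricates(text: str, is_sanskrit: bool) -> str:
    """Apply the mapping only when the caller marks the text as Sanskrit."""
    return text.translate(AFFRICATES) if is_sanskrit else text
```

Keeping the decision outside the function leaves Standard Tibetan text untouched by default.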