Open eroux opened 5 years ago
There could also be the graphic variants:
as well as either ignoring or normalizing:
perhaps some sandhis like ny+ts -> ny+dz ...
and also normalizing ts -> c, tsh -> ch, dz -> j? This requires a bit of wit, as it shouldn't be done for Standard Tibetan.
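A minimal sketch of the ts -> c, tsh -> ch, dz -> j idea, assuming the input is Unicode Tibetan rather than transliteration and that the caller already knows whether a token is Sanskrit. The `normalize_affricates` name and the `is_sanskrit` flag are hypothetical, not part of any existing analyzer:

```python
# Hypothetical mapping: tsa/tsha/dza -> ca/cha/ja, for both the base
# letters and their subjoined forms in the Unicode Tibetan block.
AFFRICATE_MAP = str.maketrans({
    "\u0F59": "\u0F45",  # TSA  -> CA
    "\u0F5A": "\u0F46",  # TSHA -> CHA
    "\u0F5B": "\u0F47",  # DZA  -> JA
    "\u0FA9": "\u0F95",  # subjoined TSA  -> subjoined CA
    "\u0FAA": "\u0F96",  # subjoined TSHA -> subjoined CHA
    "\u0FAB": "\u0F97",  # subjoined DZA  -> subjoined JA
})


def normalize_affricates(text: str, is_sanskrit: bool) -> str:
    """Apply the mapping only to text flagged as Sanskrit.

    Standard Tibetan is returned untouched, since the normalization
    would merge distinct letters there.
    """
    if not is_sanskrit:
        return text
    return text.translate(AFFRICATE_MAP)
```

How to decide `is_sanskrit` is the hard part and is left open here.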
For indexing purposes, it might be relevant to do some easy normalization of Sanskrit, mostly normalizing r+geminate to r+simple consonant. There are tons of examples in canonical collections, for instance:
རྨྨ -> རྨ
རྦྦ -> རྦ
རྒྒ -> རྒ
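The r+geminate cases above could be handled with a single regular expression. This is only a sketch: it assumes the head letter is RA (U+0F62) and that the geminate is the same subjoined consonant (U+0F90 to U+0FBC) repeated twice in a row. The `normalize_r_geminate` name is hypothetical:

```python
import re

# RA (U+0F62), then one subjoined consonant, then the same one again.
R_GEMINATE = re.compile(r"(\u0F62)([\u0F90-\u0FBC])\2")


def normalize_r_geminate(text: str) -> str:
    """Collapse r + doubled subjoined consonant to r + single consonant."""
    return R_GEMINATE.sub(r"\1\2", text)
```

Stacks where the two subjoined consonants differ are left unchanged, since the backreference only matches an exact repeat.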