buda-base / lucene-bo

Lucene analyzer for Tibetan
Apache License 2.0

ewts indexes? #12

Open eroux opened 7 years ago

eroux commented 7 years ago

It's not very clear how indexes are serialized on disk in terms of character encoding (see there), but it seems to me that it could be UTF-8 rather than UTF-16. In that case, storing the indexes in EWTS would divide the size of the on-disk indexes by 2. The situation should be clarified first, but if this is correct, EWTS indexes should be relatively easy to implement, although they will make indexing a bit slower. This could certainly be done after the tokenizer, in a separate filter. It's quite important that the EWTS string is first converted to Unicode and then back to EWTS, so that it is normalized.
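
To make the size argument concrete, here is a minimal sketch that compares the raw encoded size of one Tibetan syllable with its EWTS transliteration. The sample syllable and the helper name `encoded_sizes` are illustrative assumptions, and the sketch only measures plain string encodings; it says nothing about how Lucene actually lays out terms on disk, which is the part that still needs to be confirmed.

```python
# Illustrative sample (an assumption, not taken from the analyzer's tests):
# the Tibetan syllable "bod" in Unicode and in EWTS.
tibetan = "\u0f56\u0f7c\u0f51"  # the three code points BA, vowel sign O, DA
ewts = "bod"


def encoded_sizes(text):
    """Return the number of bytes `text` takes in UTF-8 and in UTF-16.

    "utf-16-le" is used so that no byte order mark is counted.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


tibetan_sizes = encoded_sizes(tibetan)
ewts_sizes = encoded_sizes(ewts)

print("Tibetan:", tibetan_sizes)
print("EWTS:   ", ewts_sizes)
```

Code points in the Tibetan block (U+0F00 to U+0FFF) take three bytes each in UTF-8 and two in UTF-16, while EWTS is plain ASCII and takes one byte per character in UTF-8. The actual saving depends on the text, since EWTS often needs more than one character per Tibetan code point (for example digraphs such as `kh` or `ng`), so a single syllable is not a reliable estimate for a whole index.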