christiansteinert / tibetan-dictionary

results are sorted in ascii, not in Tibetan collation order #15

Open nbuwe opened 1 year ago

nbuwe commented 1 year ago

Search for las and observe that the list of results goes like: las, las 'bo, ..., las 'bras, ..., las 'byung, ..., las 'char, ... which is obviously sorted in ASCII order.

christiansteinert commented 1 year ago

Indeed. At the moment the sorting is based on the ASCII order of the Wylie-transliterated Tibetan, not on proper Tibetan dictionary order. This is the lazy way the sorting works for now, but I agree that it is not correct.

Changing this without negatively impacting performance will require some changes to the database and especially the process of creating that database. But I agree that it would be good to improve this at some point.

Possible solution (note for later):

nbuwe commented 1 year ago

Instead of a global order, which generally speaking depends on every key in the database, you can use a sort key that depends only on the key itself: effectively, a custom strxfrm(3). ISTR ICU might already have an implementation.

nbuwe commented 1 year ago

Cf. e.g. eroux/tibetan-collation

christiansteinert commented 1 year ago

Transforming each term into a different representation that can be used for collation would be an option. But at first glance I don't see an implementation for that in ICU, only an algorithm for the sorting itself, which works by describing the relative order of various tokens. Also, I want to keep the database as compact as possible for the mobile application. So although basing the order on the entire content of the database may seem somewhat inelegant, it would yield a numeric criterion for ordering and would therefore not require another string column per term to represent the sort order.
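The numeric criterion described above could look roughly like the following sketch, which assigns each term an integer rank when the database is built. The table name dictionary, the column sort_rank, and both function names are assumptions made for illustration, not the project's actual schema; sort_key stands for whatever collation key function is used at build time.

```python
import sqlite3


def build_ranked_table(conn, terms, sort_key):
    """Sort all terms once at build time and store only an integer rank."""
    ordered = sorted(set(terms), key=sort_key)
    conn.execute(
        "CREATE TABLE dictionary ("
        "term TEXT PRIMARY KEY, sort_rank INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO dictionary (term, sort_rank) VALUES (?, ?)",
        ((term, rank) for rank, term in enumerate(ordered)),
    )
    conn.commit()


def fetch_page(conn, prefix, page, page_size=20):
    """Return one page of prefix matches, ordered by the stored rank."""
    rows = conn.execute(
        "SELECT term FROM dictionary WHERE term LIKE ? "
        "ORDER BY sort_rank LIMIT ? OFFSET ?",
        (prefix + "%", page_size, page * page_size),
    )
    return [row[0] for row in rows]
```

The trade-off is the one named in the comment: the rank of a term depends on every other term, so adding entries means recomputing ranks, but the database carries only one extra integer per term and pagination stays a plain LIMIT/OFFSET query.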

christiansteinert commented 1 year ago

Correction: ICU may have something like that after all, as shown here: https://unicode-org.github.io/icu/userguide/collation/architecture#sort-key-size But I am still not sure it is really easier to implement, and it is definitely a lot less compact.

nbuwe commented 1 year ago

With strxfrm-style sort keys you don't need to store them. A typical search result is rarely more than a few dozen words, and computing strxfrm sort keys for a hundred strings is not expensive.
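A rough sketch of that approach: fetch the matches unsorted, then compute the sort keys in memory at query time, so nothing extra is stored. The table name dictionary, the column term, and the function name search_sorted are hypothetical; sort_key is any function that maps a single term to a comparable key.

```python
import sqlite3


def search_sorted(conn, prefix, sort_key):
    """Fetch all prefix matches, then order them with keys computed on the fly."""
    rows = conn.execute(
        "SELECT term FROM dictionary WHERE term LIKE ?",
        (prefix + "%",),
    ).fetchall()
    # sorted() calls sort_key once per term, so the cost grows with the
    # size of the result set, not with the size of the database.
    return sorted((row[0] for row in rows), key=sort_key)
```

Note that this reads the whole result set before ordering it, so any paging would have to be done on the sorted list in the application rather than with LIMIT/OFFSET in the query.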

christiansteinert commented 1 year ago

I am not yet sure if I want to give up pagination. I will think about it. Thanks a lot for your input, it is greatly appreciated!