Esukhia / tibetan-sort-python

Differences with JS version #3

Open BenjaminGalliot opened 1 year ago

BenjaminGalliot commented 1 year ago

I noticed a few differences with the data from the JS version.

JS version:

['ཱ', 'ི', 'ཱི', 'ྀ', 'ཱྀ', 'ུ', 'ཱུ', 'ེ', 'ཻ', 'ོ', 'ཽ']
['ཀ', 'ྈྐ', 'ཫ', 'དཀ', 'བཀ', 'རྐ', 'ལྐ', 'སྐ', 'བརྐ', 'བསྐ']
['ཁ', 'ྈྑ', 'མཁ', 'འཁ']
['ག', 'དགག', 'དགང', 'དགད', 'དགན', 'དགབ', 'དགཝ', 'དགའ', 'དགར', 'དགལ', 'དགས', 'དགི', 'དགུ', 'དགེ', 'དགོ', 'དགྭ', 'དགྱ', 'དགྲ', 'བགག', 'བགང', 'བགད', 'བགབ', 'བགམ', 'བགཾ', 'བགཝ', 'བགའ', 'བགར', 'བགལ', 'བགི', 'བགུ', 'བགེ', 'བགོ', 'བགྭ', 'བགྱ', 'བགྲ', 'བགླ', 'མགག', 'མགང', 'མགད', 'མགབ', 'མགའ', 'མགར', 'མགལ', 'མགི', 'མགུ', 'མགེ', 'མགོ', 'མགྭ', 'མགྱ', 'མགྲ', 'འགག', 'འགང', 'འགད', 'འགན', 'འགབ', 'འགམ', 'འགཾ', 'འགའ', 'འགར', 'འགལ', 'འགས', 'འགི', 'འགུ', 'འགེ', 'འགོ', 'འགྭ', 'འགྱ', 'འགྲ', 'རྒ', 'ལྒ', 'སྒ', 'བརྒ', 'བསྒ']
['ང', 'ྂ', 'ྃ', 'དངག', 'དངང', 'དངད', 'དངན', 'དངབ', 'དངའ', 'དངར', 'དངལ', 'དངི', 'དངུ', 'དངེ', 'དངོ', 'མངག', 'མངང', 'མངད', 'མངན', 'མངབ', 'མངའ', 'མངར', 'མངལ', 'མངི', 'མངུ', 'མངེ', 'མངོ', 'རྔ', 'ལྔ', 'སྔ', 'བརྔ', 'བསྔ']
['ཅ', 'གཅ', 'བཅ', 'ལྕ', 'བལྕ']
['ཆ', 'མཆ', 'འཆ']
['ཇ', 'མཇ', 'འཇ', 'རྗ', 'ལྗ', 'བརྗ']
['ཉ', 'ྋྙ', 'གཉ', 'མཉ', 'རྙ', 'སྙ', 'བརྙ', 'བསྙ']
['ཏ', 'ཊ', 'ཏྭ', 'ཏྲ', 'གཏ', 'བཏ', 'རྟ', 'ལྟ', 'སྟ', 'བརྟ', 'བལྟ', 'བསྟ']
['ཐ', 'ཋ', 'མཐ', 'འཐ']
['ད', 'ཌ', 'གདག', 'གདང', 'གདད', 'གདན', 'གདབ', 'གདམ', 'གདཾ', 'གདའ', 'གདར', 'གདལ', 'གདས', 'གདི', 'གདུ', 'གདེ', 'གདོ', 'གདྭ', 'བདག', 'བདང', 'བདད', 'བདབ', 'བདམ', 'བདཾ', 'བདའ', 'བདར', 'བདལ', 'བདས', 'བདི', 'བདུ', 'བདེ', 'བདོ', 'བདྭ', 'མདག', 'མདང', 'མདད', 'མདན', 'མདབ', 'མདའ', 'མདར', 'མདལ', 'མདས', 'མདི', 'མདུ', 'མདེ', 'མདོ', 'མདྭ', 'འདག', 'འདང', 'འདད', 'འདན', 'འདབ', 'འདམ', 'འདཾ', 'འདཝ', 'འདའ', 'འདར', 'འདལ', 'འདས', 'འདི', 'འདུ', 'འདེ', 'འདོ', 'འདྭ', 'འདྲ', 'རྡ', 'ལྡ', 'སྡ', 'བརྡ', 'བལྡ', 'བསྡ']
['ན', 'ཎ', 'གནག', 'གནང', 'གནད', 'གནན', 'གནབ', 'གནམ', 'གནཾ', 'གནཝ', 'གནའ', 'གནར', 'གནལ', 'གནས', 'གནི', 'གནུ', 'གནེ', 'གནོ', 'གནྭ', 'མནག', 'མནང', 'མནད', 'མནན', 'མནབ', 'མནམ', 'མནཾ', 'མནའ', 'མནར', 'མནལ', 'མནས', 'མནི', 'མནུ', 'མནེ', 'མནོ', 'མནྭ', 'རྣ', 'སྣ', 'བརྣ', 'བསྣ']
['པ', 'ྉྤ', 'དཔག', 'དཔང', 'དཔད', 'དཔབ', 'དཔའ', 'དཔར', 'དཔལ', 'དཔས', 'དཔི', 'དཔུ', 'དཔེ', 'དཔོ', 'དཔྱ', 'དཔྲ', 'ལྤ', 'སྤ']
['ཕ', 'ྉྥ', 'འཕ']
['བ', 'དབག', 'དབང', 'དབད', 'དབན', 'དབབ', 'དབའ', 'དབར', 'དབལ', 'དབས', 'དབི', 'དབུ', 'དབེ', 'དབོ', 'དབྱ', 'དབྲ', 'འབག', 'འབང', 'འབད', 'འབན', 'འབབ', 'འབམ', 'འབཾ', 'འབའ', 'འབར', 'འབལ', 'འབས', 'འབི', 'འབུ', 'འབེ', 'འབོ', 'འབྱ', 'འབྲ', 'རྦ', 'ལྦ', 'སྦ']
['མ', 'ཾ', 'དམག', 'དམང', 'དམད', 'དམན', 'དམབ', 'དམཝ', 'དམའ', 'དམར', 'དམལ', 'དམས', 'དམི', 'དམུ', 'དམེ', 'དམོ', 'དམྭ', 'དམྱ', 'རྨ', 'སྨ']
['ཙ', 'གཙ', 'བཙ', 'རྩ', 'སྩ', 'བརྩ', 'བསྩ']
['ཚ', 'མཚ', 'འཚ']
['ཛ', 'མཛ', 'འཛ', 'རྫ', 'བརྫ']
['ཝ']
['ཞ', 'གཞ', 'བཞ']
['ཟ', 'གཟ', 'བཟ']
['འ']
['ཡ', 'གཡ']
['ར', 'ཪ', 'ཬ', 'བརླ', 'བཪླ']
['ལ']
['ཤ', 'ཥ', 'གཤ', 'བཤ']
['ས', 'གསག', 'གསང', 'གསད', 'གསན', 'གསབ', 'གསའ', 'གསར', 'གསལ', 'གསས', 'གསི', 'གསུ', 'གསེ', 'གསོ', 'གསྭ', 'བསག', 'བསང', 'བསད', 'བསབ', 'བསམ', 'བསཾ', 'བསའ', 'བསར', 'བསལ', 'བསས', 'བསི', 'བསུ', 'བསེ', 'བསོ', 'བསྭ', 'བསྲ', 'བསླ']
['ཧ', 'ལྷ']
['ཨ']
['།', '༎', '༏', '༐', '༑', '༔', '༴', '\u0F0B']

Python version:

['ཀ', 'ྈྐ', 'ཫ', 'དཀ', 'བཀ', 'རྐ', 'ལྐ', 'སྐ', 'བརྐ', 'བསྐ']
['ཁ', 'ྈྑ', 'མཁ', 'འཁ']
['ག', 'དགག', 'དགང', 'དགད', 'དགན', 'དགབ', 'དགཝ', 'དགའ', 'དགར', 'དགལ', 'དགས', 'དགི', 'དགུ', 'དགེ', 'དགོ',  'དགྭ', 'དགྱ', 'དགྲ', 'བགག', 'བགང', 'བགད', 'བགབ', 'བགམ', 'བགཾ', 'བགཝ', 'བགའ', 'བགར', 'བགལ', 'བགི',  'བགུ', 'བགེ', 'བགོ', 'བགྭ', 'བགྱ', 'བགྲ', 'བགླ', 'མགག', 'མགང', 'མགད', 'མགབ', 'མགའ', 'མགར', 'མགལ',  'མགི', 'མགུ', 'མགེ', 'མགོ', 'མགྭ', 'མགྱ', 'མགྲ', 'འགག', 'འགང', 'འགད', 'འགན', 'འགབ', 'འགམ', 'འགཾ',  'འགའ', 'འགར', 'འགལ', 'འགས', 'འགི', 'འགུ', 'འགེ', 'འགོ', 'འགྭ', 'འགྱ', 'འགྲ', 'རྒ', 'ལྒ', 'སྒ', 'བརྒ',  'བསྒ']
['ང', 'ྂ', 'ྃ', 'དངག', 'དངང', 'དངད', 'དངན', 'དངབ', 'དངའ', 'དངར', 'དངལ', 'དངི', 'དངུ', 'དངེ', 'དངོ', 'མངག',  'མངང', 'མངད', 'མངན', 'མངབ', 'མངའ', 'མངར', 'མངལ', 'མངི', 'མངུ', 'མངེ', 'མངོ', 'རྔ', 'ལྔ', 'སྔ', 'བརྔ',  'བསྔ']
['ཅ', 'གཅ', 'བཅ', 'ལྕ', 'བལྕ']
['ཆ', 'མཆ', 'འཆ']
['ཇ', 'མཇ', 'འཇ', 'རྗ', 'ལྗ', 'བརྗ']
['ཉ', 'ྋྙ', 'གཉ', 'མཉ', 'རྙ', 'ཪྙ', 'སྙ', 'བཪྙ', 'བརྙ', 'བསྙ']
['ཏ', 'ཊ', 'ཏྭ', 'ཏྲ', 'གཏ', 'བཏ', 'རྟ', 'ལྟ', 'སྟ', 'བརྟ', 'བལྟ', 'བསྟ']
['ཐ', 'ཋ', 'མཐ', 'འཐ']
['ད', 'ཌ', 'གདག', 'གདང', 'གདད', 'གདན', 'གདབ', 'གདམ', 'གདཾ', 'གདའ', 'གདར', 'གདལ', 'གདས', 'གདི', 'གདུ', 'གདེ',  'གདོ', 'གདྭ', 'བདག', 'བདང', 'བདད', 'བདབ', 'བདམ', 'བདཾ', 'བདའ', 'བདར', 'བདལ', 'བདས', 'བདི', 'བདུ', 'བདེ',  'བདོ', 'བདྭ', 'མདག', 'མདང', 'མདད', 'མདན', 'མདབ', 'མདའ', 'མདར', 'མདལ', 'མདས', 'མདི', 'མདུ', 'མདེ', 'མདོ',  'མདྭ', 'འདག', 'འདང', 'འདད', 'འདན', 'འདབ', 'འདམ', 'འདཾ', 'འདཝ', 'འདའ', 'འདར', 'འདལ', 'འདས', 'འདི', 'འདུ',  'འདེ', 'འདོ', 'འདྭ', 'འདྲ', 'རྡ', 'ལྡ', 'སྡ', 'བརྡ', 'བལྡ', 'བསྡ']
['ན', 'ཎ', 'གནག', 'གནང', 'གནད', 'གནན', 'གནབ', 'གནམ', 'གནཾ', 'གནཝ', 'གནའ', 'གནར', 'གནལ', 'གནས', 'གནི', 'གནུ',  'གནེ', 'གནོ', 'གནྭ', 'མནག', 'མནང', 'མནད', 'མནན', 'མནབ', 'མནམ', 'མནཾ', 'མནའ', 'མནར', 'མནལ', 'མནས', 'མནི',  'མནུ', 'མནེ', 'མནོ', 'མནྭ', 'རྣ', 'སྣ', 'བརྣ', 'བསྣ']
['པ', 'ྉྤ', 'དཔག', 'དཔང', 'དཔད', 'དཔབ', 'དཔའ', 'དཔར', 'དཔལ', 'དཔས', 'དཔི', 'དཔུ', 'དཔེ', 'དཔོ', 'དཔྱ',  'དཔྲ', 'ལྤ', 'སྤ']
['ཕ', 'ྉྥ', 'འཕ']
['བ', 'དབག', 'དབང', 'དབད', 'དབན', 'དབབ', 'དབའ', 'དབར', 'དབལ', 'དབས', 'དབི', 'དབུ', 'དབེ', 'དབོ', 'དབྱ',  'དབྲ', 'འབག', 'འབང', 'འབད', 'འབན', 'འབབ', 'འབམ', 'འབཾ', 'འབའ', 'འབར', 'འབལ', 'འབས', 'འབི', 'འབུ',  'འབེ', 'འབོ', 'འབྱ', 'འབྲ', 'རྦ', 'ལྦ', 'སྦ']
['མ', 'ཾ', 'དམག', 'དམང', 'དམད', 'དམན', 'དམབ', 'དམཝ', 'དམའ', 'དམར', 'དམལ', 'དམས', 'དམི', 'དམུ', 'དམེ', 'དམོ',  'དམྭ', 'དམྱ', 'རྨ', 'སྨ']
['ཙ', 'གཙ', 'བཙ', 'རྩ', 'སྩ', 'བརྩ', 'བསྩ']
['ཚ', 'མཚ', 'འཚ']
['ཛ', 'མཛ', 'འཛ', 'རྫ', 'བརྫ']
['ཞ', 'གཞ', 'བཞ']
['ཟ', 'གཟ', 'བཟ']
['ཞ', 'གཞ', 'བཞ']
['ཡ', 'གཡ']
['ར', 'ཪ', 'ཬ', 'བརླ', 'བཪླ']
['ཤ', 'ཥ', 'གཤ', 'བཤ']
['ས', 'གསག', 'གསང', 'གསད', 'གསན', 'གསབ', 'གསའ', 'གསར', 'གསལ', 'གསས', 'གསི', 'གསུ', 'གསེ', 'གསོ', 'གསྭ',  'བསག', 'བསང', 'བསད', 'བསབ', 'བསམ', 'བསཾ', 'བསའ', 'བསར', 'བསལ', 'བསས', 'བསི', 'བསུ', 'བསེ', 'བསོ',  'བསྭ', 'བསྲ', 'བསླ']
['ཧ', 'ལྷ']
['ཱ', 'ི', 'ཱི', 'ྀ', 'ཱྀ', 'ུ', 'ཱུ', 'ེ', 'ཻ', 'ོ', 'ཽ']
['།', '༎', '༏', '༐', '༑', '༔', '༴', '\u0F0B']

Some glyphs are missing in the Python version, ['ཞ', 'གཞ', 'བཞ'] is repeated, and there are some other differences…

[Screenshot from 2023-08-04 13-24-47]

I couldn't do a test run comparing the two versions; I just wanted to ask whether all of this is intentional!
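A comparison like the one described above can be automated. The following is only a minimal sketch: it uses a few blocks excerpted from the two listings rather than the full tables, and the names `JS_BLOCKS`, `PY_BLOCKS` and `compare_blocks` are hypothetical, not part of either library.

```python
from collections import Counter

# Excerpts copied from the two listings above (not the full tables).
JS_BLOCKS = [
    ['ཉ', 'ྋྙ', 'གཉ', 'མཉ', 'རྙ', 'སྙ', 'བརྙ', 'བསྙ'],
    ['ཝ'],
    ['ཞ', 'གཞ', 'བཞ'],
    ['ཟ', 'གཟ', 'བཟ'],
    ['འ'],
]
PY_BLOCKS = [
    ['ཉ', 'ྋྙ', 'གཉ', 'མཉ', 'རྙ', 'ཪྙ', 'སྙ', 'བཪྙ', 'བརྙ', 'བསྙ'],
    ['ཞ', 'གཞ', 'བཞ'],
    ['ཟ', 'གཟ', 'བཟ'],
    ['ཞ', 'གཞ', 'བཞ'],
]


def compare_blocks(reference, candidate):
    """Compare two block tables, keyed by each block's first item (its head)."""
    ref = {block[0]: block for block in reference}
    cand = {block[0]: block for block in candidate}
    head_counts = Counter(block[0] for block in candidate)
    return {
        # Heads present in the reference but absent from the candidate.
        "missing": [head for head in ref if head not in cand],
        # Heads that occur more than once in the candidate.
        "repeated": [head for head, n in head_counts.items() if n > 1],
        # Blocks present in both tables whose contents differ.
        "changed": {
            head: {
                "only_in_reference": [x for x in ref[head] if x not in cand[head]],
                "only_in_candidate": [x for x in cand[head] if x not in ref[head]],
            }
            for head in ref
            if head in cand and ref[head] != cand[head]
        },
    }


report = compare_blocks(JS_BLOCKS, PY_BLOCKS)
print(report)
```

Keying on the first item of each block is an assumption made for this sketch; it works here because every block in the listings starts with its head letter.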

eroux commented 1 year ago

Thanks! This code was lagging behind the JS version; I have now updated it. Thanks for spotting that. Just out of curiosity, in what context do you use this code?

ICU now has good collation rules for Tibetan (see blog post), so it might be better to use ICU directly.

BenjaminGalliot commented 1 year ago

I edit a French-Tibetan dictionary (LuaLaTeX-PDF) for a linguist, and use your Python script between LaTeX compilations to order the entries in the index.

Incidentally, I'm having a few problems with the following letters: བྷ and བྷེ...

In fact, I found your script just after reading this blog post. I had seen the ICU rules (the XML file), but I have to admit that at the moment I'm not sure how I could use them (this is the first time I have needed an external tool like this one)...

eroux commented 1 year ago

Oh I see, thanks! Is it with Guillaume Jacques? (I see you've published with him in the past)

You can use the ICU rules with the example provided on https://github.com/eroux/tibetan-collation/blob/master/implementations/Unicode/test.py (which might need a few updates? I'm not sure...)
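For readers who have not used ICU from Python before, here is a minimal sketch. It assumes the third-party PyICU package (imported as `icu`) is installed and that the bundled ICU data contains Tibetan collation for the `bo` locale; neither is guaranteed on a given system. The function name `sort_tibetan` is hypothetical, and this is not the linked `test.py`.

```python
try:
    import icu  # PyICU, a third-party binding to ICU (assumed to be installed)
except ImportError:
    icu = None


def sort_tibetan(words, locale="bo"):
    """Sort strings with ICU's collator for the given locale.

    Whether the tailored Tibetan rules are applied depends on the ICU
    version that PyICU is linked against.
    """
    if icu is None:
        raise RuntimeError("PyICU is not installed")
    collator = icu.Collator.createInstance(icu.Locale(locale))
    # getSortKey returns a byte string that compares in collation order.
    return sorted(words, key=collator.getSortKey)


if icu is not None:
    print(sort_tibetan(["\u0F66", "\u0F40", "\u0F56"]))
```

The example sorts three single letters given as escapes (U+0F66, U+0F40, U+0F56) only to show the call; real index entries would be passed in the same way.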

What problems do you have with བྷ and བྷེ?

BenjaminGalliot commented 1 year ago

Indeed, it could have been with Guillaume, since I'm currently working with him on a new version of the Japhug dictionary, but for today's purposes, it's with Camille Simon!

For བྷ and བྷེ, since I'm not very familiar with the language, this might seem silly... I had 4 entries that were apparently ordered correctly among the others. However, since I'm inserting drop caps (lettrines) in the index, I build a regular expression from the segments of each of the 30 or so blocks to detect a block change and insert the drop cap (like ^(བ|དབག|དབང|དབད|དབན|དབབ|དབའ|དབར|དབལ|དབས|དབི|དབུ|དབེ|དབོ|དབྱ|དབྲ|འབག|འབང|འབད|འབན|འབབ|འབམ|འབཾ|འབའ|འབར|འབལ|འབས|འབི|འབུ|འབེ|འབོ|འབྱ|འབྲ|རྦ|ལྦ|སྦ) for « བ། »). I can't find these glyphs in the blocks, so the regular expression doesn't match them and they remain apart. It works fine for around 15,000 expressions, but the 4 entries starting with these 2 glyphs have to be excluded for the moment...
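The block-detection idea described above could be sketched as follows. This is an illustration under assumptions, not the script actually used: the two blocks are short excerpts from the listings, the helper names are hypothetical, and choosing the longest matching prefix across blocks is a design choice of this sketch (a bare བ would otherwise also match entries such as བཀ…, which belong to the ཀ block).

```python
import re

# Excerpts of two blocks from the listings above; the first item is the head.
BLOCKS = [
    ['ཀ', 'དཀ', 'བཀ', 'རྐ', 'ལྐ', 'སྐ', 'བརྐ', 'བསྐ'],
    ['བ', 'དབང', 'དབུ', 'འབད', 'འབུ', 'རྦ', 'ལྦ', 'སྦ'],
]


def compile_block(block):
    """Build an anchored regex for one block, longest alternatives first."""
    alternatives = sorted(block, key=len, reverse=True)
    return re.compile("^(?:" + "|".join(re.escape(a) for a in alternatives) + ")")


def find_block_head(entry, blocks):
    """Return the head of the block whose prefix match on entry is longest."""
    best_len, best_head = 0, None
    for block in blocks:
        match = compile_block(block).match(entry)
        if match and len(match.group()) > best_len:
            best_len, best_head = len(match.group()), block[0]
    return best_head


# An entry starting with the third item of the first block is assigned to
# the first block, even though it also starts with the head of the second.
print(find_block_head(BLOCKS[0][2] + BLOCKS[1][0], BLOCKS) == BLOCKS[0][0])
```

A drop cap would then be inserted whenever the value returned for an entry differs from the value returned for the previous entry.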

By the way, I also have to manage the entries starting with numbers separately... They seem to be well-ordered too but I'll have to manually add a regex for them... They seem simpler than the other glyphs, can I directly make a regex like this one: ^(༠|༡|༢|༣|༤|༥|༦|༧|༨|༩)?

Thanks for the link, I'll check it out when I get the chance!

eroux commented 1 year ago

Oh I see, for བྷ maybe it's because you're using the precomposed character (U+0F57) instead of the NFD one (U+0F56 U+0FB7)? If so, you should add U+0F57 to your regex.

The numbers should work like that, yes.
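As an alternative to adding U+0F57 to the regex, entries could be decomposed before matching. The sketch below uses only the standard `unicodedata` module; the helper name `normalize_entry` is hypothetical.

```python
import unicodedata


def normalize_entry(entry: str) -> str:
    """Decompose precomposed Tibetan letters so prefix matching sees the base letter."""
    return unicodedata.normalize("NFD", entry)


precomposed = "\u0F57"        # TIBETAN LETTER BHA, a single code point
decomposed = "\u0F56\u0FB7"   # BA followed by SUBJOINED HA

# After NFD, the precomposed letter becomes the two-code-point sequence,
# so a pattern starting with U+0F56 matches it.
print(normalize_entry(precomposed) == decomposed)
print(normalize_entry(precomposed).startswith("\u0F56"))
```

Normalizing every entry once, before sorting and before block detection, keeps the block patterns themselves unchanged.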
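The digit alternation can also be written as a character range, since the Tibetan digits ༠ to ༩ occupy the contiguous code points U+0F20 to U+0F29. A minimal sketch, with the hypothetical helper `starts_with_digit`:

```python
import re

# Tibetan digits zero to nine are U+0F20 through U+0F29.
TIBETAN_DIGIT_START = re.compile("^[\u0F20-\u0F29]")


def starts_with_digit(entry: str) -> bool:
    """Return True if the entry begins with a Tibetan digit."""
    return TIBETAN_DIGIT_START.match(entry) is not None


print(starts_with_digit("\u0F21\u0F22"))  # an entry starting with a digit
print(starts_with_digit("\u0F40"))        # an entry starting with a letter
```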

BenjaminGalliot commented 1 year ago

This confirms what I was thinking, as I could see that the computer character seemed a little more complex than the others, already incorporating modifiers while others around seemed more deconstructed! Thanks!