UniStuttgart-VISUS / damast

Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World" (VolkswagenFoundation)
MIT License

Consider the Arabic ʿain letter for sorting place names #94

Closed mfranke93 closed 2 years ago

mfranke93 commented 2 years ago

@tutebatti's comment here regarding the Arabic letter 'ain made me realize we are not 100% on the same page about it. At the moment, the 'ain is not ignored for sorting; only the "normal" ASCII apostrophe (U+0027 APOSTROPHE) and U+2019 RIGHT SINGLE QUOTATION MARK (’) are (see here and here).

As far as I know, the 'ain is not used in the primary place names (we discussed this some months or years ago), which is why we only use those characters. If you want, I could add the U+02BB ʻ MODIFIER LETTER TURNED COMMA and U+02BF ʿ MODIFIER LETTER LEFT HALF RING (both used for 'ain) characters as well. Let me know.
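To make the difference concrete, here is a minimal sketch of a sort key that drops a configurable set of characters. It is not the project's actual code: `make_sort_key`, the use of `casefold()`, and the sample place names are assumptions made for illustration only.

```python
# Characters described above as currently ignored when sorting.
CURRENTLY_IGNORED = "\u0027\u2019"  # APOSTROPHE, RIGHT SINGLE QUOTATION MARK
# Characters proposed as additions (both used for 'ain).
PROPOSED_ADDITIONS = "\u02bb\u02bf"  # MODIFIER LETTER TURNED COMMA, LEFT HALF RING


def make_sort_key(ignored):
    """Return a key function that drops every character in `ignored`.

    Hypothetical helper; case-insensitive comparison is an assumption.
    """
    table = str.maketrans("", "", ignored)

    def sort_key(name):
        return name.translate(table).casefold()

    return sort_key


# Hypothetical sample names, not taken from the project data.
names = ["Baghdad", "\u02bfAkka", "'Amman", "Aleppo"]

current = sorted(names, key=make_sort_key(CURRENTLY_IGNORED))
extended = sorted(names, key=make_sort_key(CURRENTLY_IGNORED + PROPOSED_ADDITIONS))

print(current)   # the name starting with U+02BF sorts last
print(extended)  # the name starting with U+02BF sorts under "A"
```

With only the two apostrophes ignored, a name starting with U+02BF sorts after all Latin letters, because its code point is higher than any ASCII letter. Once the half ring is ignored too, the name sorts by its first real letter.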

tutebatti commented 2 years ago

Ideally, the data itself should contain only one Unicode character for the simplified form (U+0027) and one for the "scientific" transcription (U+02BF ʿ MODIFIER LETTER LEFT HALF RING). If this is consistent in the data, we can decide on the sorting.

In that case, I would tend to treat both characters the same, namely ignore them when sorting, because we now use the "Latin" sorting logic for the transcriptions.

At any rate, I need to double-check this with the others.
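A consistency check along these lines could be sketched as follows. This is only an illustration: the function names, the candidate set (the four characters mentioned so far in this thread), and the sample names are assumptions, and the real data would come from the database rather than a literal list.

```python
import unicodedata
from collections import Counter

# Apostrophe-like characters mentioned so far in this thread.
CANDIDATES = "\u0027\u2019\u02bb\u02bf"
# The two characters the data should ideally be restricted to.
EXPECTED = "\u0027\u02bf"


def count_candidates(names):
    """Count how often each candidate character occurs across all names."""
    counts = Counter()
    for name in names:
        counts.update(ch for ch in name if ch in CANDIDATES)
    return counts


def describe(counts):
    """Return one report line per character found, in code point order."""
    lines = []
    for ch, n in sorted(counts.items()):
        status = "expected" if ch in EXPECTED else "unexpected"
        lines.append(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {n} ({status})")
    return lines


# Hypothetical sample names, not taken from the project data.
sample = ["\u02bfAkka", "'Amman", "Ma\u2019arra", "\u02bbAyn"]
for line in describe(count_candidates(sample)):
    print(line)
```

Any line marked "unexpected" points at a character that would need to be replaced before the data can be called consistent.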

tutebatti commented 2 years ago

> Ideally, the data itself should contain only one Unicode character for the simplified form (U+0027) and one for the "scientific" transcription (U+02BF ʿ MODIFIER LETTER LEFT HALF RING). If this is consistent in the data, we can decide on the sorting.

Also, there should be one more Unicode character, namely U+02BE ʾ MODIFIER LETTER RIGHT HALF RING for the Arabic hamzah; its simplified transcription is the same as that of ʿain, namely U+0027.

@rpbarczok is checking for consistency and will report back.
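The relationship between the scientific and the simplified transcription can be sketched as a small mapping. The `simplify` helper and the sample strings are hypothetical and only illustrate that both half rings collapse to the same apostrophe; the sketch ignores any other differences between the two transcription styles.

```python
# Both scientific characters map to the plain apostrophe (U+0027).
SIMPLIFY = str.maketrans({
    "\u02bf": "\u0027",  # ʿain: MODIFIER LETTER LEFT HALF RING
    "\u02be": "\u0027",  # hamzah: MODIFIER LETTER RIGHT HALF RING
})


def simplify(name):
    """Replace ʿain and hamzah with the simplified apostrophe (hypothetical helper)."""
    return name.translate(SIMPLIFY)


# Hypothetical sample strings, not taken from the project data.
print(simplify("Qal\u02bfa"))  # ʿain becomes an apostrophe
print(simplify("Ra\u02bes"))   # hamzah becomes the same apostrophe
```

The mapping is lossy: after simplification, ʿain and hamzah can no longer be told apart, so the scientific form cannot be reconstructed from the simplified one.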

tutebatti commented 2 years ago

Apart from the consistency of the data, which we will take care of in the coming days, we decided to exclude the following characters from sorting; the sorting itself should otherwise follow the Latin order, as discussed.

This might also affect the Armenian scientific transcription, which uses U+02BF too, but that should not cause any problems.
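One possible reading of this decision, as a rough sketch: the list of excluded characters is not reproduced in this thread, so the sketch assumes the three characters discussed above (U+0027, U+02BF, U+02BE). Interpreting "Latin order" as "compare base letters without diacritics" is also an assumption, as are the function name and the sample names.

```python
import unicodedata

# Assumption: the three characters discussed in this thread.
IGNORED = {"\u0027", "\u02bf", "\u02be"}


def latin_sort_key(name):
    """Fold a transcription to its base Latin letters for sorting."""
    # NFD splits letters such as U+1E24 into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", name)
    kept = [
        ch for ch in decomposed
        # Combining marks have category "Mn". The half rings are modifier
        # letters ("Lm"), so they have to be excluded explicitly.
        if ch not in IGNORED and unicodedata.category(ch) != "Mn"
    ]
    return "".join(kept).casefold()


# Hypothetical sample names in Arabic and Armenian transcription.
names = ["T\u02bfalin", "Tavush", "Tat\u02bfew", "\u1e24im\u1e63"]
ordered = sorted(names, key=latin_sort_key)
print(ordered)
```

Under these assumptions, an Armenian aspirate such as tʿ simply sorts together with plain t, which matches the expectation that ignoring U+02BF causes no problems there.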

@rpbarczok, please confirm.

rpbarczok commented 2 years ago

confirmed