Open DavidHaslam opened 6 years ago
Notwithstanding my main question, I have detected a number of instances in the text that would still require correcting:
Normally, one would expect the Nukta sign to be immediately after the letter it qualifies.
Each of these instances might display correctly in some smart fonts, but unless they are corrected, such locations could easily be missed by word searches because their diacritics are in a different order.
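For anyone wanting to reproduce this kind of check, here is a minimal sketch in Python. It assumes that "out of order" means a Nukta (U+093C) whose preceding character is not a letter, for example one that follows a vowel sign. The function name `misplaced_nuktas` and the sample string are made up for illustration and are not taken from the Bible text.

```python
import unicodedata

NUKTA = "\u093C"  # DEVANAGARI SIGN NUKTA


def misplaced_nuktas(text):
    """Return (index, preceding character) for each Nukta that does not directly follow a letter."""
    hits = []
    for i, ch in enumerate(text):
        if ch != NUKTA:
            continue
        prev = text[i - 1] if i > 0 else ""
        # Letters have a general category starting with "L";
        # vowel signs and other combining marks start with "M".
        if not prev or not unicodedata.category(prev).startswith("L"):
            hits.append((i, prev))
    return hits


# Hypothetical sample: JA + Nukta + AA (expected order), then JA + AA + Nukta (Nukta after the vowel sign).
sample = "\u091C\u093C\u093E \u091C\u093E\u093C"
for index, prev in misplaced_nuktas(sample):
    print(index, f"U+{ord(prev):04X}" if prev else "(start of text)")
```

Reporting positions rather than rewriting the text keeps the check read-only, so each hit can be reviewed by hand before anything is corrected.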
In addition to these ordering peculiarities, I have also detected 7 instances where a different letter is unexpectedly followed by a Nukta sign. These are the 4 letters and respective counts:
U+090F 1 ए़ E
U+0918 3 घ़ GHA
U+092A 2 प़ PA
U+0932 1 ल़ LA
None of these four has a corresponding precomposed (composite) letter.
Locating these was like looking for a needle in a haystack.
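A short script can do the haystack search. The sketch below is only an illustration: it assumes the "expected" base letters are exactly those of the eight composites U+0958..U+095F, and it tallies any other letter that is directly followed by a Nukta. The names `EXPECTED` and `unexpected_nukta_bases` and the sample string are hypothetical.

```python
import unicodedata
from collections import Counter

NUKTA = "\u093C"  # DEVANAGARI SIGN NUKTA

# Assumption: the expected base letters are those of the composites U+0958..U+095F.
# Each of these decomposes canonically to <base letter, Nukta>, so take the first code point.
EXPECTED = {unicodedata.normalize("NFD", chr(cp))[0] for cp in range(0x0958, 0x0960)}


def unexpected_nukta_bases(text):
    """Count letters outside EXPECTED that are immediately followed by a Nukta."""
    counts = Counter()
    for prev, ch in zip(text, text[1:]):
        if ch == NUKTA and unicodedata.category(prev).startswith("L") and prev not in EXPECTED:
            counts[prev] += 1
    return counts


# Hypothetical sample: GHA+Nukta twice, KA+Nukta once (expected), PA+Nukta once.
sample = "\u0918\u093C \u0915\u093C \u092A\u093C \u0918\u093C"
for letter, n in sorted(unexpected_nukta_bases(sample).items()):
    print(f"U+{ord(letter):04X} {n} {letter}{NUKTA} {unicodedata.name(letter)}")
```

Run against the full text, this would print one line per unexpected letter in roughly the same layout as the list above.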
Unlike the text of the Assamese Bible, I have found that the Unicode text of the Hindi Bible is already normalized (to NFC),
i.e. none of the following canonically decomposable characters is present in the text:
U+0958 क़ QA
U+0959 ख़ KHHA
U+095A ग़ GHHA
U+095B ज़ ZA
U+095C ड़ DDDHA
U+095D ढ़ RHA
U+095E फ़ FA
U+095F य़ YYA
Rather, such letters are in decomposed form, consisting of the corresponding letter followed by a Nukta sign.
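This behaviour can be confirmed with Python's standard `unicodedata` module. The eight characters above are composition exclusions, so NFC leaves them as letter plus Nukta instead of recomposing them. The helper `is_nfc` is a made-up name for this sketch, not part of any library.

```python
import unicodedata


def is_nfc(text):
    """Return True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# Show what NFC does to each of the eight composites U+0958..U+095F.
for cp in range(0x0958, 0x0960):
    composite = chr(cp)
    nfc = unicodedata.normalize("NFC", composite)
    parts = " ".join(f"U+{ord(c):04X}" for c in nfc)
    print(f"U+{cp:04X} {unicodedata.name(composite)} -> {parts}")

# A text containing a composite is therefore not in NFC,
# whereas the letter + Nukta sequence is.
print(is_nfc("\u095B"))        # single code point ZA
print(is_nfc("\u091C\u093C"))  # JA followed by Nukta
```

The same `is_nfc` comparison, applied to each verse or line of a text, is one way to check a whole file for normalization.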
This prompts the question:
Refer to the Unicode Primary Exclusion List Table
cf. A lot hinges on the word "generally", doesn't it?