FreeBiblesIndia / Hindi_Bible

Hindi Bible (हिंदी बाइबिल). This work is made available under a Creative Commons Attribution-ShareAlike 4.0 International License.
http://www.freebiblesindia.in/bible/hin/

Unicode Normalization and Devanagari? #5

Open DavidHaslam opened 6 years ago

DavidHaslam commented 6 years ago

Unlike the text in the Assamese Bible, I have found that the Unicode text of the Hindi Bible is already normalized (to NFC).

i.e. None of the following canonically decomposable characters are present in the text.

Rather, such letters appear in decomposed form: the corresponding base letter followed by a Nukta sign.
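A check of this kind can be sketched in Python with the standard `unicodedata` module. This is only a minimal sketch, not necessarily how the check was actually done. It assumes the characters in question are the eight precomposed Nukta letters U+0958 to U+095F, and the helper names `is_nfc` and `count_precomposed` are hypothetical.

```python
import unicodedata
from collections import Counter

# Assumption: the characters meant are the eight precomposed nukta letters
# U+0958..U+095F (DEVANAGARI LETTER QA through YYA).
PRECOMPOSED_NUKTA_LETTERS = {chr(cp) for cp in range(0x0958, 0x0960)}


def is_nfc(text: str) -> bool:
    """Return True if NFC normalization would leave the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


def count_precomposed(text: str) -> Counter:
    """Count occurrences of each precomposed nukta letter in the text."""
    return Counter(ch for ch in text if ch in PRECOMPOSED_NUKTA_LETTERS)
```

An empty result from `count_precomposed` together with `is_nfc(text)` returning `True` would match the observation above.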

This prompts the question:

Was this intended? Or has perhaps the source text been normalized inadvertently?

Refer to the Unicode Primary Exclusion List Table

1. Script-specifics: canonically decomposable characters that are generally not the preferred form for particular scripts.

  • These cannot be computed from information in the Unicode Character Database.
  • An example is U+0958 (क़) DEVANAGARI LETTER QA.

A lot hinges on the word "generally", doesn't it?
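The practical effect of that exclusion can be seen with a short Python snippet. Because U+0958 is excluded from composition, NFC leaves it decomposed, exactly as NFD does. The helper `codepoints` is a hypothetical name used only for display.

```python
import unicodedata


def codepoints(s: str) -> list:
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


qa = "\u0958"  # DEVANAGARI LETTER QA (precomposed)
nfc = unicodedata.normalize("NFC", qa)
nfd = unicodedata.normalize("NFD", qa)

print(codepoints(nfc))  # ['U+0915', 'U+093C']  (KA + NUKTA)
print(nfc == nfd)       # True
```

So any tool that applies NFC to the source text will replace the precomposed letters without warning.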

DavidHaslam commented 6 years ago

Notwithstanding my main question, I have detected a number of instances in the text that would still require correcting:

  1. A letter followed by 2 Nukta signs
  2. A vowel sign followed by a Nukta sign
  3. A Virama sign followed by a Nukta sign

Normally, one would expect the Nukta sign to be immediately after the letter it qualifies.
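The three cases listed above could be located with regular expressions along the following lines. This is a sketch under stated assumptions: "vowel sign" is taken to mean the dependent vowel signs U+093E to U+094C only, and the function name `find_misplaced_nuktas` is hypothetical.

```python
import re

NUKTA = "\u093c"              # DEVANAGARI SIGN NUKTA
VIRAMA = "\u094d"             # DEVANAGARI SIGN VIRAMA
VOWEL_SIGNS = "\u093e-\u094c"  # assumption: AA through AU only

PATTERNS = {
    "double nukta": re.compile(NUKTA + NUKTA),
    "vowel sign + nukta": re.compile("[" + VOWEL_SIGNS + "]" + NUKTA),
    "virama + nukta": re.compile(VIRAMA + NUKTA),
}


def find_misplaced_nuktas(text: str) -> list:
    """Return (kind, offset) pairs for each suspicious sequence, in text order."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.start()))
    return sorted(hits, key=lambda hit: hit[1])
```

For example, `find_misplaced_nuktas("\u0915\u093e\u093c")` reports a vowel sign followed by a Nukta at offset 1, while a well-formed `"\u0915\u093c"` yields an empty list.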

Each of these instances might display correctly in some smart fonts, but unless they are "corrected", such locations could easily be missed by word searches because their diacritics are in a different order.

In addition to these ordering peculiarities, I have also detected 7 instances where a different letter is unexpectedly followed by a Nukta sign. These are the 4 letters and respective counts:

There is no composite letter which corresponds to any of these four.

Locating these was like looking for a needle in a haystack.
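A search like this can be automated. The sketch below derives, from the Unicode Character Database via `unicodedata`, the set of Devanagari base letters that have a precomposed letter-plus-Nukta counterpart, then counts every other letter that is directly followed by a Nukta. The function names are hypothetical. Note also that the derived set covers every Devanagari letter whose canonical decomposition ends in a Nukta, which may be broader than the set considered above.

```python
import unicodedata
from collections import Counter

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def bases_with_precomposed_form() -> set:
    """Base letters X for which some Devanagari character decomposes to X + NUKTA."""
    bases = set()
    for cp in range(0x0900, 0x0980):  # Devanagari block
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == "093C":
            bases.add(chr(int(parts[0], 16)))
    return bases


def unexpected_nukta_bases(text: str) -> Counter:
    """Count letters followed by a nukta that have no precomposed counterpart."""
    expected = bases_with_precomposed_form()
    counts = Counter()
    for prev, ch in zip(text, text[1:]):
        # Only letters (category Lo) are considered; nuktas after vowel signs
        # or viramas are a separate ordering problem.
        if ch == NUKTA and unicodedata.category(prev) == "Lo" and prev not in expected:
            counts[prev] += 1
    return counts
```

Running `unexpected_nukta_bases` over the whole text returns each unexpected base letter with its count.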