adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

Scripture references in RtoL scripts #41

Closed DavidHaslam closed 6 years ago

DavidHaslam commented 6 years ago

Working with the translator of an Urdu Bible, some important points have emerged about Scripture references.

To display the human-readable references correctly, judicious use of the RIGHT-TO-LEFT MARK (RLM, U+200F) is required.

See Marking references in Right to Left scripts.

To ensure that the human readable reference is correctly displayed in Bibles with a Right to Left script, the translator[s] may have made judicious use of the special Unicode character RIGHT TO LEFT MARK [RLM] (U+200F).

The RLM being invisible, its presence may easily go unnoticed, yet script developers need to be aware of it. The procedure to convert the human readable reference to the machine readable osisRef value must ensure that the RLM is deleted from the latter.

DavidHaslam commented 6 years ago

The screenshot from Xiphos 4.0.6a (Windows) illustrates a footnote in 2 Kings 12:4.

[Image: screenshot 2017-12-09 21 26 42]

Apart from the undesirable left alignment in the preview pane, the footnote is correctly displayed for being read from Right to Left.

DavidHaslam commented 6 years ago

Here's the note element in the OSIS source:

<note placement="foot"><reference type="annotateRef" osisRef="2Kgs.12.4">‏12‏:4</reference> <catchWord>مردم شماری کے ٹیکس: </catchWord>دیکھئے <seg type="x-nested"><reference osisRef="Exod.11.16-Exod.11.30">خروج 11‏:16‏-30</reference></seg> </note>

Here's the same Unicode text with the non-ASCII characters converted to numeric character references (NCRs).

<note placement="foot"><reference type="annotateRef" osisRef="2Kgs.12.4">&#x200F;12&#x200F;:4</reference> <catchWord>&#x0645;&#x0631;&#x062F;&#x0645; &#x0634;&#x0645;&#x0627;&#x0631;&#x06CC; &#x06A9;&#x06D2; &#x0679;&#x06CC;&#x06A9;&#x0633;: </catchWord>&#x062F;&#x06CC;&#x06A9;&#x06BE;&#x0626;&#x06D2; <seg type="x-nested"><reference osisRef="Exod.11.16-Exod.11.30">&#x062E;&#x0631;&#x0648;&#x062C; 11&#x200F;:16&#x200F;-30</reference></seg> </note>

Observe the four instances of &#x200F;, which is the RLM that I described above.
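Since the RLM is invisible in most editors, a small script can make it visible in the same way as the NCR listing above. The following is only a minimal sketch, not part of u2o; the function name `to_ncr` is hypothetical.

```python
def to_ncr(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal NCR.

    ASCII characters (digits, colon, hyphen, markup) pass through unchanged,
    so invisible marks such as U+200F become visible as &#x200F;.
    """
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


# The annotateRef text from the note above: RLM, "12", RLM, ":4".
sample = "\u200F12\u200F:4"
print(to_ncr(sample))
print("RLM count:", sample.count("\u200F"))
```

Running the same function over a whole note element and counting occurrences of `&#x200F;` is one way to check where the translator placed each RLM.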

adyeths commented 6 years ago

With regard to converting USFM to OSIS, I'm not sure that the presence of these marks matters.

The only place where they would be problematic is with creating osisRef attributes in references. (And I can strip them out easily enough in orefs.) Additionally, are all references in right to left languages going to be written like this:

            laterverse-verse:chapter bookname

If so, I will need to make adjustments in orefs to accommodate this format.

DavidHaslam commented 6 years ago

The key to understanding this is the exact placements of the RLM.

But yes, it's for your orefs.py where the possibility may be encountered.

I've been associated with CrossWire for over eight years. In all that time, AFAIK, nobody has made such observations in writing.

We don't have a huge number of RtoL Bibles, and even some of these have no notes with caller references or cross-reference notes.

I guess it's something that even the ParaTExt team may not have ever considered in detail.

The Arabic script has no COLON of its own in Unicode, even though it has its own COMMA, FULL STOP and SEMICOLON.

btw. One of the 354 notes in UrduGeo has the Arabic Comma as the verse,verse separator.

DavidHaslam commented 6 years ago

Even for RtoL scripts, the references are written

but the insertion of the RLMs makes the numerical parts look "back to front" to you and me.

BabelPad, the Unicode text editor for Windows developed by Andrew West, has features that help users see the order of the codepoints.

DavidHaslam commented 6 years ago

And of course, the Arabic Comma has no need for a RLM before it, as it's already a RtoL character.

DavidHaslam commented 6 years ago

One possible way to deal with (e.g.) this Urdu Bible would be to define two of the punctuation variables to include the RLM.

SEPM = "\u061B"  # separates multiple references (Arabic semicolon)
SEPC = "\u200F:"  # separates chapter from verse (RLM + colon)
SEPP = "\u060C"  # separates multiple verses or verse ranges (Arabic comma)
SEPR = "\u200F-"  # separates verse ranges (RLM + hyphen/minus)

This still leaves the observation that any annotateRef references would have an RLM before the chapter number, assuming the translator[s] managed to get things done right.

NB. As it happens, this particular project has (as yet) had no need for SEPM.

Thus to some extent, some of the adaptation can be readily done by the user.

This would leave the task of removing the RLM when an original reference is converted to the osisRef value for such annotateRef reference types.

Might it be worth defining a further variable that carries this property?

SEPA = "\u200F" # defines the start of an annotate reference (RLM)

For LtoR scripts, by comparison, this would just be the empty string.
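To illustrate how these variables might work together, here is a rough sketch that splits the numeric part of a reference using SEPC and SEPR as defined above, plus the proposed SEPA. This is only an assumption about how such variables could be consumed, not how orefs.py actually parses references; the function name `parse_numeric_reference` is hypothetical.

```python
SEPC = "\u200F:"  # separates chapter from verse (RLM + colon)
SEPR = "\u200F-"  # separates verse ranges (RLM + hyphen/minus)
SEPA = "\u200F"   # proposed: start of an annotate reference (RLM)


def parse_numeric_reference(ref, sepa=SEPA, sepc=SEPC, sepr=SEPR):
    """Return (chapter, first_verse, last_verse) from a reference string.

    Hypothetical helper: assumes the stored order is chapter, SEPC, verse,
    optionally followed by SEPR and a closing verse.
    """
    # Drop the leading annotate marker, if one is configured and present.
    if sepa and ref.startswith(sepa):
        ref = ref[len(sepa):]
    chapter, _, verses = ref.partition(sepc)
    first, _, last = verses.partition(sepr)
    # A single verse has no range part, so the range ends where it starts.
    return int(chapter), int(first), int(last or first)


print(parse_numeric_reference("\u200F12\u200F:4"))
print(parse_numeric_reference("11\u200F:16\u200F-30"))
# With LtoR settings, SEPA is the empty string.
print(parse_numeric_reference("12:4", sepa="", sepc=":", sepr="-"))
```

Because the RLM is consumed as part of each separator, the numbers that remain are plain ASCII digits and can go straight into an osisRef value.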

adyeths commented 6 years ago

No additional variables are needed. If the references are always written left to right regardless of language direction, then the only thing I would need to do is filter out the Unicode directional formatting characters when generating the osisRef attribute for the references. An easy thing to do.
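A minimal sketch of that kind of filter follows. It is not the actual change made in orefs.py; the function name `strip_bidi_controls` is hypothetical, and the character list is the set of Unicode directional formatting characters (LRM, RLM, ALM, the embedding and override controls, and the isolate controls).

```python
import re

# U+200E LRM, U+200F RLM, U+061C ALM,
# U+202A..U+202E (LRE, RLE, PDF, LRO, RLO),
# U+2066..U+2069 (LRI, RLI, FSI, PDI)
_BIDI_CONTROLS = re.compile("[\u200E\u200F\u061C\u202A-\u202E\u2066-\u2069]")


def strip_bidi_controls(text: str) -> str:
    """Remove directional formatting characters from a reference string."""
    return _BIDI_CONTROLS.sub("", text)


# Human-readable text keeps its marks; only the copy used to build
# the osisRef attribute is cleaned.
display_text = "\u200F12\u200F:4"
print(repr(strip_bidi_controls(display_text)))
```

Filtering only the copy used for the osisRef leaves the displayed reference text untouched, so the translator's RLM placement still controls how the reference is rendered.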

DavidHaslam commented 6 years ago

Great. Easy to do quite soon?

adyeths commented 6 years ago

Done.