adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

Parsing chapterless references? #36

Open DavidHaslam opened 6 years ago

DavidHaslam commented 6 years ago

The implementation in ParaTExt 8 for evaluating \xt treats references without book abbreviations as being to the same book. It treats numbers without a chapter / verse separator as being to that chapter in the current book (verseless reference).

There is no support for a chapterless reference (it would be ambiguous with verseless reference -- unless there was a regular syntax to specify a character or abbreviation for verse, like "v13" - which there is not right now).
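To make the two rules concrete, here is a minimal Python sketch of that resolution logic. It is not ParaTExt's code: the names `Ref` and `resolve`, the `cv_sep` parameter and its `:` default are assumptions for illustration, and ranges and lists are left out.

```python
import re
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    book: str
    chapter: int
    verse: Optional[int]  # None means a verseless (whole-chapter) reference


def resolve(text: str, current_book: str, cv_sep: str = ":") -> Ref:
    """Resolve one reference such as 'Ps 50:15', '5:15' or '13'."""
    pattern = re.compile(
        r"^(?:(?P<book>[1-4]?[^\W\d_]+)\s+)?"  # optional book abbreviation
        r"(?P<chapter>\d+)"                     # a bare number is a chapter
        r"(?:" + re.escape(cv_sep) + r"(?P<verse>\d+))?$"
    )
    match = pattern.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised reference: {text!r}")
    # No book abbreviation: the reference stays in the current book.
    book = match.group("book") or current_book
    verse = match.group("verse")
    return Ref(book, int(match.group("chapter")), int(verse) if verse else None)
```

Under these assumptions, `resolve("13", "GEN")` gives chapter 13 of the current book with no verse. That is exactly why a chapterless "verse 13" cannot be told apart from it without some extra marker.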

Some examples of ParaTExt 8 parsing for reference texts (including \io and \r and \xt) are shown in the attached image.

image

DavidHaslam commented 6 years ago

That image illustrated the use and potential problems very well.

This reminds me of something I observed earlier this year while processing references for a Polish Bible translation.

Chapterless verse references were prefixed with either "w. " for a single verse or "ww. " for a verse range or sequence. e.g.

    \x + \xt ww. 13-22.\x*
    \x + \xt w. 1.\x*
    \x + \xt ww. 13.19.28; Ps 50,15; Oz 5,15. \x*

This underscores the fact that the potential syntactical verse prefix is language specific, and therefore should be specified in the locale for the translation language.

Likewise, whether the translators use a period and a space after these two Polish abbreviations (equivalent in meaning to "verse" and "verses") is also a matter of choice!
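One possible way to handle this, sketched in Python: keep the verse prefixes in a per-locale table and treat the period and the space as optional. The `VERSE_PREFIXES` table and the `strip_verse_prefix` helper are hypothetical names, not part of u2o or ParaTExt.

```python
import re

# Hypothetical per-locale settings; "pl" reflects the Polish examples above.
VERSE_PREFIXES = {"pl": ("ww", "w")}


def strip_verse_prefix(text: str, locale: str):
    """Return (is_chapterless, remainder) for one reference string."""
    prefixes = VERSE_PREFIXES.get(locale, ())
    if not prefixes:
        return False, text.strip()
    # Longest prefix first, so "ww" is not read as "w" followed by "w".
    ordered = sorted(prefixes, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # The period and the space after the prefix are both optional, and a
    # digit must follow, so book names starting with "w" are not caught.
    pattern = re.compile(rf"^\s*(?:{alternatives})\.?\s*(?=\d)", re.IGNORECASE)
    match = pattern.match(text)
    if match is None:
        return False, text.strip()
    return True, text[match.end():].strip()
```

When the first value is true, the remainder would be resolved against the chapter in which the note occurs rather than parsed as a chapter number.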

Aside: That the punctuation for a sequence of verses can be the period rather than the usual comma was also a challenge! How Scripture references are punctuated varies from language to language, and sometimes even between Bible versions in the same language!
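For the punctuation point, a small Python sketch of treating the separators as settings rather than constants. `RefPunctuation`, `POLISH_EXAMPLE` and `split_verses` are made-up names, and the default values are only an assumption about a common English convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefPunctuation:
    verse_list_sep: str = ","  # between verses in a sequence
    range_sep: str = "-"       # between the ends of a verse range


# Settings matching the Polish samples above, where a period separates verses.
POLISH_EXAMPLE = RefPunctuation(verse_list_sep=".")


def split_verses(verse_text: str, punct: RefPunctuation):
    """Split '13.19.28' or '13-22' into a list of (first, last) verse pairs."""
    items = []
    for part in verse_text.split(punct.verse_list_sep):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing separator, e.g. a final period
        first, _, last = part.partition(punct.range_sep)
        items.append((int(first), int(last) if last else int(first)))
    return items
```

With the Polish settings, `split_verses("13.19.28", POLISH_EXAMPLE)` yields three single verses, while the same string under the defaults would fail to parse, which is the kind of mismatch that needs a per-project setting.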

DavidHaslam commented 6 years ago

A fair amount of language-level and project-level detail has to be captured: reference syntax, book names, etc.

ParaTExt does provide for this through a Scripture Reference Settings interface, as in the following example.

image2

There is also the Book Names tab (where Abbreviation, Short and Complete book names are specified), and a place to configure how xref origins (\xo) are expressed in the text.

All of this is input into:

  1. validating references, and
  2. generating a consistent, machine-readable inline form of all vernacular references when the text is exported to USX (XML).

The USFM specification does not explicitly define "rules" for what references mean (they vary so widely), but ParaTExt implements a specification so that they can be tested, and then exported to a standardized form.

DavidHaslam commented 6 years ago

I'm indebted to Jeff Klassen of UBS ICAP for providing answers to some of my questions.

I trust that the above information may help @adyeths to further develop orefs.py to cover these points.

DavidHaslam commented 6 years ago

It's not yet clear how USFM might be enhanced to support chapterless verse references.

Nevertheless, these are items that we've encountered in the real world, especially in translations that were edited outside the ParaTExt software environment.

I have added a suitable comment in issue 34 for USFM.

DavidHaslam commented 6 years ago

See also #43