ThomHehl / Moffatt

The Moffatt Bible

Recording page numbers in USFM and XML #9

Open DavidHaslam opened 6 years ago

DavidHaslam commented 6 years ago

I observe that the XML files contain a record of where a page break occurs together with the next page number, e.g.

<milestone type="pb" n="62"/>

There's no defined marker for this in USFM.

The best option would be to use the \rem tag to record the page number, thus:

\rem page 62

NB: Although USFM does have a \pb tag for an explicit page break, we're not trying to specify how future editions should be typeset. We're merely recording, for information only, where page breaks occurred in the original.

BTW, u2o.py converts each such remark into a milestone element, though not in exactly the same form as what you already have.
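To make the intended mapping concrete, here is a minimal Python sketch that turns a remark line of the form shown above into the milestone element from the XML files. It is not taken from u2o.py (which, as noted, produces a somewhat different milestone); the function name and the exact "\rem page N" pattern are assumptions for illustration.

```python
import re

# Assumed convention from this issue: a remark line such as "\rem page 62".
REM_PAGE = re.compile(r"^\\rem\s+page\s+(\d+)\s*$")


def rem_to_milestone(line):
    """Return a page-break milestone for a '\\rem page N' line, or None."""
    match = REM_PAGE.match(line)
    if match is None:
        return None  # not a page-number remark; leave the line alone
    return '<milestone type="pb" n="{}"/>'.format(match.group(1))


if __name__ == "__main__":
    print(rem_to_milestone(r"\rem page 62"))
```

Any line that is not a page-number remark returns None, so ordinary \rem comments would pass through untouched.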

DavidHaslam commented 6 years ago

See also #10 where the column break splits a word that has both a hyphen and a soft hyphen.

DavidHaslam commented 6 years ago

Although the \rem tag seemed sensible, it causes serious problems in USFM where the page break occurs in the middle of a verse, or even in the middle of a word!

For this reason, I switched to using a footnote marker \f - ... \f* to record page numbers and column breaks. This can be used mid-verse without much hassle.

The minus sign signifies that there is no caller symbol for the note. More commonly, USFM footnotes have a plus sign instead. Refer to the USFM 2.4 User Reference.
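As a rough illustration of why an inline note survives a mid-word break, the Python sketch below pulls such notes out of a line of USFM and records where each break fell. The note body ("page N" or "column N"), the sample text, and the function name are all assumptions; the issue only specifies the \f - ... \f* wrapper.

```python
import re

# Hypothetical note body: "page N" or "column N" inside "\f - ... \f*".
BREAK_NOTE = re.compile(r"\\f - (page|column) (\d+)\\f\*")


def split_breaks(usfm_text):
    """Return (text without break notes, [(kind, number, offset), ...]).

    Each offset is a position in the cleaned text, so a break that fell
    in the middle of a word can still be located after the note is gone.
    """
    breaks = []
    parts = []
    pos = 0
    clean_len = 0
    for m in BREAK_NOTE.finditer(usfm_text):
        part = usfm_text[pos:m.start()]
        parts.append(part)
        clean_len += len(part)
        breaks.append((m.group(1), int(m.group(2)), clean_len))
        pos = m.end()
    parts.append(usfm_text[pos:])
    return "".join(parts), breaks


if __name__ == "__main__":
    # Made-up sample line, not quoted from the Moffatt text.
    print(split_breaks(r"\v 3 the wilder\f - page 62\f*ness"))
```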

However, there may be a problem when the USFM footnote occurs before chapter 1 of any book.

cmahte commented 6 years ago

Is the intended use in digital form to be able to link to a physical page of the print edition?

A cross-reference marker might be a slightly more appropriate hack than a footnote. When processed, it automatically generates a link target designed to be linked to from elsewhere, whereas footnotes generate links designed only to link out.

I suggest using the form

\x - \xo (current verse) \xt (current verse) \xta Page number \x*

and possibly using a different form, \ex ... \ex* or \fe ... \fe* instead of \x ... \x*, to be able to keep these page numbers separate from visible cross references, if they exist.

You might also read about the explicitly marked notes as an option:

\f (and \ef, \ex, \fdc, \fe, \x, \xdc, \xnt, \xot ....) all have three forms for the marker argument:

  1. + = software-created marker, visible inline
  2. - = no marker visible inline
  3. (anything else) = the text given here is the visible marker that appears inline

However, read literally, the spec says '(singular) character'. In practice, I've noted some exceptions, and 2-character markers have been used. The argument is therefore delimited by the space and might take more than 2 characters.

The problem with explicit markers in most SFM processors is that the code may not accept multiple digits, and will not accept a space.

\x 254 \xt ... \x* : the 54 might not be recognised.
\x Page 254 \xt ... \x* : will present a P or a Page and almost certainly error on the digits.
\x Page.254 \xt ... \x* : everything after the P might not be recognised.
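The space-delimited reading of the marker argument can be sketched in Python as below. This only illustrates the three forms described here; it is not how any particular SFM processor behaves, and the function name and the short list of note markers in the pattern are assumptions.

```python
import re

# Note opening: marker, whitespace, then the caller up to the next space.
# Only a few of the note markers mentioned above are listed here.
NOTE_OPEN = re.compile(r"\\(f|fe|ef|x|ex)\s+(\S+)\s")


def caller_kind(usfm_note):
    """Classify the caller argument of a USFM note opening."""
    m = NOTE_OPEN.match(usfm_note)
    if m is None:
        raise ValueError("not a note opening: {!r}".format(usfm_note))
    caller = m.group(2)
    if caller == "+":
        return "generated"  # software-created visible caller
    if caller == "-":
        return "hidden"  # no caller shown inline
    return "explicit:" + caller  # literal caller text


if __name__ == "__main__":
    for note in (r"\x 254 \xt ...\x*", r"\x Page 254 \xt ...\x*"):
        print(note, "->", caller_kind(note))
```

With this reading, "254" and "Page.254" come through whole, but "Page 254" is cut at the space, so only "Page" is treated as the caller and "254" falls into the note content.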

(Sent by email in reply to David Frank Haslam's comment, notification dated Fri, Jan 5, 2018 at 9:55 AM.)

DavidHaslam commented 6 years ago

The page and column breaks in the OSIS should not be links.

OSIS 2.1.1 defines a milestone type for these.

The USFM tags for remarks do at least convert to milestone elements, albeit not in the prescribed format.

However, USFM footnotes or cross-references convert to note elements in OSIS.

Using USFM was not part of Thom's original digitisation plan. I suggested USFM as an expedient because

  1. it's easier to write, and
  2. there is a straightforward conversion script called u2o.py

Also, it has since become apparent that Thom's OSIS for the first 12 books digitised has errors and other inconsistencies.

I would have preferred to use a general milestone tag in USFM, but that does not yet exist.

Even so, it's something I had already proposed to ICAP for USFM 3.1 (too late for 3.0).

DavidHaslam commented 6 years ago

I should add that my switch from \rem to using footnote markers is definitely a "kludge".

The problem with \rem is that it has to be on a line of its own in USFM.

It's not a character level marker. You can't use it conveniently in the middle of a word!

Yet that's where column breaks and page breaks can occur.

After converting USFM to OSIS, further postprocessing of these will be required.
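One possible shape for that postprocessing step, as a Python sketch using the standard library's ElementTree: replace each note whose only content is "page N" with a page-break milestone, keeping the text that followed the note. The note content format and the function name are assumptions, and namespaces are left out for brevity; real OSIS elements are namespaced, so the tag comparison would need adjusting.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical note content produced from the USFM kludge: "page N".
PAGE = re.compile(r"^\s*page\s+(\d+)\s*$")


def notes_to_milestones(root):
    """Replace <note>page N</note> with <milestone type="pb" n="N"/> in place."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "note" or len(child) > 0:
                continue  # not a note, or a note with nested markup
            m = PAGE.match(child.text or "")
            if m is None:
                continue  # an ordinary footnote; leave it untouched
            milestone = ET.Element("milestone", {"type": "pb", "n": m.group(1)})
            milestone.tail = child.tail  # keep the text that followed the note
            parent[index] = milestone
    return root


if __name__ == "__main__":
    # Made-up sample, not quoted from the Moffatt text.
    verse = ET.fromstring("<verse>the wilder<note>page 62</note>ness</verse>")
    print(ET.tostring(notes_to_milestones(verse), encoding="unicode"))
```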

The advantage of \rem is that it is allowed before any tags that determine displayed content. You can even have it before the \mt1 book title. And, unsurprisingly, that's where the first page number occurs in Genesis.

But that does not mitigate the bigger problem of mid-word page or column breaks.