adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

Critical XML errors in output #85

Closed: DavidHaslam closed this issue 5 years ago

DavidHaslam commented 5 years ago

I just used u2o.py to convert the USFM files for a Belarusian translation.

Something is critically wrong in the XML output for Psalms.

Example:

<verse sID="Ps.10</title>.1" osisID="Ps.10</title>.1" n="1" />На Госпада спадзяюся; як жа вы кажаце душы маёй: "<transChange type="added">Як </transChange> птушка ўзьляці на гару вашу"?
<verse eID="Ps.10</title>.1" />
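A quick way to confirm that output like this is not well-formed is to run it through an XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `first_xml_error` and the two sample fragments are illustrative assumptions, not part of u2o.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return None if xml_text is well-formed, else a short error description.

    Hypothetical helper for illustration; it is not part of u2o.py.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None


# Fragments modelled on the verse milestone quoted above (wrapped in a
# made-up <div> so that each sample is a complete document).
bad = '<div><verse sID="Ps.10</title>.1" osisID="Ps.10</title>.1" n="1" /></div>'
good = '<div><verse sID="Ps.10.1" osisID="Ps.10.1" n="1" /></div>'

for label, sample in (("bad", bad), ("good", good)):
    print(label, first_xml_error(sample))
```

A raw `<` is not allowed inside an attribute value, so the parser stops at the first damaged milestone and reports its position.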

Attached Zip files provided for testing.

USFM

19-psalm.txt.usfm.zip

OSIS

Psalms.Bela.osis.zip

Notes:

DavidHaslam commented 5 years ago

I have therefore also requested clarification from UBSICAP regarding the placement of \rem in context.

See issue 83.

DavidHaslam commented 5 years ago

Further symptoms:

<chapter sID="Ps.25</title>" osisID="Ps.25</title>" n="25</title>" />

In this instance, the </title> artefact appears inside the attribute values of the chapter milestone itself!

There were 24 such insertions in addition to those within verse milestones.
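Counts like this can be reproduced with a small script. The sketch below tallies chapter and verse milestone tags that contain a stray `</title>`; it counts affected tags rather than individual insertions, so its totals may not line up exactly with the figure above. The name `count_damaged_milestones` and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

# A milestone tag, allowing ">" to appear inside double-quoted attribute
# values (which is exactly what the stray </title> introduces).
MILESTONE = re.compile(r'<(chapter|verse)\b(?:[^>"]|"[^"]*")*>')


def count_damaged_milestones(osis_text):
    """Count chapter/verse milestone tags that contain a stray </title>."""
    counts = Counter()
    for match in MILESTONE.finditer(osis_text):
        if "</title>" in match.group(0):
            counts[match.group(1)] += 1
    return counts


# Sample lines modelled on the fragments quoted in this issue.
sample = "\n".join([
    '<chapter sID="Ps.25</title>" osisID="Ps.25</title>" n="25</title>" />',
    '<verse sID="Ps.10</title>.1" osisID="Ps.10</title>.1" n="1" />',
    '<verse eID="Ps.10</title>.1" />',
    '<verse sID="Ps.9.2" osisID="Ps.9.2" n="2" />',
])
print(count_damaged_milestones(sample))
```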

DavidHaslam commented 5 years ago

Even if the </title> artefacts are all removed, the XML is still not right.

Thus the critical XML error is not merely the insertion of the </title> artefact in the wrong places, but the omission of </title> closures in several places.
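One rough way to spot the missing closures is to compare the number of opening and closing title tags once the stray artefacts have been removed from the attributes. This is only a sketch under that assumption; `title_balance` is a hypothetical helper, and a plain count cannot say which title is left open.

```python
import re


def title_balance(osis_text):
    """Return (opening, closing) counts of <title> tags in osis_text.

    Assumes stray </title> artefacts inside attribute values have already
    been removed, otherwise they inflate the closing count.
    """
    # Opening tags only: skip self-closing <title ... /> forms.
    opening = len(re.findall(r"<title\b[^>]*(?<!/)>", osis_text))
    closing = osis_text.count("</title>")
    return opening, closing


# Modelled on the acrostic title from Ps.9 quoted in this issue
# (verse text shortened to a placeholder).
unclosed = '<title type="acrostic">Алэф\n<verse sID="Ps.9.2" osisID="Ps.9.2" n="2" />text'
closed = '<title type="acrostic">Алэф</title>\n<verse sID="Ps.9.2" osisID="Ps.9.2" n="2" />text'

print(title_balance(unclosed))
print(title_balance(closed))
```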

It looks as if the problem first arises in connection with the acrostic titles in Ps.9:

<title type="acrostic">Алэф
<verse sID="Ps.9.2" osisID="Ps.9.2" n="2" />Славіць буду <transChange type="added">Цябе </transChange> , Госпадзе, усім сэрцам маім, абвяшчаць пра ўсе цуды Твае,
<verse eID="Ps.9.2" />

DavidHaslam commented 5 years ago

Even if the \rem lines are all removed from the SFM file for Psalms, the resulting OSIS file still has XML errors.
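For anyone repeating this test, the `\rem` lines can be stripped with a few lines of Python. This is a sketch that assumes each `\rem` remark occupies a whole line of its own; `strip_rem_lines` is a made-up name, and the sample text is not taken from the attached file.

```python
import re

# A line whose first marker is exactly \rem (not, say, a longer marker name).
REM_LINE = re.compile(r"\s*\\rem(\s|$)")


def strip_rem_lines(usfm_text):
    """Return usfm_text with every whole-line \\rem remark removed."""
    kept = [
        line
        for line in usfm_text.splitlines(keepends=True)
        if not REM_LINE.match(line)
    ]
    return "".join(kept)


# Illustrative input only.
sample = "\\c 9\n\\rem a translator's remark\n\\v 1 text\n"
print(strip_rem_lines(sample))
```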

USFM (edited)

19-psalm.txt.sfm.zip

OSIS

PsalmsX.Bela.osis.zip

DavidHaslam commented 5 years ago

Looks like the absence of anything that would trigger the start of a poetry line group now causes the incorrect processing of acrostic titles.

See screenshot of WinMerge to compare what happens after I inserted \q after the first such title.
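To find the other places where the same thing would happen, the USFM can be scanned for acrostic titles that are not followed by a poetry marker. The sketch below assumes the acrostic titles are marked with `\qa` in the source (the issue only shows the resulting OSIS) and treats `\q`, `\q1`..`\q4`, `\qm`, `\qr` and `\qc` as poetry line markers; `acrostic_titles_without_poetry` is a hypothetical helper.

```python
import re

ACROSTIC = re.compile(r"\\qa(\s|$)")
# Poetry line markers: \q, \q1.., \qm, \qm1.., \qr, \qc (but not \qa).
POETRY = re.compile(r"\\q(?:m?\d?|r|c)(?:\s|$)")


def acrostic_titles_without_poetry(usfm_text):
    """Return 1-based line numbers of \\qa lines not followed by a poetry marker."""
    lines = [line.strip() for line in usfm_text.splitlines()]
    hits = []
    for index, line in enumerate(lines):
        if not ACROSTIC.match(line):
            continue
        # The next non-blank line decides whether a poetry group starts here.
        following = next((l for l in lines[index + 1:] if l), "")
        if not POETRY.match(following):
            hits.append(index + 1)
    return hits


# Illustrative input: the first title goes straight into a verse,
# the second is followed by \q.
sample = "\\c 9\n\\qa Aleph\n\\v 2 text\n\\qa Beth\n\\q\n\\v 4 text\n"
print(acrostic_titles_without_poetry(sample))
```

Each reported line is a candidate for the manual `\q` insertion described above.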

DavidHaslam commented 5 years ago

Significant observations:

adyeths commented 5 years ago

This had nothing to do with \rem lines. Issue is fixed.