adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

Mishandling of the acrostic heading \qa tag #52

DavidHaslam closed 6 years ago

DavidHaslam commented 6 years ago

Using the latest u2o.py downloaded yesterday, I just encountered a new critical bug.

Psalm 119 acrostic tags \qa were not properly closed in the XML by </title>.

<chapter sID="Ps.119" osisID="Ps.119" n="119" />
<!-- cl --><milestone type="x-chapterLabel" n="SALMO 119" />
<title type="acrostic">ALEF.
<verse sID="Ps.119.1" osisID="Ps.119.1" n="1" />Bienaventurados los perfectos de camino: los que andan en la ley de Jehová.
<verse eID="Ps.119.1" />

Likewise for each stanza.

Psalm 120 ended up being badly corrupted with </title> inserted multiple times within each <chapter sID ... /> and <verse sID ... /> element.

<verse sID="Ps.119.176" osisID="Ps.119.176" n="176" />Yo me perdí, como oveja que se pierde: busca a tu siervo, porque no me he olvidado de tus mandamientos.
<verse eID="Ps.119.176" />
<chapter eID="Ps.119" />
<chapter sID="Ps.120</title>" osisID="Ps.120</title>" n="120</title>" />
<!-- cl --><milestone type="x-chapterLabel" n="SALMO 120" />
<!-- d --><title type="psalm" canonical="true">Canción de las gradas.
<verse sID="Ps.120</title>.1" osisID="Ps.120</title>.1" n="1" />A JEHOVÁ llamé estando en angustia; y él me respondió.
<verse eID="Ps.120</title>.1" />
<verse sID="Ps.120</title>.2" osisID="Ps.120</title>.2" n="2" />Jehová, escapa mi alma del labio mentiroso: de la lengua engañosa.
<verse eID="Ps.120</title>.2" />
<verse sID="Ps.120</title>.3" osisID="Ps.120</title>.3" n="3" />¿Qué <transChange type="added">te</transChange> dará a ti, o qué te añadirá la lengua engañosa?
<verse eID="Ps.120</title>.3" />
<verse sID="Ps.120</title>.4" osisID="Ps.120</title>.4" n="4" /><transChange type="added">Es como</transChange> saetas de valiente agudas con brasas de enebros.
<verse eID="Ps.120</title>.4" />
<verse sID="Ps.120</title>.5" osisID="Ps.120</title>.5" n="5" />¡Ay de mí que peregrino en Mesec: habito con las tiendas de Cedar!
<verse eID="Ps.120</title>.5" />
<verse sID="Ps.120</title>.6" osisID="Ps.120</title>.6" n="6" />Mucho se detiene mi alma con los que aborrecen la paz.
<verse eID="Ps.120</title>.6" />
<verse sID="Ps.120</title>.7" osisID="Ps.120</title>.7" n="7" />Yo <transChange type="added">soy</transChange> pacífico; y cuando hablo, ellos guerrean.</title>
<verse eID="Ps.120</title>.7" />
<chapter eID="Ps.120</title>" />

Afterwards, Psalm 121 continues OK.
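A quick way to catch this kind of breakage before running a full OSIS validation is a plain well-formedness check. The following is only a minimal sketch using Python's standard library; the helper name `is_well_formed` is hypothetical and not part of u2o, and the `fixed` string is an assumption about what the corrected output would look like. It does not check the fragment against the OSIS schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a dummy root."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
    except ET.ParseError:
        return False
    return True


# Shortened fragment in the shape reported above: the acrostic title is never closed.
broken = (
    '<title type="acrostic">ALEF.\n'
    '<verse sID="Ps.119.1" osisID="Ps.119.1" n="1" />Bienaventurados los perfectos de camino.\n'
    '<verse eID="Ps.119.1" />\n'
)

# Assumed corrected form, with the title closed on the same line.
fixed = broken.replace("ALEF.\n", "ALEF.</title>\n")

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

The same check also rejects the corrupted Psalm 120 lines, because a literal `<` is not allowed inside an XML attribute value.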

DavidHaslam commented 6 years ago

The same USFM files contain numerous chapter label tags \cl CAPITULO nn.

Another critical bug is that when these are converted to XML milestone elements, the element is not terminated with />.

<chapter sID="Gen.1" osisID="Gen.1" n="1" />
<!-- cl --><milestone type="x-chapterLabel" n="CAPITULO 1
<verse sID="Gen.1.1" osisID="Gen.1.1" n="1" />En el principio creó Dios los cielos y la tierra.

This is also a new bug. Earlier versions of u2o.py used to handle this tag correctly.

The title bug also affects the descriptive Psalm titles after processing the \d tags.

<chapter sID="Ps.3" osisID="Ps.3" n="3" />
<!-- cl --><milestone type="x-chapterLabel" n="SALMO 3" />
<!-- d --><title type="psalm" canonical="true">Salmo de David, cuando huía de delante de Absalom su hijo.

adyeths commented 6 years ago

Without seeing the usfm source I am not going to be able to do anything to fix this.

DavidHaslam commented 6 years ago

Both bugs are already present in release 0.6 which I just tried as a cross-check, to make sure that it's not due to one of your subsequent commits.

Unfortunately, I didn't retain a copy of my download from November 2017.

The attached Zip file contains my set of USFM files for your debugging.

USFM.zip

I myself generated these files from the text files supplied by my contact.

They have not been checked with any Bible translation editing software, but they are fairly simple in structure. They contain no non-standard markers.

DavidHaslam commented 6 years ago

FIO: USFM tag statistics for the concatenated data.

merged.usfm.tags.count.usfm.zip

DavidHaslam commented 6 years ago

Please let me know if you require any further information.

adyeths commented 6 years ago

This happened because the reflow routine in u2o doesn't handle text without paragraph/poetry markers correctly. It may take me a while to fix this.

DavidHaslam commented 6 years ago

Understood. Essentially this is a Verse Per Line Bible version. i.e. I plan to include Feature=NoParagraphs in the SWORD module configuration.

The only few places that use \p in this translation are the colophons at the end of each of the 14 Pauline Epistles.

As an interim workaround, I could insert \p immediately before each \v 1 and confirm whether that suppresses these errors.
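Purely as an illustration of that interim workaround, here is a minimal Python sketch. The function name `add_paragraph_markers` is hypothetical, and the sketch assumes each `\v 1` starts its own line, as in a verse-per-line file. It is a temporary preprocessing hack on a copy of the USFM text, not a fix for u2o itself.

```python
import re


def add_paragraph_markers(usfm: str) -> str:
    r"""Insert a \p line before every line that begins with \v 1."""
    # (?m) makes ^ match at each line start; the lookahead requires whitespace
    # after the 1, so \v 10, \v 11 and so on are left untouched.
    return re.sub(r"(?m)^(?=\\v 1\s)", lambda match: "\\p\n", usfm)


sample = "\\c 1\n\\v 1 En el principio\n\\v 2 Y la tierra\n\\v 10 Y llamó Dios\n"
print(add_paragraph_markers(sample))
```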

adyeths commented 6 years ago

Please don't use hacks to work around bugs in u2o. Let me fix it. Further testing showed it's more than just a bug in the handling of texts without paragraph/poetry markers.

DavidHaslam commented 6 years ago

Adding \p as proposed only solves the issues with \d and \cl tags.

It didn't fix the issue with \qa tags, albeit the first one was correct.

I guess that what's needed as a workaround is to mark each verse in Psalm 119 as poetry. Or at least, the first verse in each of the 22 stanzas.

DavidHaslam commented 6 years ago

Thanks for further advice. It was only a temporary hack in a conversion script (actually a bespoke TextPipe filter). Easy to revert.

FIO: Marking each stanza in Psalm 119 as either poetry or paragraph was indeed a successful workaround.

Trying this was also useful for me to confirm that there were no further XML syntax errors or OSIS validation failures elsewhere in the file. That's good for peace of mind and planning next steps.

i.e. It gave me confidence that once you've succeeded in fixing this, then the task of module building should be fairly straightforward.

adyeths commented 6 years ago

Ok, should be fixed now.

DavidHaslam commented 6 years ago

Thanks, Ryan.

Downloaded it and retested it with the unhacked USFM files. The OSIS file now validates.

Best regards,

David