adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

XML pretty print and SWORD modules made from OSIS #27

Closed DavidHaslam closed 6 years ago

DavidHaslam commented 6 years ago

Practical question about XML pretty print

In theory at least, different amounts of whitespace in XML files should not matter.

When an OSIS XML file is "pretty print" formatted, what effect might this have on the SWORD module made from the file, compared to the module made from the same source text that is not in "pretty print" format?

Can this lead (e.g.) to an unwanted space appearing between a word in the text and the superscripted footnote tag?

Even if this is more a SWORD issue rather than one for u2o.py, I'm still keen to learn from you.

adyeths commented 6 years ago

Pretty printing shouldn't make a difference with regard to extra whitespace preceding some tags. I doubt that the SWORD importing utility is introducing any extra whitespace where it doesn't already exist in the output from u2o. What is likely happening is that this whitespace is present in the OSIS files generated by u2o. (My script doesn't currently remove some whitespace that could be removed… for instance, preceding a closing paragraph tag.)

There are bugs in SWORD's osis2mod utility that can cause problems with modules generated from u2o's output. I don't believe this is one of them, though.

adyeths commented 6 years ago

Added some code to remove the extra whitespace that may precede a note… as well as the extra space that may precede some closing tags.
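
For readers following along, here is a minimal sketch of the kind of cleanup described above. It is not the code that was added to u2o; the function name `trim_whitespace` and the `CLOSING_TAGS` list are assumptions made up for illustration. Note that it leaves whitespace after a note untouched.

```python
import re

# Hypothetical list of closing tags; the tags u2o actually handles
# are not shown in this thread.
CLOSING_TAGS = ("p", "l", "item")

def trim_whitespace(osis):
    """Remove whitespace before <note> and before selected closing tags."""
    # Drop whitespace immediately before an opening <note ...> tag.
    osis = re.sub(r"\s+(?=<note\b)", "", osis)
    # Drop whitespace immediately before the selected closing tags.
    closing = "|".join(CLOSING_TAGS)
    return re.sub(rf"\s+(?=</(?:{closing})>)", "", osis)

sample = '<p>keyword <note type="x">n</note> more text </p>'
print(trim_whitespace(sample))
# <p>keyword<note type="x">n</note> more text</p>
```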

DavidHaslam commented 6 years ago

Thanks for that.

Please would you also have a think about spurious whitespace getting inserted after (e.g.) a note element when the next item of text happens to be a punctuation mark.

i.e. Maybe keyword¹ ; could be displayed instead of keyword¹;

NB. Please bear in mind that in some languages, certain punctuation marks expect to have a preceding space.

In other words, what's required should faithfully reflect what's in the USFM file, and not just apply a "general fix".

adyeths commented 6 years ago

Since I have no knowledge of any languages other than English, the only possible fix I'd be able to make for a space after a note but preceding a punctuation mark would be a general fix. I'd need far more language information than I currently have before I could attempt a more proper fix.

adyeths commented 6 years ago

Are you sure that u2o is adding the extra space around notes? Or is the extra space already present in the usfm files? I reviewed the code for u2o to make sure I wasn't adding extra space around notes. And nothing I see in the code should do that.

DavidHaslam commented 6 years ago

My further question was more hypothetical than based on observations.

I've not yet attempted to make use of the XML Pretty Print option in u2o.py.

I was wondering whether </note>; more verse text might get changed to

      </note>
      ; more verse text

such that by the time it's in the module, the EOL is treated as whitespace and displays accordingly as a space before the (e.g.) semicolon.
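
A small sketch of the scenario, using Python's standard `xml.etree.ElementTree` rather than lxml or SWORD. It assumes the renderer collapses each run of whitespace to a single space, which is an assumption about display and not a statement about osis2mod. The helper name `displayed_tail` and the `<verse>` wrapper are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

def displayed_tail(xml, tag="note"):
    """Return the text that follows the first <tag> child, with whitespace
    runs collapsed to one space (an assumed rendering rule)."""
    element = ET.fromstring(xml).find(tag)
    return re.sub(r"\s+", " ", element.tail or "")

inline = "<verse>keyword<note>n</note>; more verse text</verse>"
wrapped = "<verse>keyword<note>n</note>\n      ; more verse text</verse>"

print(repr(displayed_tail(inline)))   # '; more verse text'
print(repr(displayed_tail(wrapped)))  # ' ; more verse text'
```

If a pretty printer did break the line after the note, the line break would become part of the text following the note, and would show up as a space before the semicolon under that rendering rule.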

What do you think?

adyeths commented 6 years ago

I don't see any reason for something like that to happen with notes.

DavidHaslam commented 6 years ago

Another hypothetical question about pretty print.

Suppose before pretty print the file contained either

</note> <transChange type="added">.....</transChange>

or

</note> <divineName>Lord</divineName>

is there any possibility that the significant space immediately after the </note> might be squashed?

Aside: I have seen this occur with the pretty print option of the XML Tools add-on for Notepad++.

NB. Until I get round to installing lxml in CygWin on my PC, this is a feature of u2o that I've not had a chance to test.

adyeths commented 6 years ago

No real effort is made to make the xml look good unless pretty printing is enabled.

Arbitrarily removing spaces both before and after a note is a very bad idea. It would most definitely cause a lot of problems.

DavidHaslam commented 6 years ago

Indeed.

I'm making perhaps the unwarranted assumption that one tool to pretty print XML works roughly the same as another.

The instance I observed with XML Tools was after a two stage change:

  1. Linarize [sic] XML
  2. Pretty print (XML only - with line breaks)

I'm making the assumption that the spaces vanished during step 1.

All I'm suggesting is that you keep an eye out in case lxml does anything similar when pretty print is commanded.

i.e. That > < might be squashed to >< in places where this is definitely a bad idea.
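
One way to keep an eye out for this is to compare a file before and after any reformatting step. The sketch below uses only the standard library and does not call lxml; `INLINE_TAGS` is a hand-picked, assumed set of inline elements taken from the examples in this thread, and `changed_inline_spacing` is a hypothetical helper name. It assumes both inputs have the same element structure.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: inline elements whose trailing whitespace is significant.
INLINE_TAGS = {"note", "transChange", "divineName"}

def _norm(text):
    """Collapse whitespace runs to a single space; None becomes ''."""
    return re.sub(r"\s+", " ", text or "")

def changed_inline_spacing(before_xml, after_xml):
    """List (tag, before_tail, after_tail) for inline elements whose
    trailing text differs once whitespace runs are collapsed."""
    before = ET.fromstring(before_xml).iter()
    after = ET.fromstring(after_xml).iter()
    changes = []
    for b, a in zip(before, after):
        if b.tag in INLINE_TAGS and _norm(b.tail) != _norm(a.tail):
            changes.append((b.tag, b.tail, a.tail))
    return changes

original = "<p>word<note>n</note> <divineName>Lord</divineName></p>"
squashed = "<p>word<note>n</note><divineName>Lord</divineName></p>"

print(changed_inline_spacing(original, squashed))
# [('note', ' ', None)]
```

An empty list means no spacing after the listed inline elements changed; any entry is a candidate for manual review, since it covers both a squashed `> <` and a space newly inserted before punctuation.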