dariok / pdf2tei

mixed content not handled correctly #3

Open lb42 opened 2 years ago

lb42 commented 2 years ago

A run like this

<run top="755" left="194" width="486" height="16" size="15" rendition="#f17">
  <i xmlns="">by a</i> CLIENT<i xmlns=""> whom  the</i> LADY<i xmlns=""> and</i> GENTLEMAN<i xmlns=""> join.</i>
</run>

becomes

<run top="755" left="194" width="486" height="16" size="15" rendition="#f17">
  <hi rend="italics">by a whom  the and join.</hi> CLIENT LADY GENTLEMAN
</run>

when processed by pt3.xsl

lb42 commented 2 years ago

The problem is with the template for `*:i`, which includes `<xsl:apply-templates select="node() | following-sibling::*[self::*:i]/node()"/>`, the effect of which is to destroy the order of the components inside mixed content. As a temporary fix, I changed this to a simpler select that no longer pulls in the following siblings. This means we may have multiple consecutive `<hi>` elements, which should of course be combined.
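One possible way to combine directly adjacent italic elements without disturbing the order of mixed content is XSLT 2.0 grouping. The following is only a minimal sketch, not the code in pt3.xsl: it assumes that the italic elements are children of a `run` element, that the output element is `<hi rend="italics">` as in the example above, and it leaves out namespace handling for the generated `hi` (a real stylesheet would declare the appropriate default namespace).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything not handled below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: italics occur as children of a run element.
       The children are walked in document order, and only
       directly adjacent i elements end up in the same group. -->
  <xsl:template match="*:run">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:for-each-group select="node()"
                          group-adjacent="boolean(self::*:i)">
        <xsl:choose>
          <!-- A group of one or more adjacent i elements:
               emit a single hi holding all of their content. -->
          <xsl:when test="current-grouping-key()">
            <hi rend="italics">
              <xsl:apply-templates select="current-group()/node()"/>
            </hi>
          </xsl:when>
          <!-- Text and other nodes: pass through in place. -->
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With the run shown at the top of this issue, every `i` is separated from the next by text, so each one would become its own `hi` and CLIENT, LADY and GENTLEMAN would stay in their original positions. Note that a whitespace-only text node between two `i` elements also ends a group, so such pairs are not merged by this sketch.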

lb42 commented 2 years ago

A similar problem shows up when we have an italic sequence which spans two runs, like this: `<l level="0" left="208" top="773" size="17" bottom="790" right="687">

Latour. I now ask you to get out of my house. (In fury)`

This becomes `Latour. (In I now ask you to get out of my house.fury)`