daisy / pipeline-modules-common

!! NOTE: This project is now part of the pipeline-modules project !! | Generic utility modules for the DAISY Pipeline 2
GNU Lesser General Public License v3.0

px:fileset-store seems to add trailing whitespace when indent="true" #69

Open josteinaj opened 10 years ago

josteinaj commented 10 years ago

See nlbdev/nordic-epub3-dtbook-migrator#94

josteinaj commented 10 years ago

Input:

<h1 id="h1_4">1 Introduction: Standpoint Theory as a Site of Political, Philosophic, and Scientific Debate</h1>

Storing the input with this:

<d:file method="xhtml" encoding="utf-8" indent="true" version="1.0" media-type="application/xhtml+xml" omit-xml-declaration="false" href="EPUB/DTB09004-05-chapter.xhtml"/>

Stores it like this:

<h1 id="h1_4">1 Introduction: Standpoint Theory as a Site of Political, Philosophic, and Scientific
         Debate
      </h1>

While storing the input with this:

<d:file method="xhtml" encoding="utf-8" indent="false" version="1.0" media-type="application/xhtml+xml" omit-xml-declaration="false" href="EPUB/DTB09004-05-chapter.xhtml"/>

Stores it like this:

<h1 id="h1_4">1 Introduction: Standpoint Theory as a Site of Political, Philosophic, and Scientific Debate</h1>

This only occurs for long texts, and it doesn't seem to be limited to headlines. I suspect it has to do with the serialization performed by p:store in Calabash (or one of Calabash's dependencies).

rdeltour commented 10 years ago

I'm failing to see the real issue here: when you set indent="true", you essentially leave it to the processor to apply its serialization rules, which conform to the XSLT and XQuery Serialization spec. Note also the more recent 3.0 version, which is not yet referenced by XProc.

Indenting XML (or whatever) typically means that you add whitespace characters.

Given that an HTML user agent will strip and collapse whitespace, what's wrong with the use case above? Do you have a rendering issue?
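
To make the whitespace-collapsing point concrete, here is a minimal Python sketch. It only approximates what a user agent does inside a block element with default whitespace handling, it is not how any particular browser implements it, and the helper name `collapse_whitespace` is made up for this illustration. The two strings are the text content of the h1 from the report, with and without the indentation added on store.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Rough approximation of HTML whitespace collapsing in a block element.

    Runs of spaces, tabs and line breaks become a single space, and
    whitespace at the start and end of the block is dropped.
    (Hypothetical helper, not part of any Pipeline or Calabash API.)
    """
    return re.sub(r"[ \t\r\n\f]+", " ", text).strip()


# Text content of the h1 as it was given in the input.
original = (
    "1 Introduction: Standpoint Theory as a Site of Political, "
    "Philosophic, and Scientific Debate"
)

# Text content of the h1 as stored with indent="true".
indented = (
    "1 Introduction: Standpoint Theory as a Site of Political, "
    "Philosophic, and Scientific\n"
    "         Debate\n"
    "      "
)

# The raw strings differ, but they match once whitespace is collapsed.
print(indented == original)
print(collapse_whitespace(indented) == original)
```

Under these assumptions the rendered text would be the same, although tools that compare the raw text content would still see a difference.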

rdeltour commented 10 years ago

Mmm, on further reading of the serialization spec, it says that:

Whitespace MUST NOT be added other than before or after an element, or adjacent to an existing whitespace character.

which would mean there is indeed a bug. AFAIK Calabash delegates to Saxon's serializer, so it would be interesting to check with the latest versions of both and report the issue if needed.

josteinaj commented 3 years ago

This is still an issue. I tried removing the custom pretty-printing XSLT in the nordic migrator, but it seems it is still needed.

bertfrees commented 3 years ago

Thanks for checking. I created an XProcSpec test.

bertfrees commented 3 years ago

@josteinaj Is it an option for you to not use method="xhtml"?

bertfrees commented 3 years ago

For HTML I think this result might be correct because spaces at the end of blocks are not rendered. If this also happens with inline elements, there is a problem though.

EDIT: OK I tried with this example:

<h1><span>1 Introduction: Standpoint Theory as a Site of Political, Philosophic, and Scientific Debate</span>.</h1>

It results in:

<h1>
   <span>1 Introduction: Standpoint Theory as a Site of Political, Philosophic, and Scientific
      Debate
   </span>.
</h1>

With method="xml" (and media-type="application/xhtml+xml") we get:

<h1>
   <span>1 Introduction: Standpoint Theory as a Site of Political, Philosophic, and Scientific Debate</span>.</h1>

josteinaj commented 3 years ago

We need to use method="xhtml" because not all HTML elements can be self-closing. With method="xml" we'd end up with

<div epub:type="pagebreak" title="1"/>

instead of

<div epub:type="pagebreak" title="1"></div>
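
The same difference can be reproduced outside the Pipeline. The sketch below uses Python's standard `xml.etree.ElementTree`, not Saxon or Calabash, so it is only an analogy for the `method` setting discussed here. The `epub:type` attribute is left out so that no namespace declaration is needed.

```python
import xml.etree.ElementTree as ET

# An empty div, similar to the page break marker above
# (epub:type omitted to keep the example free of namespaces).
div = ET.fromstring('<div title="1"/>')

# XML serialization collapses the empty element into a self-closing tag.
as_xml = ET.tostring(div, encoding="unicode", method="xml")

# HTML serialization keeps an explicit end tag for non-void elements.
as_html = ET.tostring(div, encoding="unicode", method="html")

print(as_xml)
print(as_html)
```

An HTML-only parser treats `<div .../>` as an opening tag, which is why the explicit `</div>` matters for such readers.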

bertfrees commented 3 years ago

OK I see. So this is an issue of being compatible with HTML-only readers?

There's not much we can do about this apart from filing a bug report with Saxon. The issue doesn't appear to be listed in the change log, but let's try with Saxon 10 first.