daisy / pipeline

Super-project that aggregates all Pipeline-related code, provides a common tracker for Pipeline-related issues, and holds the Pipeline website
http://daisy.github.io/pipeline

epub3-to-epub3 strips doctype from XHTML documents in input EPUB 3 fileset #613

Closed martinpub closed 2 years ago

martinpub commented 3 years ago

Expected Behavior

epub3-to-epub3 preserves doctype from XHTML documents in input EPUB 3 fileset.

Actual Behavior

epub3-to-epub3 strips doctype from XHTML documents in input EPUB 3 fileset.

Steps to Reproduce

  1. Take an input EPUB 3 in which a content document or nav.xhtml contains <!DOCTYPE html>
  2. Run dp2 epub3-to-epub3 --source source.epub --data source.zip --output outputdir/ --tts false --braille false --sentence-detection false --update-lang-attributes true --update-identifier-in-content-docs true --update-title-in-content-docs true --metadata sample_metadata.xml
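To confirm the behavior, the input and output EPUBs can be compared with a small script. The following is a minimal sketch, not part of Pipeline: the function name `xhtml_without_doctype` is hypothetical, it assumes content documents use the `.xhtml` extension, and it only inspects the first 1024 bytes of each file, where the doctype would normally appear.

```python
import re
import zipfile

# Matches any "<!DOCTYPE html" declaration, case-insensitively.
DOCTYPE_RE = re.compile(rb"<!DOCTYPE\s+html\b", re.IGNORECASE)


def xhtml_without_doctype(epub):
    """Return the sorted names of .xhtml entries that lack an HTML doctype.

    `epub` is a path or a binary file-like object holding a zipped EPUB.
    """
    missing = []
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            # Assumption: content documents and nav use the .xhtml extension.
            if name.lower().endswith(".xhtml"):
                head = zf.read(name)[:1024]
                if not DOCTYPE_RE.search(head):
                    missing.append(name)
    return sorted(missing)
```

Running it on `source.epub` and then on the zipped result shows which documents lost their doctype. If the output directory holds an unzipped fileset instead, the same check would need to walk the directory rather than a zip archive.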

Details

Since XHTML 5.0 is already explicitly specified in EPUB 3, I'm not sure whether the HTML 5 doctype is strictly needed. However, I'm uncertain whether its absence could cause errors in reading systems or processing tools that rely on HTML 5 parsing.

Environment

Logs


bertfrees commented 3 years ago

Thanks for the report. It seems XProc does not automatically add the doctype when storing documents and it is currently not possible to set the "html-version" serialization parameter.
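Until the serializer emits the doctype itself, one possible workaround is to post-process the stored files. The sketch below is an assumption-laden illustration and not the fix that was applied in Pipeline: the helper `ensure_html5_doctype` is hypothetical, and it operates on the serialized text rather than on the XProc serialization options.

```python
import re

# An XML declaration is only valid at the very start of the document.
XML_DECL_RE = re.compile(r"^<\?xml[^>]*\?>")
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+html\b", re.IGNORECASE)


def ensure_html5_doctype(text):
    """Return `text` with <!DOCTYPE html> added if no HTML doctype is present."""
    # Only look near the top, where a doctype would have to appear.
    if DOCTYPE_RE.search(text[:1024]):
        return text
    decl = XML_DECL_RE.match(text)
    if decl:
        # Insert the doctype on its own line right after the XML declaration.
        return text[:decl.end()] + "\n<!DOCTYPE html>" + text[decl.end():]
    return "<!DOCTYPE html>\n" + text
```

The function leaves documents that already carry a doctype unchanged, so it can be applied to every XHTML file in the output without checking first.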

martinpub commented 2 years ago

Thanks for the fix @bertfrees!