daisy / pipeline-modules

Modules for the DAISY Pipeline project

html-to-epub3 should output XHTML files with `<!DOCTYPE html>` #77

Open bertfrees opened 8 months ago

bertfrees commented 8 months ago

Issue reported by Tom McCartney:

In running ePubCheck, I've been getting a recurring error in every project: the DocType is reported as invalid and the actual content file as "Not well formed". My source content has a full XHTML Transitional DocType, complete with name and DTD reference, and I use the html-to-epub3 conversion to create the actual ePub content. When I run ePubCheck, it's not happy with the "Transitional" DocType. After reading up on DocTypes again (I haven't looked at the specs on that particular item in quite a while), I see that for HTML 5 the valid DocType is simply `<!DOCTYPE html>`, which looked completely wrong to me without the name and DTD. I see now that it's valid, and I'll try to figure out how to get my XSLT to output an HTML 5 DocType. But it seems odd to me that html-to-epub3 allows the Transitional XHTML DocType, and that the ePub spec appears to specify only valid XHTML, while ePubCheck explicitly wants HTML 5. I would argue that one or the other should change so that both give the same answer. That's part of the question I have here.
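As a stopgap for the situation described in the report, the doctype can be rewritten after the XSLT step. The following is a minimal sketch in Python, not what Pipeline itself does: the function name `set_html5_doctype` is hypothetical, and the regular expression assumes the existing doctype has no internal subset (no `[...]` block).

```python
import re

# Matches a DOCTYPE declaration without an internal subset, for example
# <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "...dtd">
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\s[^>\[]*>", re.IGNORECASE)
# Matches a leading XML declaration plus any whitespace around it.
_XML_DECL_RE = re.compile(r"\s*<\?xml[^>]*\?>\s*")

HTML5_DOCTYPE = "<!DOCTYPE html>"


def set_html5_doctype(text: str) -> str:
    """Hypothetical helper: replace the first DOCTYPE with <!DOCTYPE html>,
    or insert one if the document has none."""
    if _DOCTYPE_RE.search(text):
        return _DOCTYPE_RE.sub(HTML5_DOCTYPE, text, count=1)
    # No doctype present: it must come after the XML declaration, if any.
    decl = _XML_DECL_RE.match(text)
    if decl:
        return text[:decl.end()] + HTML5_DOCTYPE + "\n" + text[decl.end():]
    return HTML5_DOCTYPE + "\n" + text
```

Calling `set_html5_doctype` on the serialized XHTML string leaves the rest of the document untouched, so it can be applied to each content file before packaging.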

GrayWolfMT commented 8 months ago

This may be a Windows-related issue. Looking at the output of both the html-to-epub3 and epub3-to-epub3 jobs used to convert from HTML through to ePub with audio, I see a couple of "Error" messages at the end of the output, even though the resulting ePub is produced successfully. The error is "SetDoctype: px:set-doctype failed to read from ..." followed by a full path to a file (I'll include screenshots from each step).

I was able to check and the files actually exist in the location used by the ePub Enhancement script, so the path is correct. I was unable to find the folder referenced by the HTML to ePub script, since the job was removed after processing.

[Screenshots: EpubEnhancementMsg, HtmlToEpubMsg]

bertfrees commented 8 months ago

@GrayWolfMT It does look like it might be a Windows issue. I'm going to test it on a VM. In the meantime, if it isn't too much work for you, could you send me the full log of the conversion?

bertfrees commented 8 months ago

I can't reproduce the "SetDoctype" issue, perhaps because I'm not using the same input file and job options as you are.

But anyway, I don't think the fact that the doctype is not set to `<!DOCTYPE html>` is specific to Windows after all. It seems this just isn't something that Pipeline does at the moment.
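To see which content documents in a generated EPUB carry something other than `<!DOCTYPE html>`, a quick check along these lines may help. It is only a sketch: the function names are hypothetical, content documents are assumed to be UTF-8 files ending in `.xhtml` or `.html`, and it makes no claim about what ePubCheck accepts beyond reporting the doctype it finds.

```python
import re
import zipfile

# Captures everything between "<!DOCTYPE" and the closing ">".
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)


def find_doctype(text):
    """Return the doctype body with whitespace normalized, or None if absent."""
    match = _DOCTYPE_RE.search(text)
    return " ".join(match.group(1).split()) if match else None


def non_html5_documents(epub):
    """Hypothetical helper: map each (X)HTML entry in the EPUB whose doctype
    is not exactly `html` to the doctype found (None when there is none).

    `epub` may be a file path or a binary file-like object.
    """
    result = {}
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html")):
                text = archive.read(name).decode("utf-8", errors="replace")
                doctype = find_doctype(text)
                if doctype is None or doctype.lower() != "html":
                    result[name] = doctype
    return result
```

An empty result means every content document found in the archive starts with the HTML 5 doctype.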

bertfrees commented 5 months ago

I spoke too soon. It seems we do do something about the doctype, just not in the place I expected: https://github.com/daisy/pipeline-modules/commit/78708687b4051a07bb832d6de8f73750d81da262.