Sigil-Ebook / PageEdit

ePub XHTML Visual Editor
GNU General Public License v3.0
245 stars 28 forks source link

Bug standalone use of PageEdit handle encoding and xml header missing #13

Closed ghost closed 4 years ago

ghost commented 4 years ago

I have built Sigil Version 1.1.0 and PageEdit (GitTag= 1.1.1) from source without errors in a Fedora 31 Linux environment. Sigil is performing fine and dealing with my epubs ok. Not so sure about PageEdit. To demo the problem, I have created simple html and css files, 12 lines each, and moved them to a test directory ~/HTML-TEST. These files are in test.zip attached.

In the bin directory of the cloned PageEdit Repo I run ... ./pageedit ~/HTML-TEST/Test.html

and PageEdit appears and reports an error concerning style info as per screen capture attached.

In the ~/HTML-TEST directory I run ... ~/GIT/PageEdit/build/bin/pageedit ./Test.html and get the same error.

Of course, Chrome, Firefox and SigilPreviewPane all display the contents correctly. Can you give any guidance as to how I may best proceed ~ thanks ~ JP

test.zip PageEditScreenCapture

dougmassay commented 4 years ago

It's not valid xhtml or html5. Sigil (and Chrome and Firefox) automatically correct it for you. PageEdit does not.

If you want stuff like this to be automatically corrected, you'd be better off launching PageEdit from Sigil (via the open with External HTML editor).

dougmassay commented 4 years ago

Actually, I take that back, it IS valid HTML5. My apologies.

dougmassay commented 4 years ago

It's the encoding. PageEdit doesn't seem to like opening files that aren't utf-8 encoded. Once again, though, this is something that Sigil will correct.

Though PageEdit probably should be changed to accommodate things like this since it can technically be used to edit standalone html.

dougmassay commented 4 years ago

I'm wrong again. It's the lang="en" parameter that it doesn't like.

ghost commented 4 years ago

Hi Doug, thanks for your prompt response. I removed the lang="en" and indeed the little html test file popped up nicely. I then went onto view my entire book (in html form) and PageEdit works well. I am on a new problem now (I cannot get the simplest epub to work). However that is a different topic so I think it is right that I should close this ticket now. Thanks once again for your help ~ JP

dougmassay commented 4 years ago

To be fair, we probably should look into letting PageEdit handle the language parameter in the HTML5 html tag. It's valid code, and PageEdit is not strictly an epub editor (though it it will probably most often be used for that).

kevinhendricks commented 4 years ago

He had no doctype but we still tried loading the file as xhtml, not html. Either way It should not barf on any extra attributes. Not sure what is going on here at all.

I will look into it.

BTW, gumbo requires utf-8 to work so we should definitely make sure the xhtml files are utf-8 encoded when opened. In Sigil this is always the case.

Kevin

On Feb 20, 2020, at 3:52 PM, Doug Massay notifications@github.com wrote:

 To be fair, we probably should look into letting PageEdit handle the language parameter in the HTML5 html tag. It's valid code, and PageEdit is not strictly an epub editor (though it it will probably most often be used for that).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

dougmassay commented 4 years ago

The sample he included does have a doctype. It's just not showing when the "xml file" error message is shown in PageEdit. As for the encoding, perhaps PageEdit is already handling that? I was mistaken when I said the Windows 1252 encoding was the issue. PageEdit opened the 1252 encoded file OK once I removed the lang parameter.

kevinhendricks commented 4 years ago

I looked into this. In PageEdit we specify it is to be loaded as xhtml/xml but there is no specification as to which version of xml in his file. In Sigil our xml header line specifies version "1.0" before the doctype). Without that, things will then default to the latest which is 1.1 and in version 1.1 the bare lang attribute is no longer valid and will cause an error. In even more recent versions of the xml spec, support for it was put back.

So simply adding the proper xml header line should make this issue go away.

kevinhendricks commented 4 years ago

Yes, we need to copy the Misc/HTMLEncodingResolver.cpp and .h into PageEdit and force its use on the first loading of any text file to convert it to utf-8 properly. Right now, Sigil handles that for us but we really need to add that to PageEdit sources as well.

I will do that for the next release. I will also change the code to force adding the proper xml header and version too.

I am going to reopen this to help remind me of these bugs to be fixed.

ghost commented 4 years ago

Thank you to Kevin and Doug for all the extra background work on this topic ~ I leave it up to to you guys to close the ticket again when you are satisfied all is well. Also (as part of the closure) is there any chance you could leave some advice on how users can get extra debug info from PageEdit. (Eg would it possible to run PageEdit from command line with an extra parameter "--debug")

dougmassay commented 4 years ago

Also (as part of the closure) is there any chance you could leave some advice on how users can get extra debug info from PageEdit. (Eg would it possible to run PageEdit from command line with an extra parameter "--debug")

We have to release a special debug version of PageEdit to get get any valuable info from the running program. On Linux, there's nothing you could enable (command-line parameter, environment variable, etc...) that would cause more detail to be printed to stdout. There's nothing a user could have done in this situation (other than present us with symptoms/samples) to get any extra info from PageEdit.

kevinhendricks commented 4 years ago

Okay added in HTMLEncodingResolver to handle non-utf8 encodings, and used GumboInterface to convert any html to xhtml and to properly set the xml header to the correct version. So this should now be fixed in master. Closing.