Better handling of HTML to XHTML

dteviot commented 7 years ago

A number of HTML pages are not correctly converted to XHTML, e.g.

http://skythewood.blogspot.com/ has <o:p> nodes. I think they're supposed to be <p> nodes, but no 'o' namespace is declared.
http://novelplanet.com/Novel/10-nen-goshi-no-HikiNiito-o-Yamete-Gaishutsushitara-Jitaku-goto-Isekai-ni-Ten-ishiteta/c6-part9?id=195323 (On investigation, has same problem as skythewood.)
https://royalroadl.com/fiction/5288/how-to-avoid-death-on-a-daily-basis/chapter/69854/73-night-of-the-living-zombers (This has a text node holding a "Form Feed" (value 0x0c) character. Not to be confused with line feed.)
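The Form Feed case can be illustrated with a small sketch. XML 1.0 only allows tab, line feed and carriage return below 0x20, so a form feed in a text node makes the XHTML ill-formed. The function name `clean_xml_text` is hypothetical, and Python is used purely for illustration; this is not the plug-in's actual code.

```python
import re

# Characters that are NOT allowed by the XML 1.0 "Char" production.
ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def clean_xml_text(text: str) -> str:
    """Make a text node safe for XML 1.0 (hypothetical helper)."""
    # HTML treats form feed as whitespace, so map it to a space
    # rather than gluing the neighbouring words together.
    text = text.replace("\x0c", " ")
    # Drop any other character XML 1.0 forbids (NUL, other C0 controls, ...).
    return ILLEGAL_XML_CHARS.sub("", text)
```

Tab, line feed and carriage return pass through unchanged, so normal line breaks in the chapter text are not affected.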

http://gravitytales.com/novel/so-pure-so-flirtatious/spsp-chapter-105 has malformed attributes e.g.

<p style="box-sizing: inherit; margin: 0px 0px 1em; outline: 0px !important; padding: 0px; color: rgb(68, 73, 80); font-family: " trebuchet="" ms",="" "helvetica="" neue",="" helvetica,="" tahoma,="" sans-serif;="" font-size:="" 14px;="" font-style:="" normal;="" font-variant-ligatures:="" font-variant-caps:="" font-weight:="" 400;="" letter-spacing:="" orphans:="" 2;="" text-align:="" start;="" text-indent:="" 0px;="" text-transform:="" none;="" white-space:="" widows:="" word-spacing:="" -webkit-text-stroke-width:="" background-color:="" rgb(255,="" 255,="" 255);="" text-decoration-style:="" initial;="" text-decoration-color:="" initial;"="">-- So Pure, So Flirtatious is a novel translated on Gravity Tales. Please visit:&nbsp;<a href="http://gravitytales.com/novel/so-pure-so-flirtatious" rel="noreferrer noopener" tabindex="0" target="_blank" title="http://gravitytales.com/novel/so-pure-so-flirtatious" style="box-sizing: inherit; background: 0px 0px; text-decoration: none; outline: 0px !important;">http://gravitytales.com/novel/so-pure-so-flirtatious</a></p>
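One possible way to cope with markup like this is to drop every attribute whose name is not a legal XML name before serializing. The sketch below assumes the (name, value) pairs have already been extracted by whatever HTML parser is in use; `keep_valid_attributes` is a hypothetical name, and the pattern is an ASCII-only approximation of the XML name rules, not the full production.

```python
import re

# ASCII-only approximation of an XML attribute name, with an optional prefix.
# Names such as 'ms",' or 'font-size:' from the example above do not match.
XML_NAME = re.compile(
    r"[A-Za-z_][A-Za-z0-9_.\-]*(?::[A-Za-z_][A-Za-z0-9_.\-]*)?\Z"
)

def keep_valid_attributes(attrs):
    """Return only the (name, value) pairs whose name is usable in XHTML."""
    return [(name, value) for name, value in attrs if XML_NAME.match(name)]
```

Junk attributes with legal names (such as `trebuchet=""`) survive this filter. They are meaningless but do not break XML parsing. Prefixed names also pass, although their prefix would still need a namespace declaration.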

It's also annoying that the user only becomes aware of a problem when the EPUB reader faults on the page. The fix probably needs to include the following:

[x] Have plug-in validate the generated XHTML and warn when there's a problem.
[ ] Come up with a more reliable way to convert. Note: a short-term fix may be to examine the failure cases and see if there's a common issue that can be fixed, before writing our own HTML to XHTML converter.
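As a rough illustration of the validation item, a well-formedness check can be as small as running the generated XHTML through an XML parser and reporting the first error. This sketch uses Python's standard library only to show the idea; `xhtml_problem` is a hypothetical name, and a browser extension would presumably use the browser's own XML parser instead.

```python
import xml.etree.ElementTree as ET

def xhtml_problem(xhtml: str):
    """Return a description of the first well-formedness error, or None."""
    try:
        ET.fromstring(xhtml)
    except ET.ParseError as err:
        # The message includes the line and column of the offending token.
        return str(err)
    return None
```

This catches the failure cases listed above (undeclared namespace prefixes, illegal control characters, undefined entities such as `&nbsp;`), but it only checks well-formedness. It does not validate against the XHTML schema the way EpubCheck does.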

toshiya44 commented 7 years ago

Yep. They're supposed to be just <p> nodes. Apparently MS Word does that when a doc is converted to HTML. https://stackoverflow.com/questions/7808968/what-do-op-elements-do-anyway

As for the <o:p></o:p> pairs that appear inside paragraphs, they can be safely deleted.
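A minimal sketch of that cleanup, working on the raw markup with regular expressions: empty <o:p></o:p> pairs are deleted, and any remaining <o:p> tags are unwrapped so their content is kept. Unwrapping rather than renaming to <p> is an assumption made here to avoid nesting a <p> inside another <p>; the function name is hypothetical.

```python
import re

# An <o:p> element containing nothing but whitespace or &nbsp;.
EMPTY_OP = re.compile(r"<o:p\b[^>]*>(?:\s|&nbsp;)*</o:p>", re.IGNORECASE)
# Any remaining opening or closing <o:p> tag.
OP_TAG = re.compile(r"</?o:p\b[^>]*>", re.IGNORECASE)

def strip_office_paragraphs(html: str) -> str:
    """Delete empty <o:p> pairs and unwrap the rest (keeps their content)."""
    html = EMPTY_OP.sub("", html)
    return OP_TAG.sub("", html)
```

Usage: `strip_office_paragraphs('<p>Hello<o:p></o:p></p>')` yields `'<p>Hello</p>'`.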

dteviot commented 7 years ago

@typhoon71, @toshiya44, @dreamer2908

Latest commit to Experimental Tab Branch should now generate EPUB v3 files if you check the "Create EPUB 3" advanced option. I hope this will solve the problem with sometimes not being able to convert HTML into valid XHTML. (EPUB 3 uses HTML 5 instead of XHTML.) This requires your EPUB reader to support EPUB 3, but hopefully by now most do. Please try it out and let me know how well the EPUB 3 works with your readers. Thanks.

toshiya44 commented 7 years ago

To be honest I'm not very knowledgeable about EPUB3. I'm using this Sigil plugin, which is supposed to contain EpubCheck 4.0.2, in order to check errors.

Calibre officially doesn't endorse EPUB3 for various reasons, so the editor part of Calibre is not very specialized for it. Please see this thread. However, the viewer has no issues with rendering EPUB3 (this was also discussed in that thread).

URL: https://royalroadl.com/fiction/5288/how-to-avoid-death-on-a-daily-basis/chapter/69854/73-night-of-the-living-zombers

Errors: In the OPF, the image is listed as "image/jpeg" even though it has a PNG extension (apparently the image is actually a BMP, no clue what's going on). There's also an error notice for the "Form Feed" (value 0x0c) character that you mentioned in the issue.

URL: https://skythewood.blogspot.ca/2017/07/F15.html

Error: EpubCheck complains about a name attribute ( <a name="more"></a> ).

By the way, according to this Stack Overflow thread, there doesn't seem to be any difference between the name and id attributes in the context of EPUB. So wouldn't it be fine to replace name with id? I've seen name attributes used as ids on WordPress sites as well.
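If that replacement turns out to be acceptable, it could look roughly like the sketch below, which rewrites name to id on <a> tags only. It assumes quoted attribute values and drops name when the tag already has an id; `name_to_id` is a hypothetical helper, and it does not check that the resulting ids are unique within the document.

```python
import re

A_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
NAME_ATTR = re.compile(r"""\sname\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)
ID_ATTR = re.compile(r"\sid\s*=", re.IGNORECASE)

def name_to_id(html: str) -> str:
    """Turn <a name="x"> into <a id="x">, leaving other tags untouched."""
    def fix(match):
        tag = match.group(0)
        if ID_ATTR.search(tag):
            # The anchor already has an id, so just drop the name attribute.
            return NAME_ATTR.sub("", tag, count=1)
        return NAME_ATTR.sub(lambda n: " id=" + n.group(1), tag, count=1)
    return A_TAG.sub(fix, html)
```

With the example from EpubCheck, `name_to_id('<a name="more"></a>')` gives `'<a id="more"></a>'`.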

dteviot commented 7 years ago

@toshiya44 Firstly, thanks very much for the prompt response.

Errors: In the OPF, the image is listed as "image/jpeg" even though it has a PNG extension (apparently the image is actually a BMP, no clue what's going on).

I assume you're referring to: https://cdn.royalroadl.com/mooderino/6edd9796-3cb7-434c-a5ad-bb7dece2967a.png. It's listed as "image/jpeg" in the OPF because, when the file was fetched from the web server, the "content-type" in the HTTP response was "image/jpeg", so that's what went into the OPF file. Images don't always have extensions, so I was relying on the content-type. That said, this doesn't seem to cause any problems with the reader.
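Since neither the extension nor the content-type is reliable here, one alternative would be to look at the first few bytes of the image and only fall back to the content-type header when the signature is unknown. This is a sketch of that idea, not what the plug-in currently does; `guess_media_type` is a hypothetical name and the signature list is deliberately short.

```python
# Well-known file signatures ("magic numbers") for common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"BM", "image/bmp"),
]

def guess_media_type(data: bytes, content_type: str) -> str:
    """Prefer the file signature; fall back to the HTTP content-type."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    # Strip parameters such as "; charset=..." and normalize case.
    return content_type.split(";")[0].strip().lower()
```

For a file like the one above, a BMP payload served as "image/jpeg" would be reported as "image/bmp".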

There's also an error notice for the "Form Feed" (value 0x0c) character that you mentioned in the issue.

Um, yes, I haven't fixed the warnings yet. That said, the EPUB viewer previously would not show the chapter because of the Form Feed character. Now it shows the HTML without a problem.

dteviot / WebToEpub

Better handling of HTML to XHTML #118