dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

Baka-Tsuki - epubcheck errors #32

Open dreamer2908 opened 8 years ago

dreamer2908 commented 8 years ago
  1. Images can be embedded in B-T stories as inline images instead of thumbnails. The resulting XHTML will be (slightly) invalid when WebToEpub encounters this type of image: a div tag ends up inside a p tag.
    Example: All non-gallery images here: Utsuro no Hako:Volume 1
    Resulting XHTML for the first image:
    <p><div class="svg_outer svg_inner"><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" height="100%" width="100%" version="1.1" preserveAspectRatio="xMidYMid meet" viewBox="0 0 1368 1000"><image xlink:href="../Images/0006_Utsuro_no_..._vol1_pic1.jpg" height="1000" width="1368"/><desc>https://www.baka-tsuki.org/project/index.php?title=File:Utsuro_no_Hako_vol1_pic1.jpg</desc></svg></div> </p>
    Epubcheck error message:
    ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0000_Novel_Illustrations.xhtml(2,34): element "div" not allowed here; expected the element end-tag, text or element "a", "abbr", "acronym", "applet", "b", "bdo", "big", "br", "cite", "code", "del", "dfn", "em", "i", "iframe", "img", "ins", "kbd", "map", "noscript", "ns:svg", "object", "q", "samp", "script", "small", "span", "strong", "sub", "sup", "tt" or "var" (with xmlns:ns="http://www.w3.org/2000/svg")
  2. WebToEpub doesn't convert the deprecated u tag (underline) into a form suitable for EPUB.
    <p>Normal <u>underline</u></p> should become <p>Normal <span style="text-decoration: underline;">underline</span></p>
    Sample: same as above.
    Epubcheck error message:
    ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0001_Prologue.xhtml(4,85): element "u" not allowed anywhere; expected the element end-tag, text or element "a", "abbr", "acronym", "applet", "b", "bdo", "big", "br", "cite", "code", "del", "dfn", "em", "i", "iframe", "img", "ins", "kbd", "map", "noscript", "ns:svg", "object", "q", "samp", "script", "small", "span", "strong", "sub", "sup", "tt" or "var" (with xmlns:ns="http://www.w3.org/2000/svg")
  3. Invalid ids in span tags inside h* tags are not fixed, like <h3><span class="mw-headline" id="1st_time">1<sup>st</sup> time</span></h3>
    Epubcheck error message:
    ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0002_1st_time.xhtml(1,497): value of attribute "id" is invalid; must be an XML name without colons
    Side note: BTE-GEN converts it into <h3 id="1st_time">, but that still leaves the id invalid, so it isn't useful here.

Well, some more, but I lost the samples.

BTE-GEN moves headings up if higher levels are missing, i.e. h2 to h1 and h3 to h2 if there's no h1. Can this be considered?

In the list of references (translator's notes) on the B-T web pages, the link that jumps back up to where the reference is used shows only a single ↑ symbol. The same is true of BTE-GEN's output. In WebToEpub's output, it becomes Jump up ↑. If you remove the elements with the cite-accessibility-label class, the Jump up text will stop popping up out of nowhere.

Full disclosure: I'm developing my own (not easy-to-use) Baka-Tsuki to epub converter, which is for freaks like me, and not for normal users at all.

dteviot commented 8 years ago

Thanks for the list of special cases.

On Wed, Jun 29, 2016 at 4:04 PM, dreamer2908 notifications@github.com wrote:


Well, some more, but I lost the samples.

  • center tag isn't allowed in epub either: <center>text</center> should become an element styled with text-align: center.

  • align attribute in p/span/div should be converted into the CSS style text-align.


dreamer2908 commented 8 years ago

One more (potential) issue, well, if you have free time to play with.

Images, in both thumbnail and inline form, can have a custom target link rather than linking to the image page. I haven't seen anyone using it in B-T, so it's not really important.

Example: User_talk:Dreamer2908. WebToEpub breaks pretty badly.

dteviot commented 8 years ago

Yup. That's one of the two key problems I'm currently trying to solve for https://github.com/dteviot/WebToEpub/issues/9. Hopefully I'll have it solved by the end of the weekend.


belldandu commented 8 years ago

Well @dreamer2908

  1. Check whether the div's parent is a p; if so, move the div out and remove the p tag.
  2. Easy to fix.
  3. This happens because a digit is the first character of the ID. There is no really "easy" way to fix this. Having a span tag inside a header tag is perfectly valid. The problem is that epubcheck and epub readers do not like a digit as the first character of an ID. We could check the first character of every element's ID and, if it is a digit, prepend a prefix to the ID (so that epubcheck doesn't go derp). However, I'm not sure how dteviot would like this approach, and I'm kind of against it, mostly because of CPU cycles.
  4. Center. same as 2.
  5. align. same as 2.
  6. custom target link. relatively easy actually. Although the real question is should we ignore / remove these custom links or follow them.
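
The ID check described in item 3 could look roughly like the following. This is a minimal sketch in Python for illustration only, not WebToEpub's actual (JavaScript) code; the function name `make_valid_id` and the `id_` prefix are hypothetical, and the pattern is a simplified, ASCII-only approximation of "an XML name without colons".

```python
import re

# Simplified ASCII-only approximation of an XML name without colons:
# a letter or underscore first, then letters, digits, '.', '-' or '_'.
_VALID_ID = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def make_valid_id(raw_id, prefix="id_"):
    """Return raw_id unchanged if it already looks valid, else a repaired copy."""
    if _VALID_ID.fullmatch(raw_id):
        return raw_id
    # Replace characters that are not allowed anywhere in the name.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", raw_id)
    # The id was invalid, so prepend the (hypothetical) prefix; this also
    # covers ids whose first character is a digit, '-' or '.'.
    return prefix + cleaned

print(make_valid_id("1st_time"))
```

Any hyperlink that points at a renamed id would need the same treatment, otherwise in-document links break.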

@dteviot

belldandu commented 8 years ago

@dteviot Will you be doing this or should I? Also it would be nice if I could self-assign myself to certain issues.

dteviot commented 8 years ago

@belldandu

Will you be doing this or should I?

If you want to do it, that's fine with me. As I've said, at the moment I'm trying to get "use URL to specify cover image" working. I think that's the highest-gain item on the list currently.

Also it would be nice if I could self-assign myself to certain issues.

Fine with me. Tell me what I need to do to give you the rights and I'll get it done.

belldandu commented 8 years ago

I should be able to if I have contributor rank, @dteviot.

dteviot commented 8 years ago

@belldandu you should be a contributor now. If not, please let me know.

dreamer2908 commented 8 years ago

Well, I'll just leave this here.

WebToEpub v0.0.8 encounters a parsing error on this page: Utsuro_no_Hako:Volume2_May_2 (and Utsuro_no_Hako:Volume2, which includes it).

Screenshot: https://i.imgur.com/guVoRXM.png

belldandu commented 8 years ago

@dteviot I spelled that wrong; it's collaborator. @dreamer2908 I'm looking into that.

dteviot commented 8 years ago

@belldandu try this: https://github.com/dteviot/WebToEpub/invitations

belldandu commented 8 years ago

@dteviot there we go.

dteviot commented 8 years ago

@dreamer2908

WebToEpub v0.0.8 encounters a parsing error on this page: Utsuro_no_Hako:Volume2_May_2 (and Utsuro_no_Hako:Volume2, which includes it).

D'oh! Fixed. My apologies for not noticing this sooner.

dreamer2908 commented 8 years ago

@dteviot Thanks. I completely forgot about this.

I checked out version 0.0.0.14, and it indeed no longer throws errors. But I noticed something strange: texts in part "May 2nd (Saturday) 00:31" are all italic in Baka-Tsuki, but in the generated epub, only the last sentence is italic.

dteviot commented 8 years ago

@dreamer2908

I noticed something strange: texts in part "May 2nd (Saturday) 00:31" are all italic in Baka-Tsuki, but in the generated epub, only the last sentence is italic.

That's odd. I'll add investigating to my ToDo list.

dteviot commented 8 years ago

@dreamer2908

I'm looking at fixing this issue

> Images can be embedded in B-T stories in form of inline images instead of thumbnails. The result xhtml code will be (slightly) invalid if WebToEpub encounters this type of images: div tag is inside p

This occurs because I'm wrapping the <svg> element in a <div class="svg_outer svg_inner">. I'm wrapping it in a <div> so that a style is applied to the <div>.

div.svg_outer {
   display: block;
   margin-bottom: 0;
   margin-left: 0;
   margin-right: 0;
   margin-top: 0;
   padding-bottom: 0;
   padding-left: 0;
   padding-right: 0;
   padding-top: 0;
   text-align: left;
}
div.svg_inner {
   display: block;
   text-align: center;
}

The reason I'm doing this is because Lord Simon told me to do this. (He's the one who wrote BTE-GEN.)

An obvious fix (to me) would be to not have a wrapping <div> tag and apply the style directly to the <svg> element. For that matter, I'm also puzzled why there's both a svg_outer and a svg_inner style.

Anyway, my knowledge of CSS is extremely limited (as you've probably guessed from my statements above), so I'm hoping you could tell me WHY Lord Simon told me to do this, and why the changes I've suggested would be a bad idea. Failing that, can you point me in the direction of some good CSS documentation?

Thanks for your time.

dteviot commented 8 years ago

@dreamer2908

I noticed something strange: texts in part "May 2nd (Saturday) 00:31" are all italic in Baka-Tsuki, but in the generated epub, only the last sentence is italic.

OK, I know what's happening here. The entire chapter, except for the final line, is wrapped in an <i> tag, i.e. the chapter looks like this:

<i>
<h3>May 2nd (Saturday) 00:31</h3>
<p>Exactly 15 minutes …
</p>
</i>
<p><i>I think that makes us a perfect match, … </i></p>

But one of the steps of WebToEpub is to “flatten” the HTML, so that all header tags are immediate children of the body, so the italic tag is being discarded.

In this case, rather than trying to fix WebToEpub, I'm going to suggest that the easiest fix is to edit the page on Baka-Tsuki, moving the <i> to after the </h3> tag.

I will attempt to make the change.

dreamer2908 commented 8 years ago

@dteviot

About inline images and div inside p, rather than changing the way you handle images, I think it's easier to find a suitable place for it. Either moving the image out of p before processing, or doing some sanity checking like whether div can really be inserted there would do.

About the italic stuff, well, it's indeed easy to fix the page on Baka-Tsuki. I already know what changes to make to the page, if you want to end the case with this. But erroneous html is everywhere (i isn't even allowed to wrap p), so some degree of error correction will be necessary eventually.

dteviot commented 8 years ago

@dreamer2908

You might like to take the latest version of the Sonako branch https://github.com/dteviot/WebToEpub/tree/sonako for a spin, I've been busy today.

rather than changing the way you handle images, I think it's easier to find a suitable place for it. Either moving the image out of p before processing

Yes, that's what I'm doing now. If the parent is a <p>, put the image before that tag.
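
A minimal sketch of that move, written in Python with the standard library's ElementTree purely for illustration (WebToEpub itself is a browser extension and works on the DOM). The function name `hoist_divs_out_of_paragraphs` is hypothetical, and the sketch assumes un-namespaced tag names.

```python
import xml.etree.ElementTree as ET

def hoist_divs_out_of_paragraphs(root):
    """Move each <div> that is a direct child of a <p> to just before that <p>."""
    # ElementTree elements have no parent pointer, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for p in list(root.iter("p")):
        container = parent_of.get(p)
        if container is None:
            continue  # the <p> is the root element; nowhere to move the <div>
        for div in [child for child in p if child.tag == "div"]:
            # Text that followed the <div> is stored in div.tail; keep it in the <p>.
            position = list(p).index(div)
            if position == 0:
                p.text = (p.text or "") + (div.tail or "")
            else:
                previous = p[position - 1]
                previous.tail = (previous.tail or "") + (div.tail or "")
            div.tail = None
            p.remove(div)
            # Insert the <div> immediately before the paragraph it came from.
            container.insert(list(container).index(p), div)
    return root

doc = ET.fromstring('<body><p>before<div class="svg_outer svg_inner"><svg/></div>after</p></body>')
hoist_divs_out_of_paragraphs(doc)
print([child.tag for child in doc])
```

The sketch leaves the (possibly now empty) <p> in place; whether to drop empty paragraphs afterwards is a separate decision.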

WebToEpub doesn't convert the deprecated u tag (underline)

It does now.

center tag isn't allowed in epub, too

Also fixed.
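
For readers following along, the u and center conversions amount to renaming the element and adding an inline style. The following Python sketch illustrates the idea under stated assumptions: the u replacement is the one requested at the top of this issue, the `<div style="text-align: center;">` replacement for center is an assumption, and `replace_deprecated_tags` is a hypothetical name rather than a WebToEpub function.

```python
import xml.etree.ElementTree as ET

# Replacement tag and inline style for each element epubcheck rejects.
# The <div> choice for <center> is an assumption of this sketch.
REPLACEMENTS = {
    "u": ("span", "text-decoration: underline;"),
    "center": ("div", "text-align: center;"),
}

def replace_deprecated_tags(root):
    """Rename <u> and <center> elements in place and give them an inline style."""
    for element in root.iter():
        if element.tag in REPLACEMENTS:
            new_tag, style = REPLACEMENTS[element.tag]
            element.tag = new_tag
            # Preserve any style the element already carried.
            existing = element.get("style")
            element.set("style", f"{existing} {style}" if existing else style)
    return root

doc = ET.fromstring("<body><p>Normal <u>underline</u></p><center>text</center></body>")
replace_deprecated_tags(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Note that turning a center that sits inside a p into a div would recreate the div-inside-p problem discussed above, so the two fixes would have to be applied in a sensible order.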

If you remove cite-accessibility-label (class), the Jump up text will stop popping up out of nowhere

Done

Invalid id in span tag inside h* tag are not fixed, like

Those links were only needed for the table of contents on the original page. As they're no longer needed (page is split on Header tags) I'm removing them. (At least, the code is now supposed to remove them.)

About the italic stuff, well, it's indeed easy to fix the page on Baka-Tsuki. I already know what changes to make to the page.

So do I. Looks like someone put an open italic command at the start of the preceding chapter, and didn't close it until the end of the following chapter. So there are two chapters in italics. In this case, I think it's an error by the translator. That is, the chapters are not supposed to be italic. I've sent a PM to the translator and we'll see what happens. My guess is nothing.

But erroneous html is everywhere (i isn't even allowed to wrap p), so some degree of error correction will be necessary eventually.

Agreed, error handling will be necessary. However, in this case, I think what the parser is doing is reasonable. (Discarding the weird italic tag.) But if you find other cases where the parser has problems please let me know.

dreamer2908 commented 8 years ago

@dteviot

I've checked out the latest sonako branch, and it seems to work as expected. GJ.

But if you find other cases where the parser has problems please let me know.

Well, if similar weirdness remaining unfixed is considered problems.

It seems that Baka-Tsuki would output weird html if a long section is italic/bold/etc and there's anything that is not text inside. Example: HEAVY_OBJECT:Volume11_Chapter_3#Part_12. The weirdness still remains in the generated epub.

This kind of usage of italic/bold is so awfully familiar that I'm afraid it's everywhere.

dteviot commented 8 years ago

@dreamer2908

It seems that Baka-Tsuki would output weird html if a long section is italic/bold/etc and there's anything that is not text inside. Example: HEAVY_OBJECT:Volume11_Chapter_3#Part_12. The weirdness still remains in the generated epub.

This kind of usage of italic/bold is so awfully familiar that I'm afraid it's everywhere.

I'm going to call it a bug. As this issue has so many items in it that I'm starting to lose track of them all, I'm raising this as a new issue.

dteviot commented 8 years ago

> I could have sworn i already fixed this in an earlier commit.

You tried, but it didn't work properly. There were two problems.

  1. It didn't always result in a valid id. (For example, Fate/Zero has IDs that start with a '-'.)
  2. It didn't update any hyperlinks that referred to the id.

> Uhhhh doesn't removing them also break the citations at the bottom of the page?

If you mean footnotes, I'm only removing the ids that are not referred to. Footnotes seem to have valid IDs.
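
The "only remove ids nothing refers to" rule can be sketched as below. This is an illustrative Python version, not the extension's code: `remove_unreferenced_ids` is a hypothetical name, and the sketch only considers same-document links of the form href="#...", ignoring links from other chapter files.

```python
import xml.etree.ElementTree as ET

def remove_unreferenced_ids(root):
    """Drop id attributes that no same-document link (href="#...") points at."""
    # Collect every fragment identifier that some <a> element links to.
    referenced = {
        a.get("href")[1:]
        for a in root.iter("a")
        if a.get("href", "").startswith("#")
    }
    for element in root.iter():
        element_id = element.get("id")
        if element_id is not None and element_id not in referenced:
            del element.attrib["id"]
    return root

doc = ET.fromstring(
    '<body><h3><span class="mw-headline" id="1st_time">1st time</span></h3>'
    '<p>text<a href="#cite_note-1">1</a></p>'
    '<ol><li id="cite_note-1">note</li></ol></body>'
)
remove_unreferenced_ids(doc)
print(ET.tostring(doc, encoding="unicode"))
```

With this input the heading's id is dropped while the footnote keeps its id, because the footnote is still the target of a link.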

On Sat, Jul 30, 2016 at 6:45 AM, Belldandu notifications@github.com wrote:

Invalid id in span tag inside h* tag are not fixed, like

Those links were only needed for the table of contents on the original page. As they're no longer needed (page is split on Header tags) I'm removing them. (At least, the code is now supposed to remove them.)

Uhhhh doesn't removing them also break the citations at the bottom of the page?

I could have sworn i already fixed this in an earlier commit.


dteviot commented 8 years ago

@dreamer2908

BTE-GEN moves up heading if higher levels are missing, i.e h2 to h1, h3 to h2 if there's no h1. Can this be considered?

Done in latest commit to Sonako branch.
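
As a rough illustration of heading promotion (a Python sketch under assumptions, not the code committed to the Sonako branch): find the highest heading level actually present and shift every heading up by the same amount so that it becomes h1. The name `promote_headings` is hypothetical, and this version does not close gaps between levels (an h1 followed directly by an h3 is left alone).

```python
import re
import xml.etree.ElementTree as ET

_HEADING = re.compile(r"h([1-6])")

def promote_headings(root):
    """Shift all headings up so the highest level present becomes h1."""
    headings = []
    for element in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        match = _HEADING.fullmatch(element.tag) if isinstance(element.tag, str) else None
        if match:
            headings.append((element, int(match.group(1))))
    if not headings:
        return root
    # Number of levels missing above the highest heading that is present.
    shift = min(level for _, level in headings) - 1
    for element, level in headings:
        element.tag = f"h{level - shift}"
    return root

doc = ET.fromstring("<body><h2>Volume</h2><p>text</p><h3>Chapter</h3></body>")
promote_headings(doc)
print([child.tag for child in doc])
```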

align attribute in p/span/div should be converted into css style text-align:

Any chance you can locate an example or two of this please? I haven't found an example yet.

dreamer2908 commented 8 years ago

@dteviot

Here: Leviathan:Volume_5_Afterword

I've just looked at the wikitext and it turns out that the translator used a weird way to right align text. Feel free to skip this.

belldandu commented 8 years ago

Just a heads up @dteviot collections hit so I didn't get in ;-; and I have a job interview this Friday at 1 pm. Also my computer broke.