dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.
Other
731 stars 139 forks source link

br tags #34

Closed toshiya44 closed 8 years ago

toshiya44 commented 8 years ago

In BakaTsukiParser.js it says, // discard br tags as epubcheck says they are invalid in the places they are at in xhtml util.removeElements(util.getElements(element, "br"));

br tags are not invalid, they only need to be closed properly in xhtml, like <br/> In Baka-Tsuki pages br tags show up as <br>, that's why epubcheck didn't like it. Can you please replace all <br> by <br/> instead of removing it?

belldandu commented 8 years ago

As far as i can tell baka-tsuki literally does <br> which is not allowed in xhtml. Not including any special cases.

belldandu commented 8 years ago

Welp

// Replace elements that are invalid in xhtml with their valid counter parts
BakaTsukiParser.prototype.replaceInvalidElements = function(element) {
    // replace br tags
    let brs = util.getElements(element, "br");
    for(let br of brs) {
        let newBr = document.createElement("br");
        br.parentNode.replaceChild(newBr, br);
    }
}

Still gives bogus error about <br /> being incorrect @dteviot i believe a doctype tag is in order.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
belldandu commented 8 years ago

Although doing this means we will have to import nodes before appending them

dteviot commented 8 years ago

@belldandu & @toshiya44

My understanding is that the only problem this is giving at the moment is epubcheck is reporting this as a violation. Readers are not having a problem with it. As such, I'm regarding it as a "trivial" problem and I'm ignoring it until more critical issues are fixed. See: my most recent post to https://baka-tsuki.org/forums/viewtopic.php?f=84&t=16114&start=45#p271810

If my understanding is incorrect, please let me know.

On Fri, Jul 1, 2016 at 6:39 AM, Belldandu notifications@github.com wrote:

Although doing this means we will have to import nodes before appending them

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dteviot/WebToEpub/issues/34#issuecomment-229750557, or mute the thread https://github.com/notifications/unsubscribe/AE6w2XlWrpMUkw5QDQp86Jgxvc3f5KGIks5qRA1lgaJpZM4JB7GM .

belldandu commented 8 years ago

@dteviot well actually i already figured out a way to fix the br tags without breaking dom minipulation (its a bit hacky like the xml declaration thing)

Main problem for what i was doing before (removing them) is that it completely removed some line breaks that where needed in some places.

Now what i do is i replace the br with the proper br tag and then append a docType that accepts the new br tag to the xhtml immediately after the xml declaration.

This way it doesn't affect our dom manipulation and epub readers / epubcheck wont go nuts about the tag not being allowed where its at.

I will push the pull request in a bit.

dteviot commented 8 years ago

I thought I was already adding that doc type tag. But maybe I was doing it at end, just before writing to the epub zip. In which case, need to do it when create initial xhtml file to fill in util. Then chrome should automatically close the tag.

On Fri, Jul 1, 2016 at 8:07 AM, Belldandu notifications@github.com wrote:

@dteviot https://github.com/dteviot well actually i already figured out a way to fix the br tags without breaking dom minipulation (its a bit hacky like the xml declaration thing)

Main problem for what i was doing (removing them) is that it completely removed some line breaks that where needed in some places.

Now what i do is i replace the br with the proper tag and then append a docType that accepts this br tag to the xhtml immediatly after the xml declaration.

This way it doesn't affect our dom manipulation and epub readers / epubcheck wont go nuts about the tag not being allowed where its at.

I will push the pull request in a bit.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dteviot/WebToEpub/issues/34#issuecomment-229773050, or mute the thread https://github.com/notifications/unsubscribe/AE6w2TnlulYZ13xJBiEAcBsG4Sx-ym4Yks5qRCHwgaJpZM4JB7GM .

belldandu commented 8 years ago

You create a document however none of the xhtml files have an actual doctype. There are a couple of other things that go along with this. Almost done with PR with push in 5 minutes @dteviot

belldandu commented 8 years ago

@dteviot @toshiya44 see #35

dteviot commented 8 years ago

@belldandu BakaTsukiParser.replaceInvalidElements() which is supposed to close <br> tags does not appear to do anything useful.

See this test

test("replaceInvalidElements", function (assert) {
    let dom = new DOMParser().parseFromString(
        "<html>" +
            "<head><title></title></head>" +
            "<body>" +
                "<p>SomeText</p>" +
                "<br>" +
                "<p>More</p>" +
            "</body>" +
        "</html>",
        "text/html");
    let parser = new BakaTsukiParser();
    let content = dom.body.cloneNode(true);
    parser.replaceInvalidElements(content);

    // this assert fails.
    assert.equal(content.innerHTML, "<p>SomeText</p><br /><p>More</p>");
});

I think Chrome automatically closes the <br> tag when converting nodes from html to xhtml

See this test

test("replaceInvalidElements", function (assert) {
    let dom = new DOMParser().parseFromString(
        "<html>" +
            "<head><title></title></head>" +
            "<body>" +
                "<p>SomeText</p>" +
                "<br>" +
                "<p>More</p>" +
            "</body>" +
        "</html>",
        "text/html");
    let parser = new BakaTsukiParser();
    let content = dom.body.cloneNode(true);

    let xhtml = util.createEmptyXhtmlDoc();
    let body = xhtml.getElementsByTagName("body")[0];
    body.parentNode.replaceChild(content, body);

    assert.equal(xhtml.getElementsByTagName("body")[0].outerHTML, 
        "<body xmlns=\"http://www.w3.org/1999/xhtml\"><p>SomeText</p><br /><p>More</p></body>");
});

I also checked a Baka-Tsuki story with <br> tags and confirmed they are being closed in the generator's output when replaceInvalidElements() has been removed. I think we need to look more carefully at what epubcheck is complaining about. Possibly all that's needed is the doctype element.

belldandu commented 8 years ago

@dteviot weird and ok.

belldandu commented 8 years ago

@dteviot @toshiya44 can this be closed?

dteviot commented 8 years ago

@belldandu @toshiya44

can this be closed?

Only if epubcheck is now happy with the <br /> tags we're now using. If not, that needs to be fixed. Note, at current time I have not set up epubcheck to run on my system.

belldandu commented 8 years ago

@dteviot will check tommorow

belldandu commented 8 years ago

@dteviot

[kami@Index ~]$ epubcheck Toaru_Maju...dexVolume1.epub 
Validating using EPUB version 2.0.1 rules.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0006_Part_5.xhtml(66,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0006_Part_5.xhtml(66,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0008_Part_7.xhtml(13,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0008_Part_7.xhtml(13,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0008_Part_7.xhtml(72,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0008_Part_7.xhtml(72,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0009_Chapter_2_...e_7th-Egde.xhtml(59,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0009_Chapter_2_...e_7th-Egde.xhtml(59,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0009_Chapter_2_...e_7th-Egde.xhtml(134,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0009_Chapter_2_...e_7th-Egde.xhtml(134,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0010_Part_2.xhtml(173,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0010_Part_2.xhtml(173,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0018_Epilogue_T...ohibitorum.xhtml(35,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0018_Epilogue_T...ohibitorum.xhtml(35,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0018_Epilogue_T...ohibitorum.xhtml(108,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0018_Epilogue_T...ohibitorum.xhtml(108,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0018_Epilogue_T...ohibitorum.xhtml(112,7): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/Text/0018_Epilogue_T...ohibitorum.xhtml(112,115): Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.

Check finished with errors

epubcheck completed

Welp

dteviot commented 8 years ago

Was extension running under Chrome or Firefox? As noted in https://github.com/dteviot/WebToEpub/issues/49 I suspect Chrome closes <br> elements for us, but Firefox does not.

belldandu commented 8 years ago

It was run under chromium @dteviot

chromium

dteviot commented 8 years ago

@belldandu Can you send me some of the offending files? e.g. 0006_Part_5.xhtml and 0008_Part_7.xhtml, I'd like to take a closer look at exactly where/what the problem is.

belldandu commented 8 years ago

I know exactly what the problem is. XD the change i made before where i just replaced the BR tags had actually fixed that issue. @dteviot The problem is inconsistencies with the BR tags on baka-tsuki.

dteviot commented 8 years ago

I want to confirm the problem that epubcheck is reporting is <br> elements that are not closed. And not something else. I've seen Chrome closing the tags, but if Chromium is not closing them, then it's possible that Chrome doesn't either, at least in some cases. Deleting and re-creating the elements may result in closed tags, if it's creating XTHML elements, not HTML elements.
I suspect the correct way to solve the problem is using importNode() https://developer.mozilla.org/en-US/docs/Web/API/Document/importNode when we copy the "content" from the Baka-Tsuki HTML to the EPUB's XHTML. example: http://stackoverflow.com/questions/12092532/how-to-convert-html-to-valid-xhtml. Note, I also need to create the empty XHTML document correctly, using createElementNS(), and it's family.

Note, I have not yet tested this to confirm. At moment, it's just a working hypothesis.

toshiya44 commented 8 years ago

I believe this issue is now fixed too. The plugin in sonako branch closes br tags properly.

dteviot commented 8 years ago

I believe I now understand where the <br> error is coming from. epubcheck wants the <br> enclosed in a <p> tag. e.g. This XHTML fragment

<body>
    <p>Some Text</p>
    <br />
    <p>More Text</p>
</body>

gives the error message

Error while parsing file 'element "br" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")'.

This XHTML fragment does not

<body>
    <p>Some Text</p>
    <p><br /></p>
    <p>More Text</p>
</body>

I've logged the issue with epubcheck https://github.com/IDPF/epubcheck/issues/700, and I'll see what they come back with. Note, they have had a similar issue in the past. https://github.com/IDPF/epubcheck/issues/209

dteviot commented 8 years ago

@toshiya44, @belldandu Well, I got a response from the IDPF team

If you're creating an EPUB 2, then that's correct. The br element isn't allowed as a child of the body in XHTML 1.1. It's not an error in HTML5, and I can't reproduce in a conforming EPUB 3 publication.

I'm going to class this as a "won't fix". The behaviour is technically wrong, but:

  1. The <br /> tags ARE closed,
  2. The vast majority of epub viewers handle it without a problem.
  3. The effort required to determine if each <br /> tag is a child of the body or not is probably not worth it. There are more useful things I can do with my time.
belldandu commented 8 years ago

@dteviot Agreed.

If i have time when i get back in 6 months and all the other issues have been fixed i might take a crack at it. But for now i agree with the "wontfix".