Open dreamer2908 opened 8 years ago
Thanks for the list of special cases.
On Wed, Jun 29, 2016 at 4:04 PM, dreamer2908 notifications@github.com wrote:
- Images can be embedded in B-T stories as inline images instead of thumbnails. The resulting xhtml code will be (slightly) invalid if WebToEpub encounters this type of image: a div tag ends up inside a p tag.
Example: all non-gallery images here: Utsuro no Hako:Volume 1 https://www.baka-tsuki.org/project/index.php?title=Utsuro_no_Hako:Volume_1
Result xhtml code for the first image:
<p><div class="svg_outer svg_inner"><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" height="100%" width="100%" version="1.1" preserveAspectRatio="xMidYMid meet" viewBox="0 0 1368 1000"><image xlink:href="../Images/0006_Utsuro_no_..._vol1_pic1.jpg" height="1000" width="1368"/><desc>https://www.baka-tsuki.org/project/index.php?title=File:Utsuro_no_Hako_vol1_pic1.jpg</desc></svg></div> </p>
Epubcheck error message:
ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0000_Novel_Illustrations.xhtml(2,34): element "div" not allowed here; expected the element end-tag, text or element "a", "abbr", "acronym", "applet", "b", "bdo", "big", "br", "cite", "code", "del", "dfn", "em", "i", "iframe", "img", "ins", "kbd", "map", "noscript", "ns:svg", "object", "q", "samp", "script", "small", "span", "strong", "sub", "sup", "tt" or "var" (with xmlns:ns="http://www.w3.org/2000/svg")
- WebToEpub doesn't convert the deprecated u tag (underline) into a form suitable for epub.
<p>Normal <u>underline</u></p>
should become
<p>Normal <span style="text-decoration: underline;">underline</span></p>
Sample: same as above.
Epubcheck error message:
ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0001_Prologue.xhtml(4,85): element "u" not allowed anywhere; expected the element end-tag, text or element "a", "abbr", "acronym", "applet", "b", "bdo", "big", "br", "cite", "code", "del", "dfn", "em", "i", "iframe", "img", "ins", "kbd", "map", "noscript", "ns:svg", "object", "q", "samp", "script", "small", "span", "strong", "sub", "sup", "tt" or "var" (with xmlns:ns="http://www.w3.org/2000/svg")
- Invalid ids in span tags inside h* tags are not fixed, like
<h3><span class="mw-headline" id="1st_time">1<sup>st</sup> time</span></h3>
Epubcheck error message:
ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0002_1st_time.xhtml(1,497): value of attribute "id" is invalid; must be an XML name without colons
Side note: BTE-GEN converts it into <h3 id="1st_time">, but it's still not fixed, and not useful here.
Well, some more, but I lost the samples.
- The center tag isn't allowed in epub, either. <center>text</center> should become <p style="text-align: center;">text</p>
- The align attribute in p/span/div should be converted into the css style text-align:
- BTE-GEN moves headings up if higher levels are missing, i.e. h2 to h1, h3 to h2 if there's no h1. Can this be considered?
- In the list of references (translator's notes) on the B-T web site, the link that jumps back up to where the reference belongs has only a single ↑ symbol. The same is true of BTE-GEN's output. In WebToEpub's output, it becomes Jump up ↑. If you remove cite-accessibility-label (class), the Jump up text will stop popping up out of nowhere.
Full disclosure: I'm developing my own (not easy-to-use) Baka-Tsuki to epub converter, which is for freaks like me, and not for normal users at all.
One more (potential) issue, well, if you have free time to play with.
Images, both in thumbnail and inline form, can have a custom target link, rather than link to image page. I haven't seen anyone using it in B-T, so it's not really important.
Example: User_talk:Dreamer2908 https://www.baka-tsuki.org/project/index.php?title=User_talk:Dreamer2908. WebToEpub breaks pretty badly.
Yup. That's one of the two key problems I'm currently trying to solve for https://github.com/dteviot/WebToEpub/issues/9 Hopefully I'll have it solved by the end of the weekend.
Well @dreamer2908 [...] p and if so then move out and remove the p tag. @dteviot
@dteviot Will you be doing this or should I? Also, it would be nice if I could assign myself to certain issues.
@belldandu
> Will you be doing this or should i
If you want to do it, that's fine with me. As I've said, at the moment I'm trying to get "use URL to specify cover image" working. I think that's the highest gain item on the list currently.
> Also it would be nice if i could self assign myself to certain issues.
Fine with me. Tell me what I need to do to give you the rights and I'll get it done.
I should be able to if I'm contributor rank @dteviot
@belldandu you should be a contributor now. If not, please let me know.
Well, I'll just leave this here.
WebToEpub v0.0.8 encounters a parsing error on this page: Utsuro_no_Hako:Volume2_May_2 (and Utsuro_no_Hako:Volume2, which includes it).
Screenshot: https://i.imgur.com/guVoRXM.png
@dteviot I spelled that wrong, it's collaborator. @dreamer2908 I'm looking into that.
@belldandu try this: https://github.com/dteviot/WebToEpub/invitations
@dteviot there we go.
@dreamer2908
> WebToEpub v0.0.8 encounters a parsing error on this page: Utsuro_no_Hako:Volume2_May_2 (and Utsuro_no_Hako:Volume2, which includes it).
D'oh! Fixed. My apologies for not noticing this sooner.
@dteviot Thanks. I completely forgot about this.
I checked out version 0.0.0.14, and it indeed no longer throws errors. But I noticed something strange: texts in part "May 2nd (Saturday) 00:31" are all italic in Baka-Tsuki, but in the generated epub, only the last sentence is italic.
@dreamer2908
> I noticed something strange: texts in part "May 2nd (Saturday) 00:31" are all italic in Baka-Tsuki, but in the generated epub, only the last sentence is italic.
That's odd. I'll add investigating to my ToDo list.
@dreamer2908
I'm looking at fixing this issue
> Images can be embedded in B-T stories in form of inline images instead of thumbnails. The result xhtml code will be (slightly) invalid if WebToEpub encounters this type of images: div tag is inside p
This occurs because I'm wrapping the <svg> element in a <div class="svg_outer svg_inner">. I'm wrapping it in a <div> so that a style is applied to the <div>.
div.svg_outer {
    display: block;
    margin-bottom: 0;
    margin-left: 0;
    margin-right: 0;
    margin-top: 0;
    padding-bottom: 0;
    padding-left: 0;
    padding-right: 0;
    padding-top: 0;
    text-align: left;
}
div.svg_inner {
    display: block;
    text-align: center;
}
The reason I'm doing this is because Lord Simon told me to do this. (He's the one who wrote BTE-GEN.)
An obvious fix (to me) would be to not have a wrapping <div> tag and apply the style directly to the <svg> element. For that matter, I'm also puzzled why there's both a svg_outer and a svg_inner style.
Anyway, my knowledge of CSS is extremely limited (as you've probably guessed from my statements above), so I'm hoping you could tell me WHY Lord Simon told me to do this, and why the changes I've suggested would be a bad idea. Failing that, can you point me in the direction of some good CSS documentation?
Thanks for your time.
@dreamer2908
> I noticed something strange: texts in part "May 2nd (Saturday) 00:31" are all italic in Baka-Tsuki, but in the generated epub, only the last sentence is italic.
OK, I know what's happening here. The entire chapter, except for the final line, is wrapped in an <i> tag, i.e. the chapter looks like this:
<i>
<h3>May 2nd (Saturday) 00:31</h3>
<p>Exactly 15 minutes …
</p>
</i>
<p><i>I think that makes us a perfect match, … </i></p>
But one of the steps of WebToEpub is to “flatten” the HTML, so that all header tags are immediate children of the body, so the italic tag is being discarded.
In this case, rather than trying to fix WebToEpub, I'm going to suggest that the easiest fix is to edit the page on Baka-Tsuki, moving the <i> to after the </h3> tag.
I will attempt to make the change.
@dteviot
About inline images and div inside p: rather than changing the way you handle images, I think it's easier to find a suitable place for it. Either moving the image out of p before processing, or doing some sanity checking like whether div can really be inserted there, would do.
About the italic stuff, well, it's indeed easy to fix the page on Baka-Tsuki. I already know what changes to make to the page, if you want to end the case with this. But erroneous html is everywhere (i isn't even allowed to wrap p), so some degree of error correction will be necessary eventually.
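As an illustration of what such error correction could look like (this is not WebToEpub's code), here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It pushes an inline wrapper such as i down into the block elements it wraps. The function name and the two tag sets are assumptions made for the example, and it deliberately skips any wrapper that has loose text or inline children of its own.

```python
import xml.etree.ElementTree as ET

# Assumed tag sets for this sketch; a real converter would need fuller lists.
INLINE_WRAPPERS = {"i", "b"}
BLOCKS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


def _has_text(value):
    """True if value contains anything other than whitespace."""
    return bool(value and value.strip())


def push_wrapper_into_blocks(root):
    """Rewrite <i><h3>..</h3><p>..</p></i> as <h3><i>..</i></h3><p><i>..</i></p>."""
    for parent in list(root.iter()):
        for wrapper in [c for c in list(parent) if c.tag in INLINE_WRAPPERS]:
            blocks = list(wrapper)
            # Only handle wrappers whose children are all block elements.
            if not blocks or any(b.tag not in BLOCKS for b in blocks):
                continue
            # Skip wrappers with loose text between the blocks.
            if _has_text(wrapper.text) or any(_has_text(b.tail) for b in blocks):
                continue
            for block in blocks:
                # Move the block's content into a fresh copy of the wrapper.
                inner = ET.Element(wrapper.tag, dict(wrapper.attrib))
                inner.text = block.text
                for grandchild in list(block):
                    block.remove(grandchild)
                    inner.append(grandchild)
                block.text = None
                block.append(inner)
            # Splice the blocks into the parent where the wrapper used to be.
            index = list(parent).index(wrapper)
            blocks[-1].tail = wrapper.tail
            parent.remove(wrapper)
            for offset, block in enumerate(blocks):
                parent.insert(index + offset, block)
    return root


doc = ET.fromstring(
    "<body><i><h3>May 2nd (Saturday) 00:31</h3><p>Exactly 15 minutes</p></i>"
    "<p><i>I think that makes us a perfect match</i></p></body>"
)
push_wrapper_into_blocks(doc)
```

After this runs, the h3 and the first p are direct children of body, so a later flattening step would no longer throw the italics away.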
@dreamer2908
You might like to take the latest version of the Sonako branch https://github.com/dteviot/WebToEpub/tree/sonako for a spin, I've been busy today.
> rather than changing the way you handle images, I think it's easier to find a suitable place for it. Either moving the image out of p before processing
Yes, that's what I'm doing now. If parent is a <p> put the image before the tag.
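The "put it before the p" step can be sketched roughly as follows. This is a standalone Python illustration with xml.etree.ElementTree, not the actual WebToEpub code; the function name and BLOCK_TAGS are assumptions, element names are assumed to be un-namespaced, and only direct children of p are handled.

```python
import xml.etree.ElementTree as ET

# Assumption: only <div> is treated as a block that must not sit inside <p>.
BLOCK_TAGS = {"div"}


def hoist_blocks_out_of_paragraphs(root):
    """Move block-level children of each <p> so they sit just before that <p>."""
    for parent in list(root.iter()):
        for p in [c for c in list(parent) if c.tag == "p"]:
            blocks = [c for c in list(p) if c.tag in BLOCK_TAGS]
            for block in blocks:
                siblings = list(p)
                i = siblings.index(block)
                # ElementTree stores the text after an element in its tail;
                # keep that text inside the paragraph before moving the block.
                tail = block.tail or ""
                if i == 0:
                    p.text = (p.text or "") + tail
                else:
                    siblings[i - 1].tail = (siblings[i - 1].tail or "") + tail
                block.tail = None
                p.remove(block)
                parent.insert(list(parent).index(p), block)
            # Drop the paragraph if hoisting left it empty.
            if blocks and len(p) == 0 and not (p.text or "").strip():
                blocks[-1].tail = p.tail
                parent.remove(p)
    return root


doc = ET.fromstring(
    '<body><p><div class="svg_outer svg_inner"><svg/></div> </p>'
    "<p>before <div>x</div> after</p></body>"
)
hoist_blocks_out_of_paragraphs(doc)
```

The first paragraph only held the image wrapper, so it disappears; the second keeps its text and just loses the div.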
> WebToEpub doesn't convert the deprecated u tag (underline)
It does now.
> center tag isn't allowed in epub, too
Also fixed.
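For readers following along, the u and center conversions requested in the issue can be sketched like this. It is a Python illustration using xml.etree.ElementTree under the assumption of un-namespaced element names; the REPLACEMENTS table and function name are made up for the example and say nothing about how WebToEpub does it. Note that turning center into p would still be invalid if the center sat inside another p.

```python
import xml.etree.ElementTree as ET

# Mapping taken from the issue text: deprecated tag -> (new tag, CSS declaration).
REPLACEMENTS = {
    "u": ("span", "text-decoration: underline;"),
    "center": ("p", "text-align: center;"),
}


def replace_deprecated_tags(root):
    """Rename deprecated presentational tags and express them as inline CSS."""
    for el in root.iter():
        if el.tag not in REPLACEMENTS:
            continue
        new_tag, css = REPLACEMENTS[el.tag]
        el.tag = new_tag
        # Append to any style the element already carries.
        existing = el.get("style", "").strip()
        if existing and not existing.endswith(";"):
            existing += ";"
        el.set("style", (existing + " " + css).strip())
    return root


doc = ET.fromstring(
    "<body><p>Normal <u>underline</u></p><center>text</center></body>"
)
replace_deprecated_tags(doc)
```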
> If you remove cite-accessibility-label (class), the Jump up text will stop popping up out of nowhere
Done
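A rough Python sketch of dropping the cite-accessibility-label elements, again with xml.etree.ElementTree and not taken from WebToEpub. The markup in the example is an assumption about what the back-link looks like; the point it shows is that the ↑ following the label has to be kept when the label element is removed.

```python
import xml.etree.ElementTree as ET


def remove_elements_with_class(root, class_name):
    """Delete every descendant carrying class_name, keeping the text that follows it."""
    for parent in list(root.iter()):
        for child in list(parent):
            if class_name not in child.get("class", "").split():
                continue
            index = list(parent).index(child)
            # The text after the removed element lives in its tail; re-attach it.
            tail = child.tail or ""
            if index == 0:
                parent.text = (parent.text or "") + tail
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
    return root


# Assumed shape of a Baka-Tsuki footnote back-link; \u2191 is the up arrow.
note = ET.fromstring(
    '<li id="cite_note-1"><a href="#cite_ref-1">'
    '<span class="cite-accessibility-label">Jump up </span>\u2191</a>'
    " note text</li>"
)
remove_elements_with_class(note, "cite-accessibility-label")
```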
> Invalid id in span tag inside h* tag are not fixed, like
Those links were only needed for the table of contents on the original page. As they're no longer needed (page is split on Header tags) I'm removing them. (At least, the code is now supposed to remove them.)
> About the italic stuff, well, it's indeed easy to fix the page on Baka-Tsuki. I already know what changes to make to the page.
So do I. Looks like someone put an open italic command at the start of the preceding chapter, and didn't close it until the end of the following chapter. So there are two chapters in italics. In this case, I think it's an error by the translator. That is, the chapters are not supposed to be italic. I've sent a PM to the translator and we'll see what happens. My guess is nothing.
> But erroneous html is everywhere (i isn't even allowed to wrap p), so some degree of error correction will be necessary eventually.
Agreed, error handling will be necessary. However, in this case, I think what the parser is doing is reasonable. (Discarding the weird italic tag.) But if you find other cases where the parser has problems please let me know.
@dteviot
I've checked out the latest sonako branch, and it seems to work as expected. GJ.
> But if you find other cases where the parser has problems please let me know.
Well, if similar weirdness remaining unfixed counts as a problem.
It seems that Baka-Tsuki would output weird html if a long section is italic/bold/etc and there's anything that is not text inside. Example: HEAVY_OBJECT:Volume11_Chapter_3#Part_12. The weirdness still remains in the generated epub.
This kind of usage of italic/bold is so awfully familiar that I'm afraid it's everywhere.
@dreamer2908
> It seems that Baka-Tsuki would output weird html if a long section is italic/bold/etc and there's anything that is not text inside. Example: HEAVY_OBJECT:Volume11_Chapter_3#Part_12. The weirdness still remains in the generated epub.
> This kind of usage of italic/bold is awfully familiar that I'm afraid it's everywhere.
I'm going to call it a bug. As this incident has so many issues in it that I'm starting to lose track of them all, I'm raising this as a new issue.
> I could have sworn i already fixed this in an earlier commit.
You tried, it didn't work properly. There were two problems.
> Uhhhh doesn't removing them also break the citations at the bottom of the page?
If you mean footnotes, I'm only removing the ids that are not referred to. Footnotes seem to have valid IDs.
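The "only remove ids nothing refers to" rule can be sketched as below. This is a hedged Python illustration with xml.etree.ElementTree, not the WebToEpub implementation: the function names are made up, only same-document links of the form href="#..." are counted as references, and the NCNAME pattern is a simplified ASCII-only stand-in for the real XML name rules.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only approximation of an XML name without colons.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_id(value):
    """Rough check that an id would pass as an XML name (e.g. no leading digit)."""
    return NCNAME.fullmatch(value) is not None


def referenced_ids(root):
    """Collect ids that some href="#id" link in the same document points at."""
    ids = set()
    for el in root.iter():
        href = el.get("href", "")
        if href.startswith("#"):
            ids.add(href[1:])
    return ids


def drop_unreferenced_ids(root):
    """Remove id attributes that no same-document link refers to."""
    keep = referenced_ids(root)
    for el in root.iter():
        if "id" in el.attrib and el.get("id") not in keep:
            del el.attrib["id"]
    return root


doc = ET.fromstring(
    '<body><h3><span class="mw-headline" id="1st_time">1st time</span></h3>'
    '<p>text<sup id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup></p>'
    '<ol><li id="cite_note-1"><a href="#cite_ref-1">^</a> note</li></ol></body>'
)
drop_unreferenced_ids(doc)
```

In this sample the heading id (which starts with a digit and is what epubcheck complained about) is dropped, while the footnote ids survive because links point at them.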
@dreamer2908
> BTE-GEN moves up heading if higher levels are missing, i.e h2 to h1, h3 to h2 if there's no h1. Can this be considered?
Done in latest commit to Sonako branch.
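One way to read "move headings up if higher levels are missing" is to shift every heading by the same amount so the highest level in use becomes h1. The Python sketch below (xml.etree.ElementTree, hypothetical function name) shows that interpretation only; whether BTE-GEN or the Sonako branch also closes gaps between levels is not stated here.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])")


def promote_headings(root):
    """Shift all headings up so that the top level in use becomes h1."""
    found = []
    for el in root.iter():
        match = HEADING.fullmatch(el.tag)
        if match:
            found.append((el, int(match.group(1))))
    if not found:
        return root
    # With h2 as the highest level present, shift is 1: h2 -> h1, h3 -> h2.
    shift = min(level for _, level in found) - 1
    for el, level in found:
        el.tag = f"h{level - shift}"
    return root


doc = ET.fromstring("<body><h2>Volume</h2><p>x</p><h3>Chapter</h3></body>")
promote_headings(doc)
```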
> align attribute in p/span/div should be converted into css style text-align:
Any chance you can locate an example or two of this please? I haven't found an example yet.
@dteviot
Here: Leviathan:Volume_5_Afterword
I've just looked at the wikitext and it turns out that the translator used a weird way to right align text. Feel free to skip this.
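In case the align conversion does get picked up later, a minimal Python sketch of it follows, using xml.etree.ElementTree. The function name and the list of accepted values are assumptions, and text-align on an inline span may have no visible effect in a reader, so treat this as an illustration of the attribute rewrite only.

```python
import xml.etree.ElementTree as ET

ALIGNABLE = {"p", "span", "div"}
# Assumed set of values worth carrying over to CSS.
ALLOWED = {"left", "right", "center", "justify"}


def align_attribute_to_style(root):
    """Replace align="..." on p/span/div with an inline text-align declaration."""
    for el in root.iter():
        if el.tag not in ALIGNABLE or "align" not in el.attrib:
            continue
        value = el.attrib.pop("align").strip().lower()
        if value not in ALLOWED:
            continue  # unknown value: the attribute is dropped, no style added
        existing = el.get("style", "").strip()
        if existing and not existing.endswith(";"):
            existing += ";"
        el.set("style", f"{existing} text-align: {value};".strip())
    return root


doc = ET.fromstring(
    '<body><p align="right">signature</p>'
    '<div align="CENTER" style="color: red">x</div></body>'
)
align_attribute_to_style(doc)
```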
Just a heads up @dteviot collections hit so I didn't get in ;-; and I have a job interview this Friday at 1 pm. Also my computer broke.
- Images can be embedded in B-T stories as inline images instead of thumbnails. The resulting xhtml code will be (slightly) invalid if WebToEpub encounters this type of image: a div tag ends up inside a p tag.
Example: all non-gallery images here: Utsuro no Hako:Volume 1
Result xhtml code for the first image:
<p><div class="svg_outer svg_inner"><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" height="100%" width="100%" version="1.1" preserveAspectRatio="xMidYMid meet" viewBox="0 0 1368 1000"><image xlink:href="../Images/0006_Utsuro_no_..._vol1_pic1.jpg" height="1000" width="1368"/><desc>https://www.baka-tsuki.org/project/index.php?title=File:Utsuro_no_Hako_vol1_pic1.jpg</desc></svg></div> </p>
Epubcheck error message:
ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0000_Novel_Illustrations.xhtml(2,34): element "div" not allowed here; expected the element end-tag, text or element "a", "abbr", "acronym", "applet", "b", "bdo", "big", "br", "cite", "code", "del", "dfn", "em", "i", "iframe", "img", "ins", "kbd", "map", "noscript", "ns:svg", "object", "q", "samp", "script", "small", "span", "strong", "sub", "sup", "tt" or "var" (with xmlns:ns="http://www.w3.org/2000/svg")
- WebToEpub doesn't convert the deprecated u tag (underline) into a form suitable for epub.
<p>Normal <u>underline</u></p>
should become
<p>Normal <span style="text-decoration: underline;">underline</span></p>
Sample: same as above.
Epubcheck error message:
ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0001_Prologue.xhtml(4,85): element "u" not allowed anywhere; expected the element end-tag, text or element "a", "abbr", "acronym", "applet", "b", "bdo", "big", "br", "cite", "code", "del", "dfn", "em", "i", "iframe", "img", "ins", "kbd", "map", "noscript", "ns:svg", "object", "q", "samp", "script", "small", "span", "strong", "sub", "sup", "tt" or "var" (with xmlns:ns="http://www.w3.org/2000/svg")
- Invalid ids in span tags inside h* tags are not fixed, like
<h3><span class="mw-headline" id="1st_time">1<sup>st</sup> time</span></h3>
Epubcheck error message:
ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0002_1st_time.xhtml(1,497): value of attribute "id" is invalid; must be an XML name without colons
Side note: BTE-GEN converts it into <h3 id="1st_time">, but it's still not fixed, and not useful here.
Well, some more, but I lost the samples.
- The center tag isn't allowed in epub, either. <center>text</center> should become <p style="text-align: center;">text</p>
- The align attribute in p/span/div should be converted into the css style text-align:
- BTE-GEN moves headings up if higher levels are missing, i.e. h2 to h1, h3 to h2 if there's no h1. Can this be considered?
- In the list of references (translator's notes) on the B-T web site, the link that jumps back up to where the reference belongs has only a single ↑ symbol. The same is true of BTE-GEN's output. In WebToEpub's output, it becomes Jump up ↑. If you remove cite-accessibility-label (class), the Jump up text will stop popping up out of nowhere.
Full disclosure: I'm developing my own (not easy-to-use) Baka-Tsuki to epub converter, which is for freaks like me, and not for normal users at all.