Open flittle8 opened 5 years ago
@zwettemaan do you need more details about this one from me?
Ok, I added a new DropScript called 'MakeBreaksConform'. This does a search-and-replace for <a> tags of a very specific shape and replaces them with spans.
See
I'll now make another DropScript to build the page list.
Hi @flittle8, I think I got something working. It seems to work OK on El Capitan too.
There are two parts to this:
MakeBreaksConform will perform multiple search-and-replace and is meant to 'massage' the page breaks into a consistent, conformant break.
These might need some further tweaking, but they do work on the sample files. See the config file:
If you look at the config file, I've documented the subsequent steps. Each search and replace is performed on the whole HTML file, and then the next search and replace...
1) convert any <a....></a> into <a.../> notation (collapse empty a tags)
2) convert any <a.../> into <a... /> (add a space before the slash)
3) convert any <span....></span> into <span.../> notation (collapse empty span tags)
4) convert any <span.../> into <span... /> (add a space before the slash)
5) convert any a tags that have an ID that indicates them being a page break into span tags and add 'epub:type="pagebreak"' to these span tags
6) find any span tags that already have an aria-label and rename them to Span (uppercase 'S'). This takes them out of the running for further span-matches in my list
7) add an aria-label to any span tags that have an ID that indicates them being a page break
8) find any span tags that already have a role and rename them to SPan (uppercase 'SP'). This takes them out of the running for further span-matches in my list
9) insert a role attribute in any page break spans that are not 'Span' or 'SPan' yet
10) finally, rename SPan and Span back to span, so all are now back to lowercase
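As a rough illustration of how such a chain of whole-file replacements behaves, here is a minimal Python sketch. It is not the actual MakeBreaksConform config: the patterns, the function name make_breaks_conform, and the assumption that page-break IDs look like id="page72" are all made up for the example. It covers steps 1 to 7 and 10 only; the role handling in steps 8 and 9 would follow the same renaming trick.

```python
import re

# Each rule is (pattern, replacement). Rules run in order over the whole file,
# and each rule sees the output of the previous one.
RULES = [
    # Steps 1 and 3: collapse empty <a ...></a> and <span ...></span>.
    (r'<(a|span)\b([^<>]*?)\s*></\1>', r'<\1\2/>'),
    # Steps 2 and 4: make sure there is exactly one space before the slash.
    (r'<(a|span)\b([^<>]*?)\s*/>', r'<\1\2 />'),
    # Step 5: page-break anchors become spans carrying epub:type.
    # Assumption: a page-break ID starts with "page".
    (r'<a\b([^<>]*\sid="page[^"]*"[^<>]*?) />',
     r'<span epub:type="pagebreak"\1 />'),
    # Step 6: park spans that already have an aria-label under the name "Span".
    (r'<span\b([^<>]*\saria-label=)', r'<Span\1'),
    # Step 7: remaining page-break spans get an aria-label derived from the ID.
    (r'<span\b([^<>]*\sid="page([^"]*)"[^<>]*?) />',
     r'<span\1 aria-label="\2" />'),
    # Step 10: restore the parked spans to lowercase.
    (r'<Span\b', '<span'),
]


def make_breaks_conform(xhtml: str) -> str:
    """Apply every rule to the whole text, one rule after another."""
    for pattern, replacement in RULES:
        xhtml = re.sub(pattern, replacement, xhtml)
    return xhtml


if __name__ == "__main__":
    sample = '<p>Text<a id="page72"></a> more</p>'
    print(make_breaks_conform(sample))
```

Renaming already-handled tags to 'Span' stands in for a negative match: later patterns are case-sensitive, so they skip the parked tags until the last rule renames them back.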
The second part is
It will generate a list of all conformant page breaks. The Completion Report now has a zone for reporting, so you can copy-paste the results into a text editor or so.
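For readers who want to see what a read-only report of this kind could look like, here is a small Python sketch. It is an assumption-laden stand-in, not the DropScript itself: the function name report_page_breaks and the rule that a 'conformant' break is a span with epub:type="pagebreak", an id, and an aria-label are choices made for the example. The output format mirrors the 'file: label' lines shown later in this thread.

```python
import re

# A span tag that carries epub:type="pagebreak" (self-closing or not).
PAGEBREAK_SPAN = re.compile(r'<span\b[^<>]*\sepub:type="pagebreak"[^<>]*>')
ID_ATTR = re.compile(r'\sid="([^"]*)"')
LABEL_ATTR = re.compile(r'\saria-label="([^"]*)"')


def report_page_breaks(filename: str, xhtml: str) -> list[str]:
    """Return one 'file: label' line per conformant page break.

    The input text is only read, never modified.
    """
    lines = []
    for tag in PAGEBREAK_SPAN.findall(xhtml):
        label = LABEL_ATTR.search(tag)
        anchor = ID_ATTR.search(tag)
        # Assumption: a break without id or aria-label is not conformant.
        if label and anchor:
            lines.append(f"{filename}: {label.group(1)}")
    return lines
```

Because nothing is written back, a function like this can be run repeatedly without any risk to the EPUB.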
@zwettemaan Thanks. Should I test this out as is, or wait till it's combined into a single script?
I'd prefer to keep it as two separate scripts. The first one is 'modifying' - i.e. it modifies the EPUB, and the output must be verified by human eyeballs. It's just a bunch of mindless GREP statements, and those cannot be trusted on their own without some oversight.
The second script is 'reporting'. It does not modify the EPUB at all, so you can run it over and over again and not worry about what it might have done to the EPUB.
@zwettemaan Ok, I'll test it out. I assume the 2nd script will insert the identified page breaks into the NAV as a page list?
@flittle8 No, sorry, I did not get that far. The output is shown in the window. From there you'll need to copy-paste that into Sublime or so and do some find/replace.
@zwettemaan a couple things I noticed:
The MakeBreaksConform script generated this:
<span role="doc-pagebreak" aria-label="72" epub:type="pagebreak" id="page72" title="72"></span>
which should be this instead:
<span role="doc-pagebreak" aria-label="72" epub:type="pagebreak" id="page72" title="72" />
For the SearchandReport script results, I was expecting content that I could paste into a Page List in the NAV. Could you generate something like this:
<li><a href="../Text/KidsInTheHall_ISTC_int_ebook-34.xhtml#pagei">i</a></li>
<li><a href="../Text/KidsInTheHall_ISTC_int_ebook-34.xhtml#pageii">ii</a></li>
<li><a href="../Text/KidsInTheHall_ISTC_int_ebook-34.xhtml#pageiii">iii</a></li>
<li><a href="../Text/KidsInTheHall_ISTC_int_ebook-1.xhtml#pageiv">iv</a></li>
instead of:
KidsInTheHall_ISTC_int_ebook-34.xhtml: i
KidsInTheHall_ISTC_int_ebook-34.xhtml: ii
KidsInTheHall_ISTC_int_ebook-34.xhtml: iii
KidsInTheHall_ISTC_int_ebook-5.xhtml: viii
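The conversion from 'file: label' pairs to list items is mechanical, so it can be sketched in a few lines of Python. This is only an illustration: the function name page_list_items, the "../Text/" prefix, and the assumption that every anchor is "page" plus the label (as in the samples above) are not taken from the actual script.

```python
from html import escape


def page_list_items(entries, text_dir="../Text/"):
    """Turn (filename, label) pairs into <li> lines for a page list.

    Assumption: the target id is always "page" + label, e.g. "pageiv".
    """
    items = []
    for filename, label in entries:
        href = escape(f"{text_dir}{filename}#page{label}", quote=True)
        items.append(f'<li><a href="{href}">{escape(label)}</a></li>')
    return items


if __name__ == "__main__":
    for line in page_list_items([
        ("KidsInTheHall_ISTC_int_ebook-34.xhtml", "i"),
        ("KidsInTheHall_ISTC_int_ebook-34.xhtml", "ii"),
    ]):
        print(line)
```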
The results of this script were displayed out of order, which makes them harder to use. Could you display them in order of occurrence in the book?
KidsInTheHall_ISTC_int_ebook-31.xhtml: 332
KidsInTheHall_ISTC_int_ebook-31.xhtml: 333
KidsInTheHall_ISTC_int_ebook-31.xhtml: 334
KidsInTheHall_ISTC_int_ebook-31.xhtml: 335
KidsInTheHall_ISTC_int_ebook-32.xhtml: 337
KidsInTheHall_ISTC_int_ebook-34.xhtml: i
KidsInTheHall_ISTC_int_ebook-34.xhtml: ii
KidsInTheHall_ISTC_int_ebook-34.xhtml: iii
KidsInTheHall_ISTC_int_ebook-5.xhtml: viii
KidsInTheHall_ISTC_int_ebook-5.xhtml: ix
KidsInTheHall_ISTC_int_ebook-6.xhtml: xi
KidsInTheHall_ISTC_int_ebook-6.xhtml: xii
KidsInTheHall_ISTC_int_ebook-6.xhtml: xiii
KidsInTheHall_ISTC_int_ebook-7.xhtml: 1
KidsInTheHall_ISTC_int_ebook-7.xhtml: 2
KidsInTheHall_ISTC_int_ebook-7.xhtml: 3
KidsInTheHall_ISTC_int_ebook-7.xhtml: 4
Hi Farrah, can you provide a sample? I've put in GREP that was supposed to make <span />. If that's not happening, I need to see samples to figure out why that is.
Outputting a tag list is not that hard - I can probably add that without spending too much time. Ordering the entries, however, is not simple, for a number of reasons, and it would take me multiple hours of work. First, the scripts currently work 'per-file', but to order the entries we need to look across all files, and that is a feature I would need to add to DropToScript; right now the files are processed in random order. Second, ordering by roman numerals is not straightforward: I need to temporarily convert them to decimal in order to sort.
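The roman-numeral part of the sorting problem can be sketched as follows in Python. This is a hypothetical helper, not part of DropToScript, and it assumes that roman-numbered pages (front matter) come before arabic-numbered pages, which matches most books but not necessarily every one. It sorts by page label only and does not address the cross-file ordering issue.

```python
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a roman numeral such as 'xiv' to its decimal value."""
    total = 0
    largest_seen = 0
    # Walk right to left: a symbol smaller than one already seen is subtracted.
    for symbol in reversed(numeral.lower()):
        value = ROMAN_VALUES[symbol]
        if value < largest_seen:
            total -= value
        else:
            total += value
            largest_seen = value
    return total


def page_sort_key(label: str):
    """Sort key: roman labels first, then arabic, then anything else."""
    if label.isdigit():
        return (1, int(label))
    if label and all(ch in ROMAN_VALUES for ch in label.lower()):
        return (0, roman_to_int(label))
    return (2, 0)


if __name__ == "__main__":
    print(sorted(["332", "i", "viii", "ix", "1", "xi"], key=page_sort_key))
```

Labels that are neither arabic nor roman keep their relative order at the end, because Python's sort is stable.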
@zwettemaan I can mail you this EPUB as well? It's strange - when I open the output EPUB in Sigil I see something like this: <span role="doc-pagebreak" aria-label="133" epub:type="pagebreak" id="page133" title="133"></span>.
When I unzip the EPUB and open up an individual HTML file, I see this: <span role="doc-pagebreak" aria-label="133" epub:type="pagebreak" id="page133" title="133" />
Outputting a tag list is not that hard - I can probably add that without spending too much time.
Thanks!
I've noticed that Sigil often 'lies': it often shows tags differently than what the real tag in the file is. It does the same with the headers: it seems to show headers that are not there. Some of the first files you sent me (forgot which) were different when unzipped vs. Sigil. I think Sigil actually runs the HTML through some process before showing it.
Yes. Can you get it to work on Sigil?
I am not sure what you mean? Sigil does what Sigil does. I think it is trying to be helpful, in case you want to type something in the span.
Try the following: open an EPUB in Oxygen (or decompress it). Put in a <span />.
Save (or recompress).
Open in Sigil. It will show <span></span>.
Close without saving and re-open in Oxygen. It will show <span />.
Oxygen shows you the truth - i.e. what is in the file (and what the scripts will process).
Sigil does not show you what is in the file. It pre-processes the file before letting you edit, so what you see in Sigil is not what is in the underlying file. That's how Sigil works. Once you save from Sigil, the span will be converted to an open-close span.
You can re-run the MakeBreaksConform script to undo the damage caused by Sigil, but as soon as you resave from Sigil, it will re-do the damage.
I cannot change the fact that Sigil does not respect the reality of the file. That's how Sigil was made.
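One way to check what is really in the files, without going through any editor, is to read the XHTML straight out of the EPUB (which is a zip archive). The Python sketch below counts self-closing versus open-close empty spans per file. The function name count_empty_span_forms and the file-extension filter are assumptions made for this example.

```python
import re
import zipfile

# <span ... /> as stored on disk.
SELF_CLOSING = re.compile(r'<span\b[^<>]*/>')
# <span ...></span> with nothing between the tags.
OPEN_CLOSE = re.compile(r'<span\b(?:[^<>]*[^<>/])?></span>')


def count_empty_span_forms(epub):
    """Map each (X)HTML file in the EPUB to (self_closing, open_close) counts.

    `epub` may be a path or a binary file-like object; the archive is only read.
    """
    counts = {}
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                text = archive.read(name).decode("utf-8", errors="replace")
                counts[name] = (
                    len(SELF_CLOSING.findall(text)),
                    len(OPEN_CLOSE.findall(text)),
                )
    return counts
```

Running this before and after a save from Sigil would show whether the spans were rewritten on disk.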
@zwettemaan I ran this script on another EPUB and I get the message below. It seems to be that header issue, but for this particular page only. The same error occurs when I open the EPUB after running the LangTag script.
Also, the lang="en" wasn't inserted in the html for this page #17
@flittle8 Can you send me the file? I cannot see the XHTML for the page copy.html in your screenshot, so I have no idea what might be wrong. The page shown in your screenshot is cover.html, not copy.html
let me know if you still need the file
Yeah, I'll need the file. I don't see anything obviously wrong.
Hmm... According to epubcheck it's the 'epub:...' prefix in the spans. If I adjust the GREP patterns by adding an additional entry
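{
    /*
     * Eliminate epub:type.
     * No \b after the '=': the next character is a quote, and there is no
     * word boundary between '=' and a quote, so a trailing \b would never match.
     */
    "from": "~\\bepub:type=~i",
    "to": "data-epub-type="
}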
which converts any epub:type="...
into data-epub-type="...
then Sigil won't complain.
Isn't that epub:... an EPUB 3 thing, which causes problems in an EPUB 2 file?
@zwettemaan Hmm. ARIA, like role="doc-pagebreak", is an EPUB 3 thing. epub:type semantics are fine with EPUB 2 & 3 (as far as I know...)
This is what epubcheck says:
winnie:epubcheck-4.2.0 kris$ java -jar epubcheck.jar /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub
Validating using EPUB version 2.0.1 rules.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/halftitle.html(9,99): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/halftitle.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/title.html(9,109): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/title.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/copy.html(9,109): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/copy.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/dedi.html(9,105): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/dedi.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/contents.html(9,112): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/contents.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/illu.html(9,110): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/illu.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/preface.html(9,109): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/preface.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/part01.html(9,107): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/part01.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch01.html(9,105): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch01.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch02.html(9,107): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch02.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/part02.html(9,109): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/part02.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch03.html(9,107): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch03.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch04.html(9,106): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch04.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/coda.html(9,108): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/coda.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/notes.html(9,110): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/notes.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/bib.html(9,108): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/bib.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ack.html(9,108): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ack.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
Check finished with errors
Messages: 17 fatals / 17 errors / 0 warnings / 0 infos
EPUBCheck completed
winnie:epubcheck-4.2.0 kris$
The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound
So, something is needed. Maybe a namespace definition that should be added to the header?
Hmm. I ran other EPUB 2's through this script without issue
maybe you need to add xmlns:epub="http://www.idpf.org/2007/ops"?
That's what Cleaner.php does. Try running it through Cleaner. That fixes it for me (it enforces an html tag with that name space included).
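To make the namespace idea concrete, here is a minimal Python sketch that adds the binding to the opening html tag when it is missing. It is not how Cleaner.php is implemented (that code is not shown in this thread); the function name bind_epub_prefix and the regex approach are assumptions for illustration. It only addresses the 'prefix is not bound' parsing error; whether a validator then accepts the attribute under EPUB 2 rules is a separate question.

```python
import re

# Namespace declaration suggested earlier in this thread.
EPUB_NS = 'xmlns:epub="http://www.idpf.org/2007/ops"'
HTML_OPEN = re.compile(r'<html\b([^<>]*)>')


def bind_epub_prefix(xhtml: str) -> str:
    """Add the xmlns:epub declaration to the <html> tag if it is missing."""
    def add_namespace(match):
        attributes = match.group(1)
        if "xmlns:epub=" in attributes:
            return match.group(0)  # already bound, leave the tag untouched
        return f"<html{attributes} {EPUB_NS}>"

    # Only the first <html ...> tag is considered.
    return HTML_OPEN.sub(add_namespace, xhtml, count=1)
```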
I ran it through Cleaner.php but it didn't fix the issue...
it's just that one HTML file, and the other pages are all okay it seems
epubcheck also doesn't like the aria-label on the page breaks, I guess because it's an EPUB 2... I think anyone using this script should make sure they've updated their EPUB to v3 before running it?
Yes, that's my understanding...
Looks like it's an issue with Sigil 0.9.8. 0.9.13 shows no issues, 0.9.8 complains. So, I think that last page is simply an 'outdated Sigil' problem.
I'd like to be able to update existing page breaks in the ebook to EPUB 3 and then generate a page list from them.
The 1st step is to update the code to proper EPUB 3. See Example 3 for proper coding. An epub:type value should also be added, so:
<span epub:type="pagebreak" id="page24" role="doc-pagebreak" aria-label="24" />
The 2nd step is to generate a page list from these page breaks. A page list should be in the nav.xhtml file and look like Example 1. I've uploaded a few HTML pages to dropbox to show how page breaks can be coded within an ebook.