BCLibCoop / nnels-a11y-publishing

GNU Lesser General Public License v3.0

Page breaks: Update to EPUB 3 code and generate a page list #18

Open flittle8 opened 5 years ago

flittle8 commented 5 years ago

I'd like to be able to update existing page breaks in the ebook to EPUB 3 and then generate a page list from them.

The 1st step is to update the code to proper EPUB 3. See Example 3 for proper coding. An epub:type value should also be added, so:

<span epub:type="pagebreak" id="page24" role="doc-pagebreak" aria-label="24" />

The 2nd step is to generate a page list from these page breaks. A page list should be in the nav.xhtml file and look like Example 1.
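As a rough illustration of the 2nd step, here is a minimal Python sketch that collects page-break spans from XHTML text and wraps them in an EPUB 3 page-list nav. The function names, the regex-based attribute parsing, and the example href are assumptions made for illustration; this is not the DropScript code, and the exact markup of Example 1 may differ.

```python
import re
from html import escape

SPAN_TAG = re.compile(r"<span\b[^<>]*>")
ATTR = re.compile(r'([\w:-]+)="([^"]*)"')


def find_page_breaks(xhtml):
    """Return (id, label) pairs for spans marked epub:type="pagebreak"."""
    breaks = []
    for tag in SPAN_TAG.findall(xhtml):
        attrs = dict(ATTR.findall(tag))
        if attrs.get("epub:type") == "pagebreak" and "id" in attrs:
            # Fall back to the id when no aria-label is present.
            breaks.append((attrs["id"], attrs.get("aria-label", attrs["id"])))
    return breaks


def build_page_list(files):
    """files: list of (href, xhtml_text) pairs, assumed to be in reading order."""
    lines = ['<nav epub:type="page-list">', "  <ol>"]
    for href, xhtml in files:
        for anchor, label in find_page_breaks(xhtml):
            target = escape(href) + "#" + escape(anchor)
            lines.append(f'    <li><a href="{target}">{escape(label)}</a></li>')
    lines += ["  </ol>", "</nav>"]
    return "\n".join(lines)
```

The caller has to supply the files in reading order; the sketch does not read the OPF spine.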

I've uploaded a few HTML pages to dropbox to show how page breaks can be coded within an ebook.

flittle8 commented 5 years ago

@zwettemaan do you need more details about this one from me?

zwettemaan commented 5 years ago

Ok, I added a new DropScript called 'MakeBreaksConform'. This does a search-and-replace for <a> tags of a very specific shape and replaces them with spans.

See

https://github.com/BCLibCoop/nnels-a11y-publishing/blob/master/DropScripts/MakeBreaksConform/MakeBreaksConform.ReadMe.md

I'll now make another DropScript to build the page list.

zwettemaan commented 5 years ago

Hi @flittle8, I think I got something working. It seems to work OK on El Capitan too.

There are two parts to this:

MakeBreaksConform performs multiple search-and-replace passes and is meant to 'massage' the page breaks into a consistent, conformant form.

https://github.com/BCLibCoop/nnels-a11y-publishing/blob/master/DropScripts/MakeBreaksConform/MakeBreaksConform.ReadMe.md

These might need some further tweaking, but they do work on the sample files. See the config file:

https://github.com/BCLibCoop/nnels-a11y-publishing/blob/master/DropScripts/MakeBreaksConform/MakeBreaksConform.config.txt

If you look at the config file, I've documented the successive steps. Each search and replace is performed on the whole HTML file, and then the next search and replace runs on the result...

1) convert any <a....></a> into <a.../> notation (collapse empty a tags)
2) convert any <a.../> into <a... /> (add a space before the slash).
3) convert any <span....></span> into <span.../> notation (collapse empty span tags)
4) convert any <span.../> into <span... /> (add a space before the slash).
5) convert any a tags that have an ID that indicates them being a page break into span tags and add 'epub:type="pagebreak"' to these span tags
6) find any span tags that already have an aria-label and rename them to Span (uppercase 'S'). This takes them out of the running for further span-matches in my list
7) add an aria-label to any span tags that have an ID that indicates them being a page break
8) find any span tags that already have a role and rename them to SPan (uppercase 'SP'). This takes them out of the running for further span-matches in my list
9) Insert a role attribute in any page break spans that are not 'Span' or 'SPan' yet.
10) Finally, rename SPan and Span back to span, so all are now back to lowercase
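The chain of steps above can be sketched in Python as an ordered list of regex rules, each applied to the whole file before the next one runs. The patterns below are hypothetical stand-ins, not the ones in MakeBreaksConform.config.txt, and they assume page-break IDs of the form id="pageN" with arabic or lowercase roman numbers.

```python
import re

PAGE_ID = r'\sid="page([0-9ivxlcdm]+)"'

# Ordered (pattern, replacement) rules; illustrative only, not the real config.
RULES = [
    # Steps 1 and 3: collapse empty <a ...></a> and <span ...></span> pairs.
    (r"<(a|span)\b([^<>]*?)\s*></\1>", r"<\1\2/>"),
    # Steps 2 and 4: ensure exactly one space before the closing slash.
    (r"<(a|span)\b([^<>]*?)\s*/>", r"<\1\2 />"),
    # Step 5: page-break <a> tags become spans carrying epub:type.
    (r"<a\b([^<>]*" + PAGE_ID + r"[^<>]*?) />", r'<span epub:type="pagebreak"\1 />'),
    # Step 6: park spans that already have an aria-label as 'Span'.
    (r"<span\b(?=[^<>]*\saria-label=)", "<Span"),
    # Step 7: add an aria-label taken from the page number in the id.
    (r"<span\b([^<>]*" + PAGE_ID + r"[^<>]*?) />", r'<span\1 aria-label="\2" />'),
    # Step 8: park spans that already have a role as 'SPan'.
    (r"<span\b(?=[^<>]*\srole=)", "<SPan"),
    # Step 9: add a role to the page-break spans that are still lowercase.
    (r"<span\b([^<>]*" + PAGE_ID + r"[^<>]*?) />", r'<span\1 role="doc-pagebreak" />'),
    # Step 10: rename Span and SPan back to span.
    (r"<S[Pp]an\b", "<span"),
]


def make_breaks_conform(html):
    """Apply every rule in order; each rule sees the previous rule's output."""
    for pattern, replacement in RULES:
        html = re.sub(pattern, replacement, html)
    return html
```

The temporary renaming to 'Span' and 'SPan' is what keeps a later rule from adding a second aria-label or role to a tag that already has one.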

The second part is

https://github.com/BCLibCoop/nnels-a11y-publishing/blob/master/DropScripts/SearchAndReport/SearchAndReport.ReadMe.md

It will generate a list of all conformant page breaks. The Completion Report now has a zone for reporting, so you can copy-paste the results into a text editor or so.

flittle8 commented 5 years ago

@zwettemaan Thanks. Should I test this out as is, or wait till it's combined into a single script?

zwettemaan commented 5 years ago

I'd prefer to keep it as two separate scripts. The first one is 'modifying' - i.e. it modifies the EPUB, and the output must be verified by human eyeballs. It's just a bunch of mindless GREP statements, and those cannot be trusted on their own without some oversight.

The second script is 'reporting'. It does not modify the EPUB at all, so you can run it over and over again and not worry about what it might have done to the EPUB.

flittle8 commented 5 years ago

@zwettemaan Ok, I'll test it out. I assume the 2nd script will insert the identified page breaks into the NAV as a page list?

zwettemaan commented 5 years ago

@flittle8 No, sorry, I did not get that far. The output is shown in the window. From there you'll need to copy-paste that into Sublime or so and do some find/replace.

flittle8 commented 5 years ago

@zwettemaan a couple things I noticed:

The MakeBreaksConform script generated this: <span role="doc-pagebreak" aria-label="72" epub:type="pagebreak" id="page72" title="72"></span>

which should be this instead:

<span role="doc-pagebreak" aria-label="72" epub:type="pagebreak" id="page72" title="72" />

For the SearchAndReport script results, I was expecting content that I could paste into a Page List in the NAV. Could you generate something like this:

<li><a href="../Text/KidsInTheHall_ISTC_int_ebook-34.xhtml#pagei">i</a></li>
<li><a href="../Text/KidsInTheHall_ISTC_int_ebook-34.xhtml#pageii">ii</a></li>
<li><a href="../Text/KidsInTheHall_ISTC_int_ebook-34.xhtml#pageiii">iii</a></li>
<li><a href="../Text/KidsInTheHall_ISTC_int_ebook-1.xhtml#pageiv">iv</a></li>

instead of:

KidsInTheHall_ISTC_int_ebook-34.xhtml: i
KidsInTheHall_ISTC_int_ebook-34.xhtml: ii
KidsInTheHall_ISTC_int_ebook-34.xhtml: iii
KidsInTheHall_ISTC_int_ebook-5.xhtml: viii

The results of this script displayed out of order which makes it more difficult to utilize. Could you display them in order of occurrence in the book?

KidsInTheHall_ISTC_int_ebook-31.xhtml: 332
KidsInTheHall_ISTC_int_ebook-31.xhtml: 333
KidsInTheHall_ISTC_int_ebook-31.xhtml: 334
KidsInTheHall_ISTC_int_ebook-31.xhtml: 335
KidsInTheHall_ISTC_int_ebook-32.xhtml: 337
KidsInTheHall_ISTC_int_ebook-34.xhtml: i
KidsInTheHall_ISTC_int_ebook-34.xhtml: ii
KidsInTheHall_ISTC_int_ebook-34.xhtml: iii
KidsInTheHall_ISTC_int_ebook-5.xhtml: viii
KidsInTheHall_ISTC_int_ebook-5.xhtml: ix
KidsInTheHall_ISTC_int_ebook-6.xhtml: xi
KidsInTheHall_ISTC_int_ebook-6.xhtml: xii
KidsInTheHall_ISTC_int_ebook-6.xhtml: xiii
KidsInTheHall_ISTC_int_ebook-7.xhtml: 1
KidsInTheHall_ISTC_int_ebook-7.xhtml: 2
KidsInTheHall_ISTC_int_ebook-7.xhtml: 3
KidsInTheHall_ISTC_int_ebook-7.xhtml: 4
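
The conversion requested above, from "file: label" report lines to list items, could look roughly like this in Python. The function name, the "../Text/" prefix, and the "page" ID prefix are assumptions taken from the sample lines in this comment, not from the SearchAndReport script.

```python
from html import escape


def page_list_items(report_lines, text_dir="../Text/", id_prefix="page"):
    """Turn 'file.xhtml: label' report lines into <li> entries for the page list."""
    items = []
    for line in report_lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last ': ' so the label is whatever follows the file name.
        filename, label = line.rsplit(": ", 1)
        href = f"{text_dir}{filename}#{id_prefix}{label}"
        items.append(f'<li><a href="{escape(href)}">{escape(label)}</a></li>')
    return items
```

The items come out in the same order as the input lines, so any ordering has to happen before this step.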

zwettemaan commented 5 years ago

Hi Farrah, can you provide a sample? I've put in GREP that was supposed to make <span />. If that's not happening, I need to see samples to figure out why that is.

Outputting a tag list is not that hard - I can probably add that without spending too much time. But ordering the entries is not simple, for a number of reasons, and it would take me multiple hours of work to get done. The first reason is that the scripts currently work 'per-file', but to order the entries we need to look across all files. That is a feature I need to add to DropToScript; right now the files are processed in random order. The second reason is that ordering by roman numerals is not straightforward: I need to temporarily convert them to decimal in order to sort.
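
The roman-numeral part of the sorting problem can be sketched in Python as a sort key. This is an assumption-laden sketch: it places roman-numbered pages before arabic-numbered ones (typical for front matter), and it does not replace ordering by the book's actual reading order across files.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a roman numeral such as 'xiv' to its decimal value (14)."""
    total = 0
    largest = 0
    # Walk right to left: a symbol smaller than one already seen is subtracted.
    for ch in reversed(numeral.lower()):
        value = ROMAN_VALUES[ch]
        if value < largest:
            total -= value
        else:
            total += value
            largest = value
    return total


def page_sort_key(label):
    """Sort key: roman labels first, then arabic labels, unknown labels last."""
    if label.isdigit():
        return (1, int(label))
    if re.fullmatch(r"[ivxlcdm]+", label, re.IGNORECASE):
        return (0, roman_to_int(label))
    return (2, 0)
```

For example, sorted(labels, key=page_sort_key) puts 'viii' and 'ix' ahead of '1' and '2'.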

flittle8 commented 5 years ago

@zwettemaan I can mail you this EPUB as well. It's strange: when I open the output EPUB in Sigil I see something like this: <span role="doc-pagebreak" aria-label="133" epub:type="pagebreak" id="page133" title="133"></span>. When I unzip the EPUB and open up an individual HTML file, I see this: <span role="doc-pagebreak" aria-label="133" epub:type="pagebreak" id="page133" title="133" />

flittle8 commented 5 years ago

Outputting a tag list is not that hard - I can probably add that without spending too much time.

Thanks!

zwettemaan commented 5 years ago

I've noticed that Sigil often 'lies': it often shows tags differently than what the real tag in the file is. It does the same with the headers: it seems to show headers that are not there. Some of the first files you sent me (forgot which) were different when unzipped vs. Sigil. I think Sigil actually runs the HTML through some process before showing it.

flittle8 commented 5 years ago

Yes. Can you get it to work on Sigil?

zwettemaan commented 5 years ago

I am not sure what you mean? Sigil does what Sigil does. I think it is trying to be helpful, in case you want to type something in the span.

Try the following: open an EPUB in Oxygen (or decompress it). Put in a <span />.

Save (or recompress).

Open in Sigil. It will show <span></span>.

Close without saving and re-open in Oxygen. It will show <span />.

Oxygen shows you the truth - i.e. what is in the file (and what the scripts will process).

Sigil does not show you what is in the file. It pre-processes the file before letting you edit, so what you see in Sigil is not what is in the underlying file. That's how Sigil works. Once you save from Sigil, the span will be converted to an open-close span.

You can re-run the MakeBreaksConform script to undo the damage caused by Sigil, but as soon as you resave from Sigil, it will re-do the damage.

I cannot change the fact that Sigil does not respect the reality of the file. That's how Sigil was made.
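
One way to see what is really in the file, without relying on any editor's view, is to read the EPUB as a zip archive and count the two span forms directly. The following Python sketch does that; the function name and the file-extension filter are assumptions for illustration.

```python
import re
import zipfile

SELF_CLOSED = re.compile(r"<span\b[^<>]*/>")
EMPTY_PAIR = re.compile(r"<span\b[^<>]*></span>")


def count_span_forms(epub):
    """Map each (X)HTML file in the EPUB to (self-closed spans, empty open-close spans).

    `epub` may be a path or a binary file-like object; the archive is only read.
    """
    counts = {}
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            counts[name] = (len(SELF_CLOSED.findall(text)), len(EMPTY_PAIR.findall(text)))
    return counts
```

Running this before and after a save from Sigil would show whether the spans were rewritten on disk.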

flittle8 commented 5 years ago

@zwettemaan I ran this script on another EPUB and I get the message below. It seems to be that header issue, but for this particular page only. The same error occurs when I open up the EPUB after running the LangTag script.

[Screenshot: file-note-wellformed-copyhtml-makebreaksconform]

flittle8 commented 5 years ago

Also, the lang="en" wasn't inserted in the HTML for this page (#17).

zwettemaan commented 5 years ago

@flittle8 Can you send me the file? I cannot see the XHTML for the page copy.html in your screenshot, so I have no idea what might be wrong. The page shown in your screenshot is cover.html, not copy.html

flittle8 commented 5 years ago

Let me know if you still need the file.

[Screenshot: copyhtml]

zwettemaan commented 5 years ago

Yeah, I'll need the file. I don't see anything obviously wrong.

zwettemaan commented 5 years ago

Hmm... According to epubcheck it's the 'epub:...' prefix in the spans. If I adjust the GREP patterns by adding an additional entry

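{
    /*
     * Eliminate epub:type.
     * No trailing \b after the '=': the next character is a quote,
     * so a word boundary there would never match.
     */
    "from": "~\\bepub:type=~i",
    "to": "data-epub-type="
}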

which converts any epub:type="... into data-epub-type="... then Sigil won't complain.

Isn't that epub:... an EPUB 3 thing, which causes problems in an EPUB 2 file?

flittle8 commented 5 years ago

@zwettemaan Hmm. ARIA, like role="doc-pagebreak", is an EPUB 3 thing. epub:type semantics are fine with EPUB 2 & 3 (as far as I know...)

zwettemaan commented 5 years ago

This is what epubcheck says:

winnie:epubcheck-4.2.0 kris$ java -jar epubcheck.jar /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub 
Validating using EPUB version 2.0.1 rules.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/halftitle.html(9,99): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/halftitle.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/title.html(9,109): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/title.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/copy.html(9,109): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/copy.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/dedi.html(9,105): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/dedi.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/contents.html(9,112): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/contents.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/illu.html(9,110): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/illu.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/preface.html(9,109): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/preface.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/part01.html(9,107): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/part01.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch01.html(9,105): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch01.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch02.html(9,107): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch02.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/part02.html(9,109): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/part02.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch03.html(9,107): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch03.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch04.html(9,106): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ch04.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/coda.html(9,108): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/coda.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/notes.html(9,110): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/notes.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/bib.html(9,108): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/bib.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
FATAL(RSC-016): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ack.html(9,108): Fatal Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.
ERROR(RSC-005): /Users/kris/Desktop/Wittgensteins_Ethics_and_Modern_19050713545474_LanginHTMLbutwanttoreplace.epub/ops/xhtml/ack.html(-1,-1): Error while parsing file: The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound.

Check finished with errors
Messages: 17 fatals / 17 errors / 0 warnings / 0 infos

EPUBCheck completed
winnie:epubcheck-4.2.0 kris$ 

zwettemaan commented 5 years ago

The prefix "epub" for attribute "epub:type" associated with an element type "span" is not bound

So, something is needed. Maybe a namespace definition that should be added to the header?

flittle8 commented 5 years ago

Hmm. I ran other EPUB 2s through this script without issue.

flittle8 commented 5 years ago

Maybe you need to add xmlns:epub="http://www.idpf.org/2007/ops" to the html tag?

zwettemaan commented 5 years ago

That's what Cleaner.php does. Try running it through Cleaner. That fixes it for me (it enforces an html tag with that namespace included).
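
For reference, binding the prefix amounts to adding the namespace declaration to the opening html tag. Here is a minimal Python sketch of that idea; it is not the Cleaner.php implementation, and the function name is hypothetical. It only addresses the "prefix is not bound" parse error, not whether epub:type is otherwise accepted under EPUB 2 rules.

```python
import re

EPUB_NS = 'xmlns:epub="http://www.idpf.org/2007/ops"'


def bind_epub_prefix(xhtml):
    """Add the xmlns:epub declaration to the first <html ...> tag if it is missing."""
    def add_ns(match):
        tag = match.group(0)
        if "xmlns:epub=" in tag:
            return tag  # already bound, leave the tag untouched
        return tag[:-1].rstrip() + " " + EPUB_NS + ">"

    return re.sub(r"<html\b[^>]*>", add_ns, xhtml, count=1)
```

Calling the function a second time leaves the tag unchanged, so it is safe to run over files that already declare the namespace.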

flittle8 commented 5 years ago

I ran it through Cleaner.php but it didn't fix the issue...

flittle8 commented 5 years ago

it's just that one HTML file, and the other pages are all okay it seems

flittle8 commented 5 years ago

epubcheck also doesn't like the aria-label on the page breaks, I guess because it's an EPUB 2... I think anyone using this script should ensure they've updated their EPUB to v3 before running it?

zwettemaan commented 5 years ago

Yes, that's my understanding...

zwettemaan commented 5 years ago

Looks like it's an issue with Sigil 0.9.8. 0.9.13 shows no issues, 0.9.8 complains. So, I think that last page is simply an 'outdated Sigil' problem.