BCLibCoop / nnels-a11y-publishing

GNU Lesser General Public License v3.0

Cleaner script damages EPUB #19

Open zwettemaan opened 5 years ago

zwettemaan commented 5 years ago

@LauraB7 Please provide me with the EPUB that got mangled by the Cleaner script, so I can investigate...

LauraB7 commented 5 years ago

Sure thing, @zwettemaan. This zip file has the raw EPUB, and the post-Cleaner EPUB. Archive.zip

zwettemaan commented 5 years ago

Kewl, thanks!

zwettemaan commented 5 years ago

Hi @LauraB7 - what am I looking at? When I compare the two they seem to be identical (as they should be)? If there is nothing wrong with the headers, the cleaner will leave the file alone.

Can you elaborate on what exactly was wrong after you ran the Cleaner script?

LauraB7 commented 5 years ago

@zwettemaan: I messed that attachment up. My work laptop doesn't have a tonne of memory so I had already trashed the affected EPUB. I will try to recreate the problem tomorrow, if you still need it.

zwettemaan commented 5 years ago

Yes, please. Any sign of odd behavior needs to be investigated. Very often, things work fine on my examples and my workstation, but that means nothing: there are a lot of factors I cannot control that might make the scripts misbehave. By running it in all kinds of different environments we can try and make it all more robust.

When working with @flittle8 we've experienced first-hand how seemingly innocuous things like a slightly older Mac OS X version or a slightly older Sigil version can throw major spanners in the works.

Hence, yes, please: try to re-create it.

LauraB7 commented 5 years ago

Will do. I will post it Friday morning.

zwettemaan commented 5 years ago

Ha. While using some of Farrah's files, I found some issues that might have been what you saw. Try the latest version - the issue might be fixed...

https://github.com/BCLibCoop/nnels-a11y-publishing/tree/master/ReleaseVersions

LauraB7 commented 5 years ago

So far as I can tell, @zwettemaan, the Cleaner script still does something to the declaration. I am attaching here the pre- and post-Cleaner EPUBs for you to have a look at.

Archive.zip

zwettemaan commented 5 years ago

Hi @LauraB7, I think that's 'as designed', which is not the same as 'sensible' :-( It means I thought it might be a good idea, but I just made that up.

Cleaner will add or reset the headers to a standard header which comes from a replacement instruction in the GREP:

https://github.com/BCLibCoop/nnels-a11y-publishing/blob/master/DropScripts/Cleaner/Cleaner.config.txt

If you don't want the enforced HTML header, you could change to the following config (untested) instead:

{
    "replacements": [
        {
            // Strip old headers
            "from": "~(((\\s*<![^>]*>)|(\\s*<\\?[^>]*\\?>))+\\s*)~si",
            "to": ""
        },
        {
            // Add new headers
            "from": "~\\s*([\\s\\S]*\\S)\\s*~si",
            "to": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n$1"
        }
    ]
}

Essentially, the Cleaner script and the MakeBreaksConform script are exactly the same script, just with a different config: each one runs a sequence of find-replace operations.

By adding or removing search-and-replace patterns we can make them do more or less...

We could have a whole bunch of these scripts, targeting different issues...
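
To make the "sequence of find-replace operations" idea concrete, here is a minimal Python sketch of how such a config could be applied. It is not the actual Cleaner implementation: the function names (`compile_delimited`, `to_python_template`, `apply_replacements`) are hypothetical, and it assumes the patterns use PCRE-style delimiters with trailing flags (`~...~si`) and `$1`-style group references, as in the config above.

```python
import re

# Assumed mapping from PCRE-style trailing flag letters to Python re flags.
FLAG_MAP = {"s": re.DOTALL, "i": re.IGNORECASE, "m": re.MULTILINE, "x": re.VERBOSE}


def compile_delimited(delimited):
    """Compile a '~pattern~flags' string; raises KeyError on an unmapped flag."""
    delim = delimited[0]
    end = delimited.rindex(delim)
    body, flag_chars = delimited[1:end], delimited[end + 1:]
    flags = 0
    for ch in flag_chars:
        flags |= FLAG_MAP[ch]
    return re.compile(body, flags)


def to_python_template(replacement):
    """Turn '$1'-style group references into Python's '\\g<1>' form."""
    escaped = replacement.replace("\\", "\\\\")
    return re.sub(r"\$(\d+)", r"\\g<\1>", escaped)


def apply_replacements(text, replacements):
    """Run every from/to pair over the text, in order."""
    for rule in replacements:
        pattern = compile_delimited(rule["from"])
        text = pattern.sub(to_python_template(rule["to"]), text)
    return text


# The two rules from the config above, written as Python literals.
REPLACEMENTS = [
    {   # Strip old headers
        "from": r"~(((\s*<![^>]*>)|(\s*<\?[^>]*\?>))+\s*)~si",
        "to": "",
    },
    {   # Add new headers
        "from": r"~\s*([\s\S]*\S)\s*~si",
        "to": '<?xml version="1.0" encoding="utf-8"?>\n'
              '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
              '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n$1',
    },
]

sample = '<?xml version="1.0"?>\n<!DOCTYPE html>\n<html><body/></html>\n'
cleaned = apply_replacements(sample, REPLACEMENTS)
```

Because the rules run in order, the second one always sees text whose old declarations have already been stripped, so running the whole sequence twice should give the same result as running it once. Dropping the second rule from the list would leave the file without any enforced header.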