BCLibCoop / nnels-a11y-publishing

GNU Lesser General Public License v3.0

Add a meaningful title to the HTML file #2

Closed: zwettemaan closed this issue 5 years ago

flittle8 commented 5 years ago

Goal:

  1. Take the <h1> in a file and move it into the <title> element, replacing any text that currently exists in the <title> element.

LauraB7 commented 5 years ago

InDesign's reflex is to populate the <title> element in each HTML file with an iteration of the InDesign file name. That is not meaningful because it does not describe the content (e.g., "Chapter Two").

zwettemaan commented 5 years ago

@LauraB7, @flittle8 can you attach some sample files to this GitHub issue? How do you see this working? I can set up a drag/drop tool that would execute this kind of operation, but what would be most useful?

E.g. I can make a drag/drop tool which you'd drag an HTML file on, and it'd find the <h1> and copy it into the <title>
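The single-file operation described above can be sketched roughly as follows. This is an illustration in Python using only the standard library, not the PHP code of the actual tool. The function name `copy_h1_to_title` is hypothetical, and the sketch assumes well-formed XHTML in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def copy_h1_to_title(xhtml):
    """Return the document with the text of the first <h1> copied into <title>.

    Hypothetical helper for illustration only. Note that ElementTree does not
    keep the XML declaration or DOCTYPE when serializing.
    """
    ET.register_namespace("", XHTML_NS)  # serialize without "ns0:" prefixes
    root = ET.fromstring(xhtml)
    ns = {"x": XHTML_NS}
    h1 = root.find(".//x:h1", ns)
    title = root.find("./x:head/x:title", ns)
    if h1 is None or title is None:
        return xhtml  # nothing to do, leave the input untouched
    # itertext() flattens nested tags such as <span>; split/join collapses whitespace
    text = " ".join("".join(h1.itertext()).split())
    if not text:
        return xhtml
    for child in list(title):
        title.remove(child)
    title.text = text
    return ET.tostring(root, encoding="unicode")
```

Returning the input unchanged when no <h1> or <title> is found keeps the sketch from guessing at a title.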

Or I could make a drag/drop tool which you'd drag an EPUB on, and it'd find all HTML inside and perform this operation. This is more effort than the previous option.

I envision you'd probably use this on an 'exploded' EPUB, and use it to manage individual HTML files.
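For the whole-EPUB variant, the first step is finding the HTML files inside the archive. A minimal Python sketch, assuming the EPUB is an ordinary ZIP container and that content files end in .html, .xhtml or .htm (the function name and the suffix list are assumptions for illustration):

```python
import zipfile

# Assumed set of suffixes for content documents; adjust to the EPUB at hand.
HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def html_members(source):
    """List the HTML/XHTML entries of an EPUB given as a path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(HTML_SUFFIXES)
        )
```

On an 'exploded' EPUB the same filter could be applied to a directory walk instead of a ZIP listing.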

Would it be better to make the tool show a dialog for each HTML file to confirm the title it has selected? That way you can click OK or first edit the title, then click OK.

LauraB7 commented 5 years ago

A drag-and-drop tool would definitely do the trick. The key is to wipe out what InDesign populates that field with; i.e., FileName-1, FileName-2, etc. See the attached rough EPUB export for a sample.

9781487006730_EPUB.epub.zip

zwettemaan commented 5 years ago

I've made a first stab at this. It's not satisfactory, but it has some functionality.

To try out on a Mac (I'll document this soon and will then also cover Windows and Linux):

1) Download DropToScript from

https://github.com/BCLibCoop/nnels-a11y-publishing/raw/master/ReleaseVersions/DropToScript.1.0.0.zip

2) Decompress and pick the right version (Mac, Windows, Linux)

3) Go into the folder for your platform, where you see the DropToScript application icon

4) Create a new empty folder 'DropScripts'

5) Right-click and save the following file into this DropScripts folder:

https://raw.githubusercontent.com/BCLibCoop/nnels-a11y-publishing/master/ReleaseVersions/DropScripts/AutoTitle/AutoTitle.php

E.g. on Mac:

Screen Shot 2019-04-15 at 10 18 33 PM

6) Drag/drop a bunch of XHTML files on the DropToScript app icon.

zwettemaan commented 5 years ago

More info here:

https://github.com/BCLibCoop/nnels-a11y-publishing/wiki/DropToScript-Documentation

The current AutoTitle.php script is not finished; you'll get some error dialogs on some files, but it is already somewhat useful.

Issues that I found are:

To fix this, I am planning on the following changes:

@LauraB7 I'll do some videos later, once it's all more finished.

zwettemaan commented 5 years ago

Ok, the script should now be smart enough to be useful. Please see the docs

https://github.com/BCLibCoop/nnels-a11y-publishing/blob/master/DropScripts/AutoTitle/AutoTitle.ReadMe.md

https://github.com/BCLibCoop/nnels-a11y-publishing/wiki/DropToScript-Documentation

Quick and dirty video here (3 parts 'DropToScript...')

https://app.box.com/folder/72916582816

zwettemaan commented 5 years ago

Please have a look at:

https://github.com/BCLibCoop/nnels-a11y-publishing/wiki/AutoTitle-Documentation

https://github.com/BCLibCoop/nnels-a11y-publishing/wiki/DropToScript-Documentation

https://github.com/BCLibCoop/nnels-a11y-publishing/wiki/DropScript-Template-Documentation

flittle8 commented 5 years ago

@zwettemaan I downloaded the latest DropToScript release and ran an EPUB through AutoTitle; however, it didn't work (it didn't replace the title element with the top-level headings; no changes were made). GitHub won't let me attach the HTML file here (unsupported file type), so I've uploaded it to Dropbox.

zwettemaan commented 5 years ago

Hi Farrah,

Have you tried setting

"forcedReplaceTitle": 0

in the config.txt file for AutoTitle?

To attach an HTML file to a GitHub issue, you first have to zip (compress) it. Right-click it on your Mac, and select 'Compress file...'.

zwettemaan commented 5 years ago

In the meantime, I'll have a look at the file... I might have broken something when I added support for whole EPUB files.

flittle8 commented 5 years ago

@zwettemaan yes, the "forcedReplaceTitle": 0 is already there

flittle8 commented 5 years ago

good to know about zipping HTML files, thanks :)

zwettemaan commented 5 years ago

Oops. I meant to say: try setting forcedReplaceTitle to 1. That disables the test for the file name vs the title.
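How such a flag might gate the replacement can be sketched as below. This is a guess at the shape of the logic, written in Python for illustration. The actual file-name-vs-title test in AutoTitle.php is not shown in this thread, so the comparison rule here (title equals the file's base name, ignoring case and punctuation) is an assumption, as is the function name.

```python
import re
from pathlib import PurePath


def should_replace_title(title, file_name, forced_replace_title=0):
    """Decide whether an existing <title> should be overwritten.

    Assumed rule: replace when the title is empty or merely repeats the file's
    base name (e.g. "FileName-1" for "FileName-1.xhtml"). A truthy
    forced_replace_title skips that test entirely.
    """
    if forced_replace_title:
        return True

    def normalize(text):
        # Compare on letters and digits only, case-insensitively
        return re.sub(r"[^a-z0-9]+", "", text.lower())

    stem = PurePath(file_name).stem
    normalized_title = normalize(title)
    return normalized_title == "" or normalized_title == normalize(stem)
```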

flittle8 commented 5 years ago

@zwettemaan ok, I changed it to 1 and re-ran it. Still no luck. It didn't work on either of these files: 41_Chapter_34.html.zip, chapter06.xhtml.zip

zwettemaan commented 5 years ago

Cool, thanks for trying. I'll look into it ASAP...

flittle8 commented 5 years ago

@zwettemaan I think we need an option here to move the first heading in the file (or first p tag content) into the <title> regardless of filenames. Is that what setting the forcedReplaceTitle to 1 is supposed to allow for?

flittle8 commented 5 years ago

@zwettemaan also, on a related note, if you're taking the content of the first p tag we should ensure there is actually content in the p tag. if it's empty then skip to the 1st p tag with text in it. i've noticed that sometimes ebooks have empty p tags at the beginning of the page (I assume the publisher's way of adding extra spacing)
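The "skip empty p tags" rule suggested here could look roughly like this in Python. It is a sketch only: the function name is hypothetical, and it assumes a well-formed fragment without an XML namespace (a namespaced document would need qualified tag names).

```python
import xml.etree.ElementTree as ET


def first_nonempty_text(fragment, tag="p"):
    """Return the text of the first <tag> element that contains visible text.

    Elements that are empty or hold only whitespace (including non-breaking
    spaces used for vertical spacing) are skipped. Returns None if none qualify.
    """
    root = ET.fromstring(fragment)
    for element in root.iter(tag):
        # Flatten nested tags, then collapse runs of whitespace
        text = " ".join("".join(element.itertext()).split())
        if text:
            return text
    return None
```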

zwettemaan commented 5 years ago

Hi Farrah; I checked the two files and it actually works as designed; it's just that 'as designed' is not very useful in this case. The script needs more thought.

The issue is that both files have nested tags. So, if you edit the config.txt file and set keepTitleSubtags to 1 as well as forcedReplaceTitle, it works, sort of.

Chapter06 finds ' Rain Delay ' and 41_Chapter_34 finds '11'.

Neither substitution looks very useful to me.

At present, I'd say AutoTitle is only useful on InDesign-generated EPUBs. Other EPUBs have too much variability to them, and the odds of the tool picking something silly as a title are too high. E.g. in 41_Chapter_34, 11 is a logical choice for a dumb script that has no comprehension of the text in the document.

zwettemaan commented 5 years ago

The script does not know any better.

zwettemaan commented 5 years ago

> Is that what setting the forcedReplaceTitle to 1 is supposed to allow for?

Yes

zwettemaan commented 5 years ago

See

https://github.com/BCLibCoop/nnels-a11y-publishing/blob/master/DropScripts/AutoTitle/AutoTitle.ReadMe.md

zwettemaan commented 5 years ago

> also, on a related note, if you're taking the content of the first p tag we should ensure there is actually content in the p tag

Good point. Hadn't considered that yet, but that is easy to fix - I'll add that in.

flittle8 commented 5 years ago

@zwettemaan so I set both keepTitleSubtags and forcedReplaceTitle to 1 and it's not working for me... I can email you the EPUB if that helps.

> The issue is that both files have nested tags. So, if you edit the config.txt file and set keepTitleSubtags to 1 as well as forcedReplaceTitle, it works, sort of.
>
> Chapter06 finds ' Rain Delay ' and 41_Chapter_34 finds '11'.

flittle8 commented 5 years ago

Also, is there a reason why we can't just set the default to ignore nested tags? All tags should be ignored and only text extracted and put into the title

zwettemaan commented 5 years ago

> Also, is there a reason why we can't just set the default to ignore nested tags? All tags should be ignored and only text extracted and put into the title

The only reason was that I did not have much test-material to go on, and if I set that flag, the sample Laura provided me came out mangled because it has chapter numbers prefixed to the titles.

I think I'll probably need to refine this so we can tell it to only suppress very specific CSS classes.

zwettemaan commented 5 years ago

Yes, provide me the EPUB. Just to make sure: you are trying the latest version, right? If not, make sure to download it:

https://github.com/BCLibCoop/nnels-a11y-publishing/tree/master/ReleaseVersions

The latest is 1.0.3_1.0.3

flittle8 commented 5 years ago

@zwettemaan That's fine if both chapter number and chapter title are in the <title> element; in fact I think that's how it should be, i.e. <title>11 Language and Theory</title>. The chapter number and title might be styled differently, but if both are inside a heading tag then both should be put into the title tag

> The only reason was that I did not have much test-material to go on, and if I set that flag, the sample Laura provided me came out mangled because it has chapter numbers prefixed to the titles.
>
> I think I'll probably need to refine this so we can tell it to only suppress very specific CSS classes.

zwettemaan commented 5 years ago

No, it's more complicated than that. I've explained that in the documentation I linked to earlier. If I don't omit the span, the title comes out

iiiMy Chapter

without a space between the title and the number.

So without setting the keepTitleSubtags to 0, Laura's example comes out mangled - i.e. unusable. You'd have to manually go through and add those spaces. The XHTML input from InDesign does not have a space between the number and the title text.
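The mangling described above is easy to reproduce. The Python sketch below is an illustration, not the tool's PHP; the function name and the `inject_space` parameter are made up for this example. It flattens a heading either by plain concatenation or by joining the text pieces with a space.

```python
import xml.etree.ElementTree as ET


def heading_text(fragment, inject_space=False):
    """Flatten a heading's nested tags into plain text.

    With inject_space=False the text pieces are concatenated as-is; with
    inject_space=True a space is placed at every tag boundary.
    """
    element = ET.fromstring(fragment)
    joiner = " " if inject_space else ""
    raw = joiner.join(element.itertext())
    return " ".join(raw.split())  # collapse repeated whitespace


# The chapter number sits in a <span> with no space before the title text
sample = "<h1><span>iii</span>My Chapter</h1>"
print(heading_text(sample))                     # iiiMy Chapter
print(heading_text(sample, inject_space=True))  # iii My Chapter
```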

zwettemaan commented 5 years ago

I don't think a default 'works for all' is possible - i.e. we cannot avoid having some config so the user can adjust to the job at hand.

zwettemaan commented 5 years ago

Hi Farrah, the EPUB seems to work for me. I've made a video of what I do and will email it.

flittle8 commented 5 years ago

@zwettemaan would it be possible to set a rule where you just replace the tags of any nested tags with a space? We can assume that all nested tags are separating words (or #'s and words).

> No, it's more complicated than that. I've explained that in the documentation I linked to earlier. If I don't omit the span, the title comes out
>
> iiiMy Chapter
>
> without a space between the title and the number.
>
> So without setting the keepTitleSubtags to 0, Laura's example comes out mangled - i.e. unusable. You'd have to manually go through and add those spaces. The XHTML input from InDesign does not have a space between the number and the title text.

flittle8 commented 5 years ago

@zwettemaan a suggestion, at first I wasn't sure which generated epub was the new and which the old. to be consistent with how individual files are differentiated, could we rename the old ebook with "_old" at the end?

zwettemaan commented 5 years ago

Yes, that is possible, but I am not sure that will always work. E.g. if the first character of something is formatted in a different font, e.g. 'The Title' and the first 'T' is in a different font or has a different style, then there would be a span around that first T, but that first T should not be separated by a space.

E.g.

<p><span class="fancy">T</span>he Title Is here</p>

What I think I need to do is make it flexible through the config file. There are just too many possibilities to try and make a 'catch all'. It will depend on the ebook - e.g. an ebook out of InDesign will need a different treatment than other ebooks. We could have a number of pre-made config files for the most common cases (e.g. always insert a space, don't insert a space) so for basic users it would be a matter of picking a file. More complex use cases would involve a user manually editing a config file to match their exact situation...
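One way such a config-driven rule could work is to inject a space only after elements carrying specific CSS classes. The Python sketch below is an assumption about a possible design, not the behaviour of AutoTitle.php. The class name "chapter-number" and the `space_after_classes` setting are hypothetical; "fancy" is the class from the example above.

```python
import xml.etree.ElementTree as ET


def title_text(fragment, space_after_classes=frozenset()):
    """Flatten nested tags, adding a space only after configured CSS classes."""
    wanted = set(space_after_classes)
    pieces = []

    def walk(node):
        if node.text:
            pieces.append(node.text)
        for child in node:
            walk(child)
            # Inject a separator only when the child has a configured class
            if wanted & set(child.get("class", "").split()):
                pieces.append(" ")
            if child.tail:
                pieces.append(child.tail)

    walk(ET.fromstring(fragment))
    return " ".join("".join(pieces).split())


config = {"chapter-number"}  # hypothetical per-book setting
print(title_text('<h1><span class="chapter-number">iii</span>My Chapter</h1>', config))
print(title_text('<p><span class="fancy">T</span>he Title Is here</p>', config))
```

With this hypothetical setting the first call yields "iii My Chapter" while the drop-cap example stays "The Title Is here", which is the distinction a blanket "always insert a space" rule cannot make.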

zwettemaan commented 5 years ago

Yes, that makes sense. I need to give DropToScript a bit of an overhaul (to show pages processed etc...) and I'll add that in at the same time.

flittle8 commented 5 years ago

ah yes... sounds good about having some ready-to-go config options

> What I think I need to do is make it flexible through the config file. There are just too many possibilities to try and make a 'catch all'. It will depend on the ebook - e.g. an ebook out of InDesign will need a different treatment than other ebooks. We could have a number of pre-made config files for the most common cases (e.g. always insert a space, don't insert a space) so for basic users it would be a matter of picking a file. More complex use cases would involve a user manually editing a config file to match their exact situation...

zwettemaan commented 5 years ago

Aha. I think I figured it out. El Capitan comes by default with PHP 5.3.8 or so. I found that PHP versions before 5.6 do not support a critical feature that AutoTitle.php uses.

https://www.php.net/manual/en/class.domnode.php#95545

On the PHP I have installed, I can simply do something like

$title->textContent = "some new content";

but for pre-5.6 versions that does not work; it simply silently fails (so none of my normal logging and error trapping is triggered and it simply does not work).

I've now re-written AutoTitle.php to use a workaround that should also work on PHP 5.3.8.

Try the next update...

https://github.com/BCLibCoop/nnels-a11y-publishing/tree/master/ReleaseVersions

The 'inject space' should now also work - i.e. I get '1 Hou Hou le hibou' instead of '1Hou Hou le hibou' with the default AutoTitle.config.txt provided.

zwettemaan commented 5 years ago

@flittle8 Forgot to 'tag' you... I've got another new version to try...

zwettemaan commented 5 years ago

P.S. Note: so the whole idea about accented characters in file paths is moot: it was not that...

flittle8 commented 5 years ago

@zwettemaan great, so the script worked, the <title> element was replaced :)

flittle8 commented 5 years ago

however... something is now happening which messes up the EPUB so that it doesn't render (content of each page won't display). it's related to the declaration at the top of each page. i've attached a video. this happened with all the epub I dropped on to the app. I need to use Sigil to update the declaration in order to render the epub.

dropbox video: https://www.dropbox.com/s/o0tu12t2s6d1yn6/AutoTitle-epub-changed.mov?dl=0

flittle8 commented 5 years ago

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml version="1.0" encoding="UTF-8" standalone="no"?><html xmlns="http://www.w3.org/1999/xhtml">

needs to be updated to:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

flittle8 commented 5 years ago

@zwettemaan regarding the issue with the new individual file being hidden upon generation - that still occurs. I think this might be related to the renaming instead of copying of the file that I mentioned in my email

zwettemaan commented 5 years ago

Ok, that header, I can fix. I am using a standard PHP HTML DOM parser, and it puts in a default header that I need to replace.

The file being hidden, I'll have to do some experiments.

zwettemaan commented 5 years ago

Hmm... I am looking at 007_contents.html (as of yet untouched by the script) and it says:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<?xml version="1.0" encoding="UTF-8"?><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

It's quite probable the PHP code is muddling things up in its own weird and wonderful ways, but at least some of the pages in the sample EPUB are already muddled to start with, I think...

What I'll do is strip the header all together (whatever it is, good or bad), and put it back afterwards.

So, if it is bad to start, it'll come out bad as well, but at least it'll be the same.

zwettemaan commented 5 years ago

@flittle8 Got a new release for you

https://github.com/BCLibCoop/nnels-a11y-publishing/tree/master/ReleaseVersions

Main changes:

- Now uses a copy rather than a rename to make backups. I think you're right, and that this might fix the issue you see.

- I added a separate DropScript 'CleanHeader' which allows you to override the header-area of the HTML files, whatever it may be. This allows you to standardize all HTML files with the same header in one go. The header is configurable in the CleanHeader.config.txt file. It is defaulting to

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
... rest of file ...

- The DropToScript now reports which files it has modified.

- The DropScripts (except for CleanHeader) retain whatever is in the header (i.e. whatever <!DOCTYPE..., <?xml... in whatever sequence is there).

flittle8 commented 5 years ago

@zwettemaan thanks Kris. the new files are no longer hidden, and the AutoTitle is working :)

flittle8 commented 5 years ago

@zwettemaan The AutoTitle script is still causing rendering issues. It is changing the header which is causing problems. I've attached an image that shows the header of the original file (no issues, renders fine) and an image that shows the header of the output file (content doesn't render). I don't think I should have to run a script to solve this since the input file works fine.

dropscript-input-fie_NoError

dropscript-output-file-getError

zwettemaan commented 5 years ago

> I don't think I should have to run a script to solve this since the input file works fine.

I think I did not make myself clear. The reason for the CleanHeader is page 007 in the sample EPUB you sent. Check it out - as far as I can tell, the 007 page has a bad header to start with, before any script has even touched it.

The missing !DOCTYPE is a different issue - I did not mean that you should need to use CleanHeader for that. I completely agree I need to fix AutoTitle to work correctly. That is clear. Please don't misunderstand the reason for existence of CleanHeader.

Please take the original EPUB you sent me, and decompress it. No DropScript should have touched that. Look at file 007. When I do that I see:

Screen Shot 2019-05-15 at 2 11 51 AM

AFAIK this is not caused by DropScript. CleanHeader fixes this.

zwettemaan commented 5 years ago

Can you send me a copy of the file ...0006 from your screenshot, so I can use it as test-material? Thx!
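The "strip the header, process, put it back" approach discussed in this thread can be sketched as follows. This is a Python illustration under assumptions, not the code of the DropScripts: it treats everything before the first `<html` tag as the header, and the function names are hypothetical.

```python
import re


def split_header(document):
    """Split a document into (header, rest) at the first <html ...> tag.

    The header is whatever precedes <html: XML declaration, DOCTYPE, in any
    order, well-formed or not. If no <html tag is found the header is empty.
    """
    match = re.search(r"<html\b", document, flags=re.IGNORECASE)
    if match is None:
        return "", document
    return document[:match.start()], document[match.start():]


def process_preserving_header(document, transform):
    """Apply transform to the <html> part only and re-attach the original header."""
    header, rest = split_header(document)
    return header + transform(rest)
```

Because the header is copied back byte for byte, a file that starts with a bad header still ends with the same bad header, which matches the "at least it'll be the same" behaviour described above.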