Aivean / royalroad-downloader

https://royalroad.com book downloader
MIT License
57 stars, 4 forks

[Feature Request] Unique class for chapter tags #22

Closed: ZeeWanderer closed this issue 3 years ago

ZeeWanderer commented 3 years ago

Can you add a unique class to identify all chapter title HTML tags? It should be trivial, from what I understand of the code, but I don't use Scala and don't have an environment set up to make a PR. This change would make converting to other formats easier, primarily the structure detection step. Many fictions don't use the traditional keywords that signify a chapter, so detecting chapters correctly can be a bother. So far the only h1 tags I have seen in the output are the chapter titles, so as long as no fiction has h1 tags in its text, structure detection is still trivial, but better safe than sorry.

Aivean commented 3 years ago

Can you please give an example of the desired output?

Should it be something like:

<h1 class="chapter1">Chapter One</h1>
...
<h1 class="chapter2">Chapter Two</h1>

Or

<h1 class="chapter">Chapter One</h1>
...
<h1 class="chapter">Chapter Two</h1>

It would be even better if you could explain your use case in more detail, i.e. specifically what doesn't work currently and what chapter classes would fix.

ZeeWanderer commented 3 years ago

The second one would be better

<h1 class="chapter">Chapter One</h1>
...
<h1 class="chapter">Chapter Two</h1>

calibre, for example, uses XPath expressions to identify and extract structure. By default, it checks for specific tags like h1 or h2 and specific keywords like chapter and prologue, but the best way is to have a unique attribute that identifies all chapter title entries. Currently, I can just check for h1 and be done with it, but that is not guaranteed to work unless the chapter text is guaranteed not to contain h1 tags. As a plus, with this specific change (class='chapter') structure detection would work out of the box for calibre and for online converters that use it or something similar. (I checked this by manually adding the chapter class to some chapter entries and trying a few top search results for HTML to EPUB converters.) The default XPath expression for this is:

//*[((name()='h1' or name()='h2') and re:test(., '\s*((chapter|book|section|part)\s+)|((prolog|prologue|epilogue)(\s+|$))', 'i')) or @class = 'chapter']
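To make the difference concrete, here is a minimal Python sketch. It is not part of royalroad-downloader or calibre. It compares three ways of finding chapter titles: every h1, the keyword test from the XPath expression quoted above, and the proposed class. The sample HTML and the names `SAMPLE`, `text_of`, and `detect` are made up for illustration, and the sample is kept well-formed so the standard library XML parser can read it.

```python
import re
import xml.etree.ElementTree as ET

# Keyword pattern copied from the XPath expression quoted above.
# re:test with the 'i' flag is approximated by a case-insensitive search.
KEYWORDS = re.compile(
    r"\s*((chapter|book|section|part)\s+)|((prolog|prologue|epilogue)(\s+|$))",
    re.IGNORECASE,
)

# Hypothetical downloader output (an assumption, not real output of the tool).
SAMPLE = """<html><body>
<h1 class="chapter">Chapter One</h1>
<p>Some text.</p>
<h1 class="chapter">The Long Road</h1>
<p>More text.</p>
<h1>A heading the author wrote inside the story</h1>
</body></html>"""


def text_of(el):
    """Concatenate all text inside an element."""
    return "".join(el.itertext())


def detect(root):
    """Return heading texts found by each of the three strategies."""
    headings = [el for el in root.iter() if el.tag in ("h1", "h2")]
    # 1. Every h1 counts as a chapter title.
    by_tag = [text_of(el) for el in root.iter("h1")]
    # 2. h1/h2 whose text matches the keyword pattern.
    by_keyword = [text_of(el) for el in headings if KEYWORDS.search(text_of(el))]
    # 3. Any element carrying class="chapter".
    by_class = [text_of(el) for el in root.findall(".//*[@class='chapter']")]
    return by_tag, by_keyword, by_class


root = ET.fromstring(SAMPLE)
by_tag, by_keyword, by_class = detect(root)
print("all h1:  ", by_tag)
print("keywords:", by_keyword)
print("class:   ", by_class)
```

In this sample the h1 check also picks up the heading inside the story text, the keyword test misses "The Long Road" because it contains none of the keywords, and only the class lookup returns exactly the two chapter titles.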

Aivean commented 3 years ago

Thank you for the explanation. I'll add the "chapter" class as soon as I have time.