HTML entities (and tags) in chapter titles

sime1 commented 4 years ago

At the moment, when generating the inline TOC, chapter titles are stripped of HTML tags using the html2text crate. However, this makes it impossible to use certain titles (e.g. we can't have a chapter named "the <code> tag") and creates an invalid "toc.ncx". Moreover, using html2text to escape the title can turn valid titles into invalid ones; this happens when the title contains HTML entities (e.g. &amp;), which get converted into the corresponding character.

The solution could be to use the html_escape crate to escape the title, either after html2text or in place of it. If we replace html2text, there would be no limitations on chapter titles; however, the behavior would differ from before for titles that contain HTML tags. Using both crates would limit the possible titles, but would still fix the invalid TOC and maintain the current behavior.
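To make the "replace html2text" option concrete, here is a minimal std-only sketch of what escaping a title as plain text means. The helper name `escape_title` is hypothetical and is not part of epub-builder; as far as I know, the html_escape crate offers `encode_text` for this purpose, and a real patch would presumably call that instead of hand-rolling the loop.

```rust
/// Hypothetical helper (not epub-builder's API): escape a chapter title
/// so it can be embedded as text in XML/XHTML such as toc.ncx.
/// Only the three characters that are significant in text content
/// are replaced; everything else is copied through unchanged.
fn escape_title(title: &str) -> String {
    let mut out = String::with_capacity(title.len());
    for c in title.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}
```

With this approach a title such as "the <code> tag" survives as literal text instead of losing its tag. The trade-off is the behavior change mentioned above: a title that already contains markup or an entity is treated as plain text and escaped again rather than interpreted.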

Personally, I would drop html2text and use only html_escape, since this would also reduce the dependencies quite a bit.

Once we decide what to do, I can create a pull request.

crowdagger commented 4 years ago

I think replacing it with html_escape sounds good. I don't have much time for dev right now, but if you can create a pull request I will definitely merge it. Thanks!

crowdagger commented 4 years ago

Thank you!

crowdagger commented 4 years ago

Just published version 0.4.8 with your fix, thanks again

crowdagger / epub-builder

HTML entities (and tags) in chapter titles #13