Headers with special chars are generating invalid Mobi

vepo commented 2 years ago

I have a book in which some headers contain accented Latin characters such as áéíôú, as well as other characters such as the degree symbol.

For example:

= O Início

When I run asciidoctor-epub3, the resulting EPUB is created with invalid files. It creates a file named _o_início.xhtml, and Kindle Previewer cannot open the book, failing with this error:

"Type","Description"
"Error","E24010: Hyperlink not resolved in toc (One possible reason can be that the link points to a tag with style display:none):<PATH>\EPUB\_o_inÃcio.xhtml#","","","",""
"Error","E24001: The table of content could not be built.","","","",""
"Notice","W14001: Hyperlink not resolved:  <PATH>\EPUB\_o_inÃcio.xhtml","","","",""
"Notice","W14002: Some hyperlinks could not be resolved.","","","",""
"Notice","W14003: The start reading location could not be resolved.","","","",""
"Status","Book conversion failed."

How can I remove all non [a-z] characters from the headers using asciidoctor-epub3?

Is this a bug?

I'm using the GitHub action vepo/asciidoctor-action

To confirm that this is the cause, I unzipped the generated EPUB, removed all accented Latin characters from the file names and the references to them, and zipped it again. After that, it worked.
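
The manual check described above can be narrowed down with a small script. This is a minimal sketch, not part of asciidoctor-epub3: it lists the archive members of an EPUB whose names contain non-ASCII characters. The helper name `non_ascii_entries` and the in-memory sample archive are assumptions made for illustration.

```python
import io
import zipfile


def non_ascii_entries(epub_file):
    """Return the names of archive members that contain non-ASCII characters."""
    with zipfile.ZipFile(epub_file) as archive:
        return [name for name in archive.namelist() if not name.isascii()]


# Build a tiny in-memory archive that mimics the layout from the report.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("EPUB/_o_início.xhtml", "<html/>")
    archive.writestr("EPUB/toc.xhtml", "<html/>")
buffer.seek(0)

print(non_ascii_entries(buffer))
```

For a real book, pass the path of the generated .epub file instead of the in-memory buffer.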

slonopotamus commented 2 years ago

First of all, you may use section ids to customize that behavior:

[#alternative_id]
= O Início

In #217 we intentionally added support for Unicode filenames in generated files. It looks like KindleGen cannot handle Unicode filenames. I guess we need to use some other naming scheme when producing Mobi files.

So let me clarify: this is a valid EPUB. The EPUB spec allows Unicode filenames: http://idpf.org/epub/301/spec/epub-ocf.html#sec-container-filenames. But KindleGen is not able to process them properly.
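
The garbled name in the error log is consistent with the UTF-8 bytes of the filename being decoded with a single-byte code page. That is only one plausible reading of the log, not a confirmed diagnosis of KindleGen. The sketch below reproduces the effect, assuming Windows-1252 as the code page.

```python
name = "_o_início.xhtml"

# Encode as UTF-8 (what the EPUB uses), then misread the bytes as Windows-1252.
garbled = name.encode("utf-8").decode("cp1252")

# "í" is the two bytes C3 AD in UTF-8. Read as Windows-1252 they become
# "Ã" followed by U+00AD (soft hyphen), which is usually invisible.
print(repr(garbled))

# Dropping the invisible soft hyphen gives the name shown in the log.
visible = garbled.replace("\u00ad", "")
print(visible)
```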

mojavelinux commented 1 year ago

What you're probably looking for is ICU collation/transliteration, which we use in Asciidoctor PDF (see https://github.com/asciidoctor/asciidoctor-pdf/blob/main/lib/asciidoctor/pdf/index_catalog.rb#L60). I'm not convinced that throwing away the name of the chapter and replacing it with a number is a great solution.
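
To illustrate the idea of transliteration, here is a rough sketch that uses Python's standard `unicodedata` module as a stand-in. It is not ICU and not what Asciidoctor PDF does: it only decomposes characters and drops combining marks. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("O Início"))
print(strip_diacritics("áéíôú"))
# The degree sign has no decomposition, so it passes through unchanged
# and would still need separate handling.
print(strip_diacritics("25°C"))
```

A real ICU transform covers far more scripts and symbols than this decomposition trick, which is why it is only a stand-in here.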

slonopotamus commented 1 year ago

It is extremely hard to generate human-friendly identifiers.

1. We need to use only a subset of symbols
2. We need to avoid collisions between generated and user-specified IDs
3. We need to reserve identifiers for preface/covers/toc/etc.
4. Users need to be properly warned if their custom IDs cause issues

None of items 2, 3, and 4 is fully implemented as of today.
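
The four requirements listed above could be checked along the following lines. This is a hedged sketch, not the asciidoctor-epub3 implementation: the allowed character subset, the reserved names, and the function `check_ids` are all assumptions chosen for illustration.

```python
import re

# Hypothetical safe subset: lowercase ASCII letters, digits, underscore, hyphen.
ALLOWED = re.compile(r"[a-z_][a-z0-9_-]*")
# Hypothetical names a converter might reserve for its own files.
RESERVED = {"cover", "nav", "toc", "preface"}


def check_ids(user_ids, generated_ids):
    """Return warnings for user IDs; an empty list means no problem was found."""
    warnings = []
    generated = set(generated_ids)
    for ident in user_ids:
        if not ALLOWED.fullmatch(ident):
            warnings.append(f"{ident!r}: uses characters outside the allowed subset")
        if ident in RESERVED:
            warnings.append(f"{ident!r}: reserved for converter-generated content")
        if ident in generated:
            warnings.append(f"{ident!r}: collides with a generated ID")
    return warnings


for message in check_ids(["início", "toc", "_chapter_1"], ["_chapter_1"]):
    print(message)
```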

slonopotamus commented 1 year ago

I've been thinking about this issue for a couple of days now, and a proper solution seems to be just too complex.

mojavelinux commented 1 year ago

I would think you could use the same solution we use for section IDs in an AsciiDoc document. Clear away invalid characters (either via regex, ICU transformation, or both), then append a counter if the name is already used. That doesn't seem unreasonably complex to me. However, if the users are happy with a straight counter, then I suppose my case is moot.
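
The approach described above (clear away invalid characters, then append a counter on collision) might look roughly like this. It is a sketch under assumptions, not Asciidoctor's actual code: the `_` prefix and separator, the counter starting at 2, the `section` fallback, and the name `make_unique_id` are all illustrative choices.

```python
import re
import unicodedata


def make_unique_id(title, used, prefix="_", separator="_"):
    """Build an ASCII-only ID from a title and make it unique within `used`."""
    # Decompose accented letters, then drop everything that is not ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of disallowed characters into one separator.
    slug = re.sub(r"[^a-z0-9]+", separator, ascii_title.lower()).strip(separator)
    base = prefix + (slug or "section")  # fallback when nothing survives
    candidate = base
    counter = 2
    while candidate in used:
        candidate = f"{base}{separator}{counter}"
        counter += 1
    used.add(candidate)
    return candidate


used_ids = set()
print(make_unique_id("O Início", used_ids))
print(make_unique_id("O Início", used_ids))
print(make_unique_id("25°C", used_ids))
```

Because the same `used` set is passed in for every heading, a repeated title gets a numeric suffix instead of overwriting the earlier file.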

slonopotamus commented 1 year ago

Okay, I have a plan. I will reuse :sectids: logic.

asciidoctor / asciidoctor-epub3

Headers with special chars are generating invalid Mobi #417