asciidoctor / asciidoctor-pdf

Asciidoctor PDF: A native PDF converter for AsciiDoc based on Asciidoctor and Prawn, written entirely in Ruby.
https://docs.asciidoctor.org/pdf-converter/latest/
MIT License

Character set warning after page break #2453

Closed pwaehnert closed 10 months ago

pwaehnert commented 10 months ago

If a sufficiently long text is followed by an image that must be placed on a new page, I get a warning about characters that could not be fully converted. The minimal example is a bit dull:

// test.adoc
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt
...
ipsum dolor sit amet.

// repeat that previous paragraph as often as necessary such that the following block image is placed on a new page

image:test.png[]

Converting this file with the following call produces the warning:

> asciidoctor-pdf -a pdf-theme=base -w -v -t test.adoc
asciidoctor: WARNING: The following text could not be fully converted to the Windows-1252 character set:
| ⁣?

The automatic page break seems to be the problem. But interestingly enough the base theme seems to play an important role too, since omitting it fixes the issue.

mojavelinux commented 10 months ago

Please provide a full reproducible example so I can run it. If this is, in fact, an issue, it seems to depend on a very specific set of circumstances that I don't want to have to spend time trying to figure out how to reproduce.

pwaehnert commented 10 months ago

Here's a minimal AsciiDoc file:

// test.adoc
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

image:test.png[]

And this could serve as the example image test.png:

[image: test]

This minimal example is converted by the following command line call:

> asciidoctor-pdf -a pdf-theme=base -w -v -t test.adoc
asciidoctor: WARNING: The following text could not be fully converted to the Windows-1252 character set:
| ?

I'm using Windows 10. I suspect the warning won't appear on GNU/Linux or macOS.

pwaehnert commented 10 months ago

I also tried to minimize the base theme. It is indeed possible to reduce the theme to an empty file. But it is still essential to refer to a theme, even an empty one. I think the default values used when no theme is given contain something special that avoids this encoding warning.

mojavelinux commented 10 months ago

I see what's happening here. The text for the fragment of an inline image is temporarily set to a placeholder character (\u2063) in order to reserve space where the image will be inserted. That placeholder character is never rendered. However, when the text containing that fragment is advanced to a new page, the text is normalized again. If the font is an AFM font, as is the case when using the base theme, the converter checks that each character can be encoded into the Windows-1252 character set, which is a requirement of using an AFM font. In this case, \u2063 cannot be encoded. At that point the converter normally looks for a fallback character, but no fallback character for \u2063 is defined in the map (see https://github.com/asciidoctor/asciidoctor-pdf/blob/v2.3.9/lib/asciidoctor/pdf/ext/prawn/font/afm.rb#L6-L12).

The fix that is needed here is to add a fallback character for \u2063 to the aforementioned map so that the text normalization operation succeeds. That character is never rendered, so the fallback value doesn't really matter.
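
To illustrate the mechanism described above, here is a minimal sketch using only Ruby's built-in String#encode. It is not the code from asciidoctor-pdf: the FALLBACK_CHARS map and the to_windows_1252 helper are hypothetical names made up for this example, and the space used as the fallback value is an arbitrary choice, since the placeholder is never rendered.

```ruby
# Hypothetical fallback map for illustration only; this is NOT the
# actual table from asciidoctor-pdf.
FALLBACK_CHARS = {
  "\u2063" => ' ', # invisible separator used as the inline image placeholder
}.freeze

# Hypothetical helper: encode text to Windows-1252, the character set
# required by AFM fonts. Returns nil when a character has neither a
# Windows-1252 mapping nor an entry in the fallback map.
def to_windows_1252(text, fallbacks = {})
  text.encode('Windows-1252', fallback: fallbacks)
rescue Encoding::UndefinedConversionError
  nil
end

text = "before \u2063 after"

# With no fallback entry, the conversion fails (the warning case).
without_fallback = to_windows_1252(text)

# With a fallback entry for \u2063, the conversion succeeds.
with_fallback = to_windows_1252(text, FALLBACK_CHARS)

puts without_fallback.inspect
puts with_fallback.inspect
```

The first call corresponds to the situation that triggers the warning; the second shows why adding an entry for \u2063 is enough to make normalization succeed.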

I will apply a fix and add a test for this.

With that said, the base theme is not intended to be used directly. Rather, it is intended as a theme that you extend to add your own fonts and styles. The base theme defaults to AFM fonts, but these fonts are extremely limited. You are encouraged to use TrueType fonts, as described in the docs at https://docs.asciidoctor.org/pdf-converter/latest/theme/font-support/.

If you use the default theme instead of the base theme, you will not receive this warning. Also, this is just a pedantic/verbose warning, not an error. It only reports that the converter encountered a character it cannot deal with. If that character isn't important, which is the case here, it does not impact the result.

pwaehnert commented 10 months ago

Thank you for the explanation and the immediate bug fix! Are there any plans when Version 2.3.10 will be published?

I don't use the base theme directly but extend it in my own theme. But you're right, I didn't overwrite the font properties.

I already suspected that this warning doesn't mean much. But I'd like to raise the failure level to warnings in order to catch other bugs early enough. For example, if images are missing, I'd like to abort the conversion. It is very tedious to scan manually through our large documentation to spot those missing images.

mojavelinux commented 10 months ago

Are there any plans when Version 2.3.10 will be published?

When I get to it. Though I am interested in getting a release out as the fixes are starting to accumulate.

But I'd like to raise the failure level to warnings in order to catch other bugs early enough.

You can still do that; just drop the -v flag. The -v flag adds pedantic warnings, which shouldn't be tied to failing fast because they are not guaranteed to indicate problems. In other words, they are intended to be informational.