Kozea / WeasyPrint

The awesome document factory
https://weasyprint.org
BSD 3-Clause "New" or "Revised" License

Invalid CMap table generated #1571

Closed janvogt closed 2 years ago

janvogt commented 2 years ago

Observation

Some printable UTF-8 characters couldn't be recovered from the generated PDF. A poppler-cpp based pdftotext library warned:

poppler/error: Invalid entry in bfchar block in ToUnicode CMap

Research

Following this trail, I figured out that the following line in the CMap of the PDF generated by WeasyPrint triggers this error, since the source code exceeds the allowed length. See the check in the poppler source code:

<10006c5b> <6c5b>

Using a hex editor to change the initial 1 to a 0, i.e.

<00006c5b> <6c5b>

fixed the PDF: the character became recoverable and also visible in a PDF viewer (Apple Preview).

I attempted to figure out how the faulty value is generated, but eventually had to give up. The relevant line just renders the glyph; it is unclear to me where the glyph originates.
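To illustrate the check that fails, here is a minimal sketch (a hypothetical helper, not part of the repro or of poppler) that scans an uncompressed ToUnicode CMap for bfchar source codes wider than two bytes, the condition behind poppler's "Invalid entry in bfchar block" warning. In real PDFs the ToUnicode stream is usually compressed, so it would have to be decompressed first (e.g. with a PDF library):

```python
import re

# One bfchar entry: <source code> <unicode value>, both in hex.
BFCHAR_ENTRY = re.compile(rb'<([0-9A-Fa-f]+)>\s+<[0-9A-Fa-f]+>')

def find_invalid_bfchar(data):
    """Return the hex source codes of bfchar entries wider than 2 bytes."""
    invalid = []
    for block in re.findall(rb'beginbfchar(.*?)endbfchar', data, re.S):
        for code in BFCHAR_ENTRY.findall(block):
            if len(code) > 4:  # more than 4 hex digits, i.e. above 0xffff
                invalid.append(code.decode('ascii'))
    return invalid

cmap = b'2 beginbfchar\n<10006c5b> <6c5b>\n<0041> <0041>\nendbfchar'
print(find_invalid_bfchar(cmap))  # ['10006c5b']
```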

Reproducible example

A reproducible example is available at https://github.com/janvogt/repro-weasyprint-cmap-utf8. I have also attached the invalid PDF example.pdf and the fixed PDF example_fixed.pdf, generated from the following HTML:

<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
</head>

<body>
  汛
</body>

</html>
liZe commented 2 years ago

Thanks for the bug report.

The problem comes from the glyph number: PDF requires this number to be lower than 0xffff, and we have 0x10006c5b. That’s a very high number, and we can safely assume that your font doesn’t include more than 268 million characters :D.

Actually, DejaVu Sans doesn’t include your character. I suppose that no font installed on your system includes this character, and that Pango falls back to an empty glyph. So, I suppose that your bug is actually #1508, already fixed in version 54.

Removing the leading 1 syntactically "fixes" the PDF, but it’s not a real fix. As glyph number 6c5b doesn’t exist in DejaVu (which is embedded in the PDF), your PDF reader uses another font (installed on the computer where the PDF is viewed) to render Unicode character number 6c5b. It’s just a magic trick of the PDF renderer; it doesn’t really work ;).

You can try to render your document with version 54.x, and you’ll probably get a correct but empty PDF, as no font on your system includes this character. Or you can install a font that includes Chinese ideograms, and it will work even with 53.x. That’s my assumption!

janvogt commented 2 years ago

Thanks for the quick response!

Unfortunately, I was already using version 54.1, except in the repro. Here are the same PDFs rendered using 54.1 (I also updated the repro):

You are correct that the font used does indeed not contain a glyph for Unicode character 0x6c5b. It turns out that with a font containing such a glyph, the problem does indeed not occur (see example_full_font.html in the repro).

However, I still think it is a bug to generate an invalid PDF when the font is missing a glyph. To me, the expected behaviour would be to render the "character not found" glyph, but I could live with the current behaviour of showing nothing as well.

In any case though, when extracting the text from the PDF (e.g. using something like pdftotext), all characters should be preserved via a correct CMap. Otherwise, property testing becomes very cumbersome... But maybe there are good reasons against it that I just don't see?

After the report I dug a little deeper, and it seems that this far too large glyph value comes from Pango. However, I am not 100% sure...

liZe commented 2 years ago

However, I still think it is a bug to generate an invalid PDF when the font is missing a glyph.

That’s true.

To me, the expected behaviour would be to render the "character not found" glyph, but I could live with the current behaviour of showing nothing as well.

We’ll let Pango’s fallback mechanism handle this for us; it looks like it displays nothing and not a fallback character.

In any case though, when extracting the text from the PDF (e.g. using something like pdftotext), all characters should be preserved via a correct CMap. Otherwise, property testing becomes very cumbersome... But maybe there are good reasons against it that I just don't see?

No, you’re right there’s a bug we should fix.

But your problem is specific; it’s not the common case. Usually, when a character is missing, Pango finds a fallback glyph and everything goes well. But here, Pango wants to use glyph number 0x10006c5b (that’s 268463195 in decimal), and the PDF specification says in 9.7.6.2 that "The code length shall not be greater than […]": that’s the problem you have.

I doubt that any font on your system includes that many glyphs, so I think that Pango isn’t giving a real glyph number. The fact that this code ends with the Unicode character number (6c5b) is another hint, as Unicode code points and glyph IDs are usually unrelated.

We have to check Pango’s documentation to see whether this code means something else. And even if we don’t find anything in the docs, we should fix this case so that we display nothing instead of including a forbidden code.
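A minimal sketch of the kind of guard described above (a hypothetical helper, not WeasyPrint's actual code): when building the bfchar block, drop any glyph ID that does not fit in two bytes instead of emitting a forbidden code.

```python
def bfchar_block(glyph_to_unicode):
    """Build a ToUnicode bfchar block from a {glyph id: code point} mapping,
    silently dropping glyph IDs that do not fit the 2-byte code space."""
    lines = []
    for glyph, codepoint in sorted(glyph_to_unicode.items()):
        if glyph > 0xFFFF:  # forbidden code, e.g. Pango's 0x10006c5b
            continue
        lines.append('<%04x> <%04x>' % (glyph, codepoint))
    return '%d beginbfchar\n%s\nendbfchar' % (len(lines), '\n'.join(lines))

print(bfchar_block({0x10006C5B: 0x6C5B, 0x0041: 0x0041}))
# 1 beginbfchar
# <0041> <0041>
# endbfchar
```

The invalid entry is simply omitted, so the PDF stays valid; the missing character is then lost to text extraction, which matches the "display nothing" behaviour rather than a full fix.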

janvogt commented 2 years ago

Turns out there is a PANGO_GLYPH_UNKNOWN_FLAG that has exactly the offset we're seeing: 0x10000000. So every Unicode character of the form 0x0....... has a corresponding glyph 0x1....... in Pango that is used when it's not available in the current font. I think that is what we're seeing here.

Here is an example of how it is used in the Pango codebase: https://gitlab.gnome.org/GNOME/pango/-/blob/main/pango/pango-layout.c#L1458
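The flag can be checked and stripped with plain bit operations; a quick sketch (the helper names are hypothetical, only PANGO_GLYPH_UNKNOWN_FLAG comes from Pango):

```python
# Pango sets bit 28 on the glyph value when no font provides the glyph;
# the low bits then carry the original Unicode code point.
PANGO_GLYPH_UNKNOWN_FLAG = 0x10000000

def is_unknown_glyph(glyph):
    """True if the glyph value is Pango's 'unknown glyph' marker."""
    return bool(glyph & PANGO_GLYPH_UNKNOWN_FLAG)

def unknown_glyph_codepoint(glyph):
    """Recover the Unicode code point from an 'unknown glyph' value."""
    return glyph & ~PANGO_GLYPH_UNKNOWN_FLAG

glyph = 0x10006C5B  # the value that ended up in the invalid CMap
assert is_unknown_glyph(glyph)
assert chr(unknown_glyph_codepoint(glyph)) == '汛'  # U+6C5B
```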

liZe commented 2 years ago

Turns out there is a PANGO_GLYPH_UNKNOWN_FLAG that has exactly the offset we're seeing: 0x10000000.

Thanks a lot, we understand what’s going on now, and we can fix this bug.

liZe commented 2 years ago

The bug is fixed in the 54.x and master branches. Tests and feedback are welcome!

janvogt commented 2 years ago

I am happy to confirm that the fix works in the minimal repro. Thanks for the quick responses, and for providing and maintaining this awesome tool!