eclipse-birt / birt

Eclipse BIRT™ The open source reporting and data visualization project.
http://www.eclipse.org/birt
Eclipse Public License 2.0

Tab characters break PDF/A conformance #1948

Open hvbtup opened 3 hours ago

hvbtup commented 3 hours ago

When a data item contains text with a horizontal tab character (U+0009), e.g. "abc\tdef" in Java/C/JS syntax, the generated PDF no longer conforms to PDF/A.

I examined this in more detail.

In the resulting PDF, the tab character appears as text in the content stream, set in the "Times-Roman" font. This font is one of the built-in PDF fonts, not a TrueType font, so it is not embedded (or subsetted) into the PDF. This means that the glyph metrics and shapes are not contained in the PDF.

PDF/A, however, requires all fonts to be embedded (or subsetted).

What happens internally is:

There is a class FontSplitter in BIRT which is called from ChunkGenerator.getNext(), which in turn is called from TextCompositor.

Its task seems to be to ensure that every character of the text can be displayed, so it uses rules to select a font for each character (most font files support only a small subset of the Unicode characters).

In the case of the TAB character, the method FontHandler.selectFont(c) returns "Times-Roman". This happens when the preferred font (in my case, "Arial") does not support the character (method charExists(c) returns null). By the way, charExists for TrueType fonts looks for the glyph metrics, whereas charExists for Type1 fonts works very differently.

The BIRT logic thinks that "Times-Roman" supports the TAB character, and thus, this font is selected.
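To make the fallback easier to follow, here is a simplified model of it. This is not BIRT's code: the class `FontFallbackSketch`, its nested `Font` type and both coverage predicates are hypothetical, and the rule used for the Type1 font is only an assumption about why it accepts the TAB character.

```java
import java.util.List;
import java.util.function.IntPredicate;

/** Simplified model of per-character font fallback. Not BIRT's actual code. */
final class FontFallbackSketch {

    /** Hypothetical stand-in for a font: a name plus a "can display this character" test. */
    static final class Font {
        final String name;
        final IntPredicate charExists;

        Font(String name, IntPredicate charExists) {
            this.name = name;
            this.charExists = charExists;
        }
    }

    /** Example chain: the preferred font first, the built-in fallback last. */
    static List<Font> demoChain() {
        // Assumption: the TrueType check succeeds only where the font has glyph
        // metrics, modeled here as the range U+0020..U+00FF.
        Font arial = new Font("Arial", c -> c >= 0x20 && c <= 0xFF);
        // Assumption: the Type1 check is encoding based and accepts any code
        // point below 256, including control characters.
        Font timesRoman = new Font("Times-Roman", c -> c >= 0 && c < 256);
        return List.of(arial, timesRoman);
    }

    /** Returns the first font in the chain that claims to support the character. */
    static Font selectFont(List<Font> chain, int c) {
        for (Font font : chain) {
            if (font.charExists.test(c)) {
                return font;
            }
        }
        throw new IllegalArgumentException("no font for U+" + Integer.toHexString(c));
    }

    public static void main(String[] args) {
        List<Font> chain = demoChain();
        for (char c : "a\tb".toCharArray()) {
            System.out.printf("U+%04X -> %s%n", (int) c, selectFont(chain, c).name);
        }
    }
}
```

In this model, "a" and "b" resolve to Arial, while the TAB falls through to Times-Roman, which is the font that then ends up referenced but not embedded.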

Other control characters in the Unicode code point range 0-31 (U+0000 to U+001F) will probably cause issues, too.
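To check whether a given text is affected, one could scan it for characters in that range. A minimal sketch; the class and method names are made up for illustration and are not part of BIRT:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for locating control characters in report text. */
final class ControlCharScan {

    /** Returns the indices of all characters in the range U+0000..U+001F. */
    static List<Integer> controlCharIndices(String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // All characters in this range are in the BMP, so a plain char scan is enough.
            if (text.charAt(i) < 0x20) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(controlCharIndices("abc\tdef"));
    }
}
```

Note that this also reports line feed and carriage return, which an emitter may want to treat differently from TAB.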

In an ideal world, text in data items would not contain any TAB characters.

I am not sure how the different control characters should be handled by the different emitters.

Furthermore, I am not aware of an obvious workaround/solution without changing BIRT Source code.

I looked at some TrueType fonts (including some fonts for Code 39 and Code 128 barcodes, consola.ttf, arial.ttf, arialuni.ttf). None of them supports the TAB character.

The reason is obvious: A TAB character cannot have a glyph width, because its width is dynamic by definition.

hvbtup commented 2 hours ago

tab_character_pdfa.zip

This rptdesign file can be used for demonstrating the issue: When a PDF report is created with BIRT, it references the Times-Roman font.

hvbtup commented 2 hours ago

I'll create a PR which fixes this specific issue with the TAB character by replacing it with a space for PDF output...
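The replacement itself is simple. A minimal sketch, assuming it is applied to the text before the per-character font selection runs; the class `TabReplacer` is hypothetical, and where the PR actually hooks in is not shown here:

```java
/** Hypothetical helper; not the code of the actual PR. */
final class TabReplacer {

    /** Replaces every horizontal tab (U+0009) with a single space (U+0020). */
    static String replaceTabs(String text) {
        if (text == null) {
            return null;
        }
        return text.replace('\t', ' ');
    }

    public static void main(String[] args) {
        System.out.println(replaceTabs("abc\tdef"));
    }
}
```

With the TAB gone, the preferred font covers every character of the text, so the fallback to Times-Roman is no longer triggered for this case.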

wimjongman commented 40 minutes ago

Thanks, Henning!