danfickle / openhtmltopdf

An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
https://danfickle.github.io/pdf-templates/index.html
Other
1.9k stars 355 forks source link

Font resolution: serif implicitly added even if built-in fallback already matched #698

Open stechio opened 3 years ago

stechio commented 3 years ago

I noticed that the font resolution algorithm stubbornly adds the serif built-in font, no matter if the selected font families have already found an appropriate built-in match (see com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.resolveFont(..)):

public class PdfBoxFontResolver implements FontResolver {
. . .
    private FSFont resolveFont(SharedContext ctx, String[] families, float size, IdentValue weight, IdentValue style, IdentValue variant) {
. . .
        // Built-in fonts.
        if (families != null) {
            resolveFamilyFont(ctx, families, size, weight, style, variant, fonts, _builtinFonts);

            FontDescription serif = _builtinFonts.resolveFont(ctx, "Serif", size, weight, style, variant);
            if (serif != null) {
                fonts.add(serif);
            }
        }
. . .
    }
. . .
}

That behavior adversely affects the font metrics calculation (see com.openhtmltopdf.pdfboxout.PdfBoxTextRenderer.getFSFontMetrics(..)) and, in turn, the actual text placement (see com.openhtmltopdf.layout.InlineBoxing.calculateInlineMeasurements(..)), as parameters like ascent and descent are calculated from incoherent typefaces. For example, if monospace is selected via CSS, the result is an ugly extra space above the ascender line which unbalances the ascent/descent ratio, making text feel like it's sitting on the line bottom instead of flowing in the middle -- here it is a comparison (on the left, the current wrong behavior, on the right the correct one generated with the code fix here below):

Wrong (serif abuse) Correct (no serif abuse)

Here it is the way Gecko (Firefox) renders the same input (note the balanced ascent/descent ratio):

Firefox

Source HTML: serifFallback.html.txt

The nuisance can be easily fixed with a little code change:

public class PdfBoxFontResolver implements FontResolver {
. . .
    private FSFont resolveFont(SharedContext ctx, String[] families, float size, IdentValue weight, IdentValue style, IdentValue variant) {
. . .
        // Built-in fonts.
        if (families != null) {
            if (!resolveFamilyFont(ctx, families, size, weight, style, variant, fonts, _builtinFonts)) {
                fonts.add(_builtinFonts.resolveFont(ctx, "Serif", size, weight, style, variant));
            }
        }
. . .
    }
. . .
    private boolean resolveFamilyFont(
            SharedContext ctx,
            String[] families,
            float size,
            IdentValue weight,
            IdentValue style,
            IdentValue variant,
            List<FontDescription> fonts,
            AbstractFontStore store) {
        boolean resolved = false;
        for (int i = 0; i < families.length; i++) {
            FontDescription font = store.resolveFont(ctx, families[i], size, weight, style, variant);
            if (font != null) {
               fonts.add(font);
               resolved = true;
            }
        }
        return resolved;
    }
. . .
}
danfickle commented 3 years ago

Thanks @stechio for the detailed write-up. The trouble is that we don't know at this point if the matched font will contain all the characters needed. So just not adding serif would change behaviour. Consider font-family: 'KoreanCharsOnly' which then uses Latin characters.

To know if we are actually using serif we could run the font run divider again but this is extremely slow due to using exceptions from PDFBOX to check whether a character is in the given font.

I think what I need to do is:

I'll see if I can have a go at the first item soon.