Font resolution: serif implicitly added even if built-in fallback already matched

I noticed that the font resolution algorithm always adds the built-in serif font, regardless of whether the selected font families have already matched an appropriate built-in font (see com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.resolveFont(..)):

public class PdfBoxFontResolver implements FontResolver {
. . .
    private FSFont resolveFont(SharedContext ctx, String[] families, float size, IdentValue weight, IdentValue style, IdentValue variant) {
. . .
        // Built-in fonts.
        if (families != null) {
            resolveFamilyFont(ctx, families, size, weight, style, variant, fonts, _builtinFonts);

            FontDescription serif = _builtinFonts.resolveFont(ctx, "Serif", size, weight, style, variant);
            if (serif != null) {
                fonts.add(serif);
            }
        }
. . .
    }
. . .
}

That behavior adversely affects the font metrics calculation (see com.openhtmltopdf.pdfboxout.PdfBoxTextRenderer.getFSFontMetrics(..)) and, in turn, the actual text placement (see com.openhtmltopdf.layout.InlineBoxing.calculateInlineMeasurements(..)), because parameters such as ascent and descent are calculated from a mix of unrelated typefaces. For example, if monospace is selected via CSS, the result is an ugly extra space above the ascender line that unbalances the ascent/descent ratio, so the text appears to sit at the bottom of the line instead of being vertically centered. Here is a comparison (on the left, the current wrong behavior; on the right, the correct one, generated with the code fix below):

[Image: Wrong (serif abuse)] [Image: Correct (no serif abuse)]

Here is how Gecko (Firefox) renders the same input (note the balanced ascent/descent ratio):

Source HTML: serifFallback.html.txt
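To illustrate why an extra serif entry can inflate the line metrics, here is a minimal sketch. It assumes, purely for illustration, that the metrics for a piece of text are taken as the maximum ascent and descent over every font in the resolved list; the actual logic in PdfBoxTextRenderer.getFSFontMetrics(..) may differ. The `Metrics` class, the `combine` method and all numeric values are hypothetical and are not taken from openhtmltopdf or from real font files.

```java
import java.util.Arrays;
import java.util.List;

public class MixedMetricsSketch {
    /** Hypothetical stand-in for a font's vertical metrics (units per 1000 em). */
    public static final class Metrics {
        public final String name;
        public final float ascent;
        public final float descent;

        public Metrics(String name, float ascent, float descent) {
            this.name = name;
            this.ascent = ascent;
            this.descent = descent;
        }
    }

    /**
     * Assumed aggregation: the largest ascent and the largest descent found
     * anywhere in the fallback list, whether or not that font is ever used.
     * Returns {ascent, descent}.
     */
    public static float[] combine(List<Metrics> fonts) {
        float ascent = 0f;
        float descent = 0f;
        for (Metrics m : fonts) {
            ascent = Math.max(ascent, m.ascent);
            descent = Math.max(descent, m.descent);
        }
        return new float[] { ascent, descent };
    }

    public static void main(String[] args) {
        // Made-up values, chosen only so that serif has the taller ascent.
        Metrics mono = new Metrics("Monospace", 600f, 200f);
        Metrics serif = new Metrics("Serif", 900f, 200f);

        float[] monoOnly = combine(Arrays.asList(mono));
        float[] withSerif = combine(Arrays.asList(mono, serif));

        System.out.println("monospace only: ascent=" + monoOnly[0] + " descent=" + monoOnly[1]);
        System.out.println("with serif:     ascent=" + withSerif[0] + " descent=" + withSerif[1]);
    }
}
```

Under that assumption the ascent grows as soon as serif joins the list, even when every glyph is drawn with the monospace font, which matches the extra space above the ascender line described above.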

This can be fixed with a small code change:

public class PdfBoxFontResolver implements FontResolver {
. . .
    private FSFont resolveFont(SharedContext ctx, String[] families, float size, IdentValue weight, IdentValue style, IdentValue variant) {
. . .
        // Built-in fonts.
        if (families != null) {
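            if (!resolveFamilyFont(ctx, families, size, weight, style, variant, fonts, _builtinFonts)) {
                // Fall back to serif only when no requested family matched a built-in font.
                FontDescription serif = _builtinFonts.resolveFont(ctx, "Serif", size, weight, style, variant);
                if (serif != null) {
                    fonts.add(serif);
                }
            }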
        }
. . .
    }
. . .
    private boolean resolveFamilyFont(
            SharedContext ctx,
            String[] families,
            float size,
            IdentValue weight,
            IdentValue style,
            IdentValue variant,
            List<FontDescription> fonts,
            AbstractFontStore store) {
        boolean resolved = false;
        for (int i = 0; i < families.length; i++) {
            FontDescription font = store.resolveFont(ctx, families[i], size, weight, style, variant);
            if (font != null) {
                fonts.add(font);
                resolved = true;
            }
        }
        return resolved;
    }
. . .
}

Thanks @stechio for the detailed write-up. The trouble is that, at this point, we don't know whether the matched font will contain all the characters needed, so simply not adding serif would change behaviour. Consider font-family: 'KoreanCharsOnly' applied to text that also contains Latin characters.
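The 'KoreanCharsOnly' case can be sketched as follows. This is a simplified model, not openhtmltopdf code: `FakeFont`, its code point range and the `pick` method are hypothetical, and a real font's coverage is not a single contiguous range.

```java
import java.util.Arrays;
import java.util.List;

public class FallbackCoverageSketch {
    /** Hypothetical font that covers one inclusive range of code points. */
    public static final class FakeFont {
        public final String name;
        final int first;
        final int last;

        public FakeFont(String name, int first, int last) {
            this.name = name;
            this.first = first;
            this.last = last;
        }

        boolean covers(int codePoint) {
            return codePoint >= first && codePoint <= last;
        }
    }

    /** Returns the name of the first font that covers the code point, or null if none does. */
    public static String pick(List<FakeFont> fonts, int codePoint) {
        for (FakeFont f : fonts) {
            if (f.covers(codePoint)) {
                return f.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hangul Syllables block only; the Latin range for "Serif" is illustrative.
        FakeFont korean = new FakeFont("KoreanCharsOnly", 0xAC00, 0xD7A3);
        FakeFont serif = new FakeFont("Serif", 0x20, 0x7E);

        // The family matched, yet a Latin letter has no font to draw it.
        System.out.println(pick(Arrays.asList(korean), 'A'));
        // With serif appended, the same letter resolves.
        System.out.println(pick(Arrays.asList(korean, serif), 'A'));
    }
}
```

In this model a family match alone says nothing about whether a later character is covered, which is why dropping serif whenever a family matches would change behaviour.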

To know whether we are actually using serif, we could run the font run divider again, but this is extremely slow because it relies on exceptions thrown by PDFBOX to check whether a character is in the given font.

I think what I need to do is:

1. Return the list of font runs from getWidth along with the width so this can be stored in the line break context and inline text objects and then passed back to the text renderer to actually render text or get the font metrics.
2. Submit a patch to PDFBOX so that exceptions (which are very slow) are not the only way to signal that a PDFont does not contain a given character.
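A rough sketch of the first item, returning the font runs together with the width so that later stages can reuse them instead of dividing the text again. Everything here is hypothetical: `FakeFont`, `Run`, `Measured` and `measure` are illustration-only names, the fixed per-font advance is a simplification, and the real getWidth signature and font run divider are not reproduced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FontRunSketch {
    /** Hypothetical font: one inclusive code point range and a fixed advance per 1000 em. */
    public static final class FakeFont {
        public final String name;
        final int first;
        final int last;
        final float advance;

        public FakeFont(String name, int first, int last, float advance) {
            this.name = name;
            this.first = first;
            this.last = last;
            this.advance = advance;
        }

        boolean covers(int cp) {
            return cp >= first && cp <= last;
        }
    }

    /** A stretch of text drawn with a single font. */
    public static final class Run {
        public final String font;
        public final String text;

        Run(String font, String text) {
            this.font = font;
            this.text = text;
        }
    }

    /** Width plus the runs that produced it, so callers need not divide the text again. */
    public static final class Measured {
        public final float width;
        public final List<Run> runs;

        Measured(float width, List<Run> runs) {
            this.width = width;
            this.runs = runs;
        }
    }

    /** First font covering the code point; the last font in the list is the last resort. */
    static FakeFont choose(List<FakeFont> fonts, int cp) {
        for (FakeFont f : fonts) {
            if (f.covers(cp)) {
                return f;
            }
        }
        return fonts.get(fonts.size() - 1);
    }

    public static Measured measure(String text, List<FakeFont> fonts, float size) {
        if (fonts.isEmpty()) {
            throw new IllegalArgumentException("at least one font is required");
        }
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        FakeFont currentFont = null;
        float width = 0f;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            FakeFont f = choose(fonts, cp);
            if (f != currentFont && current.length() > 0) {
                // The font changed: close the run collected so far.
                runs.add(new Run(currentFont.name, current.toString()));
                current.setLength(0);
            }
            currentFont = f;
            current.appendCodePoint(cp);
            width += f.advance * size / 1000f;
        }
        if (current.length() > 0) {
            runs.add(new Run(currentFont.name, current.toString()));
        }
        return new Measured(width, runs);
    }

    public static void main(String[] args) {
        FakeFont korean = new FakeFont("KoreanCharsOnly", 0xAC00, 0xD7A3, 1000f);
        FakeFont serif = new FakeFont("Serif", 0x20, 0x7E, 500f);
        Measured m = measure("\uD55CA", Arrays.asList(korean, serif), 10f);
        System.out.println("width=" + m.width);
        for (Run r : m.runs) {
            System.out.println(r.font + ": " + r.text);
        }
    }
}
```

With the runs stored next to the width, a renderer or metrics calculation could look only at the fonts that actually appear in `runs`, so serif would affect the metrics only when it is really used.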

I'll see if I can have a go at the first item soon.

danfickle / openhtmltopdf

Font resolution: serif implicitly added even if built-in fallback already matched #698