deeplook / svglib

Read SVG files and convert them to other formats.
GNU Lesser General Public License v3.0

Errors when attempting to load TrueType Collection fonts #226

Open rocketnova opened 4 years ago

rocketnova commented 4 years ago

When I attempt to convert SVGs that use a .ttc font, I get the "Unable to find a suitable font for 'font-family:..." error.

Here's a sample SVG:

<svg viewBox="0 0 400 100" preserveAspectRatio="xMidYMid meet" xmlns="http://www.w3.org/2000/svg">
  <text fill="black" font-family="圓體W3寬注音" font-size="80px" x="200" y="70" text-anchor="middle">上個月</text>
</svg>

The font file is 圓體W3寬注音.ttc. Here's a simplified example of what I run:

>>> from svglib.svglib import svg2rlg
>>> drawing = svg2rlg("svgs/上個月.svg")
Unable to find a suitable font for 'font-family:圓體W3寬注音'

I traced the issue to https://github.com/deeplook/svglib/blob/4ad421bf03b64adfdc85d177b011a90ed8990d17/svglib/svglib.py#L87 which only searches for .ttf files and not also .ttc files. TrueType Collections are described on Wikipedia as an extension of the TrueType format.

This might be related to or explain the issue seen in #107. When I changed the code on line 87, I was able to successfully register the font with ReportLab through svg2rlg. Perhaps it would be possible to first attempt to find and register .ttf files with ReportLab and then, if none is found, to attempt .ttc files?
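A minimal sketch of that ".ttf first, then .ttc" lookup order, using only the standard library. The helper name `find_font_file` and the `search_dirs` argument are hypothetical and are not part of svglib or ReportLab; they only illustrate the fallback order.

```python
import os

# Hypothetical constant: extensions to try, in priority order.
FONT_EXTENSIONS = (".ttf", ".ttc")


def find_font_file(font_name, search_dirs, extensions=FONT_EXTENSIONS):
    """Return the first existing '<font_name><ext>' path, or None.

    Every directory is checked for '.ttf' before any directory is
    checked for '.ttc', so a plain TrueType file always wins.
    """
    for ext in extensions:
        for directory in search_dirs:
            candidate = os.path.join(directory, font_name + ext)
            if os.path.isfile(candidate):
                return candidate
    return None
```

A caller could then decide, based on the extension of the returned path, whether a subfont selector is needed before registering the font.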

claudep commented 4 years ago

Your suggestion sounds good, would you be able to provide a patch?

deeplook commented 4 years ago

Sounds useful, but does reportlab fully support TrueType Collections, @replabrobin?

MrBitBucket commented 4 years ago

We do manage to read some ttc files. The wrinkle is that we need to specify some more information to get the right font within the collection. In the reportlab/tests/test_multibyte_jpn.py file we see this

msmincho = TTFont('MS Mincho','msmincho.ttc',subfontIndex=0,asciiReadable=0)

ie we select the subfont with index 0.

rocketnova commented 4 years ago

TL;DR I'm happy to close this issue without resolving it because TTC seems complicated. 😞

Long write up incoming.

rocketnova commented 4 years ago

I originally successfully got my font to work by simply using registerFont(TTFont(font_name, '%s.ttc' % font_name)), but since I didn't specify a subfontIndex, I assume the default subfontIndex 0 worked. This seems pretty brittle.

I can amend my SVG to add a TTC subfontIndex number into something like:

<svg viewBox="0 0 400 100" preserveAspectRatio="xMidYMid meet" xmlns="http://www.w3.org/2000/svg">
  <text fill="black" font-family="圓體W3寬注音#2" font-size="80px" x="200" y="70" text-anchor="middle">上個月</text>
</svg>

We could patch find_font() to the following and svglib correctly accepts 圓體W3寬注音#2 and TTFont is able to find and load the TTC file:

    try:
        # Try to register the font as a ttc with a subfontIndex
        # specified after the '#'.
        result = re.match(r'(.*)#(\d+)', font_name)
        if result != None:
            fn = result.group(1)
            sfi = int(result.group(2))
            registerFont(TTFont(font_name, '%s.ttc' % fn, subfontIndex=sfi))
        # Try to register the font if it exists as ttf,
        # based on ReportLab font search.
        else:
            registerFont(TTFont(font_name, '%s.ttf' % font_name))
        _registered_fonts[font_name] = True
        return font_name, True

HOWEVER It's basically impossible to distinguish between the different subfonts in the TTC files I have to test with, so I don't know how useful/accurate this test is.

IN ADDITION I did a bunch of research today about what the most correct way to indicate the subfont in a TrueType Collection is and ran into 2 issues:

1. How to indicate a TrueType Collection subfont to ReportLab?

It looks like the W3C current recommendation (and still true in the current draft) is to use the PostScript name, rather than a number as a fragment identifier. See this github issue where W3C draft recommendation was added. Apparently, when you add additional subfonts to a TTC, the new subfonts aren't always appended, so number indices can change unexpectedly.

This contrasts with the way that reportlab handles subfonts using numbered indexes. At https://github.com/eduardocereto/reportlab/blob/master/src/reportlab/pdfbase/ttfonts.py#L1002:

def __init__(self, name, filename, validate=0, subfontIndex=0,asciiReadable=None):

(Note: I'm not sure if this is an up-to-date mirror, but it was the best I could find)

So issue 1 is: how to indicate the correct subfont to svglib using numbers when the W3C is not recommending numbers?

2. CSSSelect2 can't parse @font-face declarations

I set aside the issue of indicating the subfont using numbers or the PostScript name for a second. Following the W3C recommendation, it looks like I should be able to reformat my SVG to use @font-face to indicate the correct subfont. Something like so:

<svg viewBox="0 0 400 100" preserveAspectRatio="xMidYMid meet" xmlns="http://www.w3.org/2000/svg">
  <defs>
    <style type="text/css">
      @font-face {
        font-family: 'DFPYuanW3-ZhuInW';
        src: url("圓體W3寬注音.ttc#DFPYuanW3-ZhuInW");
      }
    </style>
  </defs>
  <text fill="black" font-family="DFPYuanW3-ZhuInW" font-size="80px" x="200" y="70" text-anchor="middle">上個月</text>
</svg>

This correctly renders my SVG. However, when I attempt to run drawing = svg2rlg("svgs/上個月.svg"), I run into an issue where the svglib dependency cssselect2 doesn't know how to parse @font-face declarations.
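For illustration only, here is a rough sketch of pulling the family name, font file, and subfont fragment out of an @font-face rule with regular expressions. It is a naive workaround, not a CSS parser, it does not change how cssselect2 behaves, and the function name `parse_font_faces` is hypothetical.

```python
import re

# Naive patterns: they assume one font-family and one src url() per rule
# and no nested braces inside the @font-face body.
FONT_FACE_RE = re.compile(r"@font-face\s*\{(.*?)\}", re.DOTALL)
FAMILY_RE = re.compile(r"font-family\s*:\s*['\"]?([^;'\"]+)['\"]?\s*;")
SRC_RE = re.compile(r"src\s*:\s*url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


def parse_font_faces(css_text):
    """Return a list of (family, font_file, fragment) tuples.

    fragment is the part after '#' in the src URL, or None if absent.
    """
    faces = []
    for body in FONT_FACE_RE.findall(css_text):
        family = FAMILY_RE.search(body)
        src = SRC_RE.search(body)
        if not (family and src):
            continue  # Skip rules this sketch cannot understand.
        font_file, _, fragment = src.group(1).partition("#")
        faces.append((family.group(1).strip(), font_file, fragment or None))
    return faces
```

Applied to the style block above, this would yield the family name, the .ttc file name, and the PostScript name after the '#', which could then be handed to whatever code registers the font.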

Overall, this seems very difficult to support, because either svglib needs to support a non-standard font-family declaration OR reportlab needs to support PostScript names and svglib would need to use something other than cssselect2 to support @font-face declarations. 😖

I'm happy to spend some more time digging into this if any of you see something I missed.

replabrobin commented 4 years ago

As I understand the problems here, things would be improved if 1) the font search could include .ttc files and 2) the subfontIndex could be either a string or an integer, so that the PostScript name could be used as a selector for the subfont. If that were done, then you could use

    try:
        # Try to register the font as a ttc with a subfontIndex
        # specified after the '#'.
        result = re.match(r'(.*)#(\d+)', font_name)
        if result != None:
            fn = result.group(1)
            try:
                sfi = int(result.group(2))
            except ValueError:
                sfi = result.group(2)
            registerFont(TTFont(font_name, '%s.ttc' % fn, subfontIndex=sfi))
        # Try to register the font if it exists as ttf,
        # based on ReportLab font search.
        else:
            registerFont(TTFont(font_name, '%s.ttf' % font_name))
        _registered_fonts[font_name] = True
        return font_name, True

If this makes sense, then a first step is to make reportlab accept either an int or a string as subfontIndex; I will have a go at that over the next day or so.

rocketnova commented 4 years ago

@replabrobin Thanks for taking a look at this issue. I believe that reportlab currently accepts only an int for subfontIndex. It would be better if it accepted a string instead. Ideally, we would use something like:

    try:
        # Try to register the font as a ttc with a subfontIndex
        # string specified after the '#'.
        result = re.match(r'(.*)#(.+)', font_name)
        if result != None:
            fn = result.group(1)
            # Pass the selector through as a string (e.g. a PostScript name).
            sfi = result.group(2)
            registerFont(TTFont(font_name, '%s.ttc' % fn, subfontIndex=sfi))
        # Try to register the font if it exists as ttf,
        # based on ReportLab font search.
        else:
            registerFont(TTFont(font_name, '%s.ttf' % font_name))
        _registered_fonts[font_name] = True
        return font_name, True

The crucial difference here being that we would be passing in a string after the # as per W3C recommendation for TTC subfonts.
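The parsing step shared by both proposals can be isolated into a small helper, sketched here under the assumption that a purely numeric fragment should be treated as an index and anything else as a name. The helper name `split_subfont` is hypothetical, and whether TTFont accepts a string for subfontIndex depends on the reportlab version in use.

```python
import re

# Everything before the last '#' is the family; the rest is the selector.
SUBFONT_RE = re.compile(r"^(.*)#(.+)$")


def split_subfont(font_name):
    """Split 'Family#selector' into (family, selector).

    selector is an int when the fragment is all ASCII digits, a str
    otherwise, and None when the name contains no '#' fragment.
    """
    match = SUBFONT_RE.match(font_name)
    if match is None:
        return font_name, None
    family, selector = match.groups()
    if re.fullmatch(r"[0-9]+", selector):
        return family, int(selector)
    return family, selector
```

With this, "圓體W3寬注音#2" would give an integer selector and "圓體W3寬注音#DFPYuanW3-ZhuInW" a string selector, while a plain family name falls through to the .ttf path.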

replabrobin commented 4 years ago

I have it mostly working, but there's a can of worms related to actually extracting the PostScript font name. If all the fonts have the Windows name, that's relatively OK, as the encoding is supposed to be utf_16_be, but for Mac and other encodings it will be difficult, as I don't think Python has all of them.
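A minimal sketch of the decoding problem described above, assuming the usual name-table platform IDs (0 for Unicode, 1 for Macintosh, 3 for Windows). The function name `decode_name_record` is hypothetical and this is not how reportlab implements it; it only shows which cases map cleanly onto Python codecs.

```python
def decode_name_record(platform_id, encoding_id, raw):
    """Decode the raw bytes of a font name record, or return None.

    Only the encodings that map directly onto a Python codec are
    handled; anything else is reported as undecodable.
    """
    if platform_id in (0, 3):
        # Unicode and Windows platform strings are UTF-16 big-endian.
        return raw.decode("utf_16_be")
    if platform_id == 1 and encoding_id == 0:
        # Macintosh platform, Roman script.
        return raw.decode("mac_roman")
    # Other Macintosh scripts have no exact codec match in this sketch.
    return None
```

Returning None rather than guessing lets the caller fall back to another name record for the same subfont when one is available.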

replabrobin commented 4 years ago

I made a change to the reportlab ttfonts.py code that should allow lookup by either an int or a string. The lookup for the matching subfont by name is not particularly efficient, but it seems to work for the collections I have. If someone could try with a patched svglib and see if this goes through, that would be helpful.

rocketnova commented 4 years ago

@replabrobin I tested against reportlab version 3.5.39 and discovered a couple of things:

  1. My font reported 65536 numSubFonts on line 444, although it has only 3 subfonts, so the lookup would index out of range on line 205.
  2. The subfont in my font appears in Apple Font Book as "DFPYuanW3-ZhuInW", but its self.name.ustr on line 446 was "DFPYuanW3-ZhuInW-BFW" (extra suffix). This may simply mean that I'm not sure how to correctly identify the subfont name to pass in.
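One way to cross-check the subfont count independently of reportlab is to read the TTC header directly. The sketch below assumes the standard layout (a 'ttcf' tag, a 16-bit major and minor version, a 32-bit font count, then one 32-bit offset per font, all big-endian); the function name `read_ttc_header` is hypothetical. As a side note, 65536 is 0x00010000, which is what version 1.0 looks like when the two version fields are read as a single 32-bit value, though that is only a possible hint and not a diagnosis.

```python
import struct


def read_ttc_header(data):
    """Return (major, minor, num_fonts, offsets) from the start of a TTC file.

    data must hold at least the 12-byte header plus 4 bytes per font.
    """
    tag, major, minor, num_fonts = struct.unpack_from(">4sHHI", data, 0)
    if tag != b"ttcf":
        raise ValueError("not a TrueType Collection (missing 'ttcf' tag)")
    # The offset table starts right after the 12-byte fixed header.
    offsets = struct.unpack_from(">%dI" % num_fonts, data, 12)
    return major, minor, num_fonts, list(offsets)
```

Reading the whole file into memory and passing the bytes to this function is the simplest usage; the length of the returned offsets list is the number of subfonts the file itself declares.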