Encode unencoded glyphs as F0000 + hex(GID)

davelab6 commented 10 years ago

I was chatting with @twardoch today about how glyph names are the 'primary key' for fonts, because in any contemporary font you have so many unencoded glyphs, accessed with OpenType Layout logic... But unencoded glyphs are tricky to precisely call, because OTL logic is per-font. I mentioned that I might like to use the Unicode Private Use Area to encode otherwise-unencoded glyphs.

Adam kindly mentioned he had already thought about this, and he concluded that Private Use Plane A (Unicode Plane 15) is ideal for this, as it's U+F0000..U+FFFFD, so you can use a value of F0000 + hex(GID) to cleanly, logically encode all unencoded glyphs.

Let's do it!

[x] add a "3.10" cmap subtable

twardoch commented 10 years ago

All it requires is a simple tool which adds a "3.10" cmap subtable which maps glyph ids to the PUP A (sequentially, by adding 0xF0000 to the glyph ID). Because the codepoints will be in the same order as the glyph IDs, you can use the space-saving cmap format 12 which only defines the start and end of the cmap mapping range. So the added size overhead is small.
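To make the size claim concrete, here is a minimal sketch of a format 12 subtable that holds a single sequential group, assuming the layout described in the OpenType spec (a 16-byte header followed by 12 bytes per group). The function name `pack_format12_single_group` is made up for illustration; in practice a library such as fontTools compiles the subtable for you.

```python
import struct

def pack_format12_single_group(num_glyphs, base=0xF0000, language=0):
    """Pack a cmap format 12 subtable containing one sequential group.

    The group maps codepoint base + GID to GID for every GID in
    range(num_glyphs).
    """
    num_groups = 1
    length = 16 + 12 * num_groups  # header + one SequentialMapGroup
    # format, reserved, length, language, numGroups
    header = struct.pack(">HHLLL", 12, 0, length, language, num_groups)
    # startCharCode, endCharCode, startGlyphID
    group = struct.pack(">LLL", base, base + num_glyphs - 1, 0)
    return header + group

data = pack_format12_single_group(1000)
print(len(data))  # 28
```

The byte count does not depend on `num_glyphs`: a font that already has a format 12 subtable only gains the 12-byte group.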

vitalyvolkov commented 10 years ago

@behdad Is there any way to determine whether a glyph is unencoded using fontTools?

davelab6 commented 10 years ago

@hash3g you can find the glyph IDs in the GlyphOrder table, eg https://github.com/hash3g/yesevaone/blob/master/YesevaOne-Regular.ttf.GlyphOrder.ttx and you can find glyphs that are encoded in the cmap table. I guess you need to make a set of the glyph names and a set of the encoded glyph names and compare them to get the set of unencoded glyphs.
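The set comparison described above can be sketched as follows. This is only an illustration on toy data; the glyph names are hypothetical and `unencoded_glyphs` is not a fontTools function.

```python
def unencoded_glyphs(glyph_order, cmap):
    """Return the glyph names that no codepoint maps to, in glyph ID order.

    glyph_order: list of glyph names, indexed by glyph ID
    cmap: dict mapping codepoint -> glyph name
    """
    encoded = set(cmap.values())
    return [name for name in glyph_order if name not in encoded]

# Toy data standing in for a real font (hypothetical glyph names).
glyph_order = [".notdef", "A", "B", "A.sc", "f_i"]
cmap = {0x41: "A", 0x42: "B"}

print(unencoded_glyphs(glyph_order, cmap))  # ['.notdef', 'A.sc', 'f_i']
```

With fontTools, the two inputs could come from `getGlyphOrder()` and `getBestCmap()` on a `fontTools.ttLib.TTFont` object, assuming the "best" Unicode subtable is the one you care about.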

twardoch commented 10 years ago

I would advocate that it would be much simpler (and more storage-effective) if you just encode ALL glyphs as U+F0000 + GID.

1. Check if there already is a cmap subtable with PID 3 EID 10 (3.10 for short). If there is, skip to step 4.
2. Create a cmap subtable in format 12 and assign it to the cmap table as 3.10.
3. Copy all mappings from the PID 3 EID 1 (3.1 for short) cmap subtable to the 3.10 subtable, as the spec requires 3.10 to be a superset of 3.1.
4. "Blindly" assign mappings to all glyphs from GlyphOrder, from U+F0000 to U+F0000 + len(GlyphOrder) - 1.

This has the advantage that cmap subtable format 12 uses an efficient storage for continuous code-to-GID ranges. With my method, you'll only create one such range, so it'll only add a few bytes to the size, and will be very fast.
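As a rough illustration of the "one range" claim, the sketch below collapses a codepoint-to-glyph mapping into format-12-style groups and counts how many groups the SPUA mapping adds. The helper `sequential_groups` and the toy glyph order are assumptions for the example, not part of any library.

```python
def sequential_groups(cmap, glyph_order):
    """Collapse a codepoint -> glyph name mapping into sequential groups.

    Each group is (startCharCode, endCharCode, startGlyphID). Consecutive
    codepoints that map to consecutive glyph IDs share one group.
    """
    gid_of = {name: gid for gid, name in enumerate(glyph_order)}
    groups = []
    for cp in sorted(cmap):
        gid = gid_of[cmap[cp]]
        if groups:
            start_cp, end_cp, start_gid = groups[-1]
            if cp == end_cp + 1 and gid == start_gid + (cp - start_cp):
                groups[-1] = (start_cp, cp, start_gid)  # extend the run
                continue
        groups.append((cp, cp, gid))
    return groups

# Toy font: glyph order deliberately differs from Unicode order.
glyph_order = [".notdef", "B", "A", "A.sc", "f_i"]
unicode_cmap = {0x41: "A", 0x42: "B"}
spua_cmap = {0xF0000 + gid: name for gid, name in enumerate(glyph_order)}

before = sequential_groups(unicode_cmap, glyph_order)
after = sequential_groups({**unicode_cmap, **spua_cmap}, glyph_order)
print(len(after) - len(before))  # 1
```

However many glyphs the font has, the SPUA block stays a single group because codepoints and glyph IDs advance in lockstep.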

This approach has an additional benefit: as a user of such a font, I am not forced to address the properly (i.e. via Unicode) encoded glyphs using the F0000+ codes.

I can still use the proper Unicodes. But if I do so, the browser/app will always perform the Unicode processing and default OpenType Layout shaping for complex scripts. So I won't really have the guarantee that the glyph I'm seeing is actually the glyph assigned to the Unicode codepoint in the font's cmap. It will be for most Unicodes but for some codepoints, the "Unicode+OTL magic" will kick in.

But if I address even the "properly" encoded glyphs using the U+F0000+ codepoints, I will have a WYSIWYG guarantee. Even more: with harfbuzz.js, I can run a JS port of HarfBuzz in the browser, take the output GIDs, add 0xF0000 to them and have my own explicit custom OTL processing if I need to. So I'm completely in control and independent of any "browser magic".
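A minimal sketch of that last step, assuming some shaper has already produced a list of glyph IDs (the list below is invented): each GID is offset by 0xF0000 and emitted either as a character or as an HTML numeric character reference. The function names are hypothetical.

```python
SPUA_BASE = 0xF0000

def gids_to_text(gids):
    """Turn shaper output (glyph IDs) into a string of SPUA-A characters."""
    return "".join(chr(SPUA_BASE + gid) for gid in gids)

def gids_to_html(gids):
    """Same mapping, written as HTML numeric character references."""
    return "".join("&#x{:X};".format(SPUA_BASE + gid) for gid in gids)

shaped = [3, 17, 260]  # hypothetical glyph IDs returned by a shaper
print(gids_to_html(shaped))  # &#xF0003;&#xF0011;&#xF0104;
```

Positioning (GPOS) output is not covered by this sketch; only the glyph selection is.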

twardoch commented 10 years ago

Here is my code that does exactly what I described above.

#! /usr/bin/python
# -*- coding: utf-8 -*-
# 
# pyftaddspuaabygids.py
# Map all glyphs to the Supplementary PUA-A plane (U+F0000..U+FFFFF) 
# by 0xF0000 + glyphID
#  
# Copyright (c) 2014 by Adam Twardoch
# 
# Licensed to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import fontTools.ttLib, sys, copy

def addSPUAByGlyphIDsMappingToCMAP(ttx):
    cmap = ttx["cmap"]
    # Check if an UCS-2 cmap exists
    for ucs2cmapid in ((3, 1), (0, 3), (3, 0)): 
        ucs2cmap = cmap.getcmap(ucs2cmapid[0], ucs2cmapid[1])
        if ucs2cmap: 
            break
    # Create UCS-4 cmap and copy the contents of UCS-2 cmap
    # unless UCS 4 cmap already exists
    ucs4cmap = cmap.getcmap(3, 10)
    if not ucs4cmap: 
        cmapModule = fontTools.ttLib.getTableModule('cmap')
        ucs4cmap = cmapModule.cmap_format_12(12)
        ucs4cmap.platformID = 3
        ucs4cmap.platEncID = 10
        ucs4cmap.language = 0
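        # Start from the UCS-2 mappings if present, otherwise from an empty
        # map, so that the assignment loop below always has a dict to fill
        ucs4cmap.cmap = copy.deepcopy(ucs2cmap.cmap) if ucs2cmap else {}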
        cmap.tables.append(ucs4cmap)
    # Map all glyphs to UCS-4 cmap Supplementary PUA-A codepoints 
    # by 0xF0000 + glyphID
    ucs4cmap = cmap.getcmap(3, 10)
    for glyphID, glyphName in enumerate(ttx.getGlyphOrder()):
        ucs4cmap.cmap[0xF0000 + glyphID] = glyphName

def usage():
    print("Map all glyphs to the Supplementary PUA-A plane (U+F0000..U+FFFFF) by 0xF0000 + glyphID")
    print("python %s inputfile[.otf|.ttf] outputfile[.otf|.ttf]" % sys.argv[0])

if len(sys.argv) == 3:
    inpath = sys.argv[1]
    outpath = sys.argv[2]
    ttx = fontTools.ttLib.TTFont(inpath, 0, verbose=0)
    addSPUAByGlyphIDsMappingToCMAP(ttx)
    ttx.save(outpath)
    ttx.close() 
else: 
    usage()

behdad commented 10 years ago

I categorically reject this and think it's a bad idea. Nowhere in this report do I see any reasoning for why this is needed or is a good idea.

twardoch commented 10 years ago

Ah, yes. We talked with Dave about this. Sorry it didn't become clear.

The idea is not to do this for production-ready fonts but for the purpose of development, to be used within the context of document-driven type design and similar such applications.

In a way, think of it as the "debug" mode of building fonts. Such debug mode might include other options that generate some redundant data (such as, well, glyph names! :) ) which is useful while designing but when building fonts in "release" mode, this stuff should not be included.

behdad commented 10 years ago

Ok, sure. Yeah, that would be useful.

davelab6 commented 10 years ago

@behdad could you explain more about why you think this is a bad idea..? You think that if all Google Fonts have this feature, that it will increase the use of PUA characters and documents tightly bound to particular fonts in general usage?

behdad commented 10 years ago

@davelab6 for the same reasons that non-Unicode encodings are bad. This is even worse, this is full custom encoding, meaning any text encoded in those is illegible to any text processing use.

vitalyvolkov commented 10 years ago

@davelab6 please check that test and fix is applied

https://github.com/googlefonts/fontbakery-cli/commit/5acd915d47e9385ef529be646906790411bd731d

davelab6 commented 10 years ago

@behdad I am skeptical that this would find any general usage.

It is a secondary method that is not for text processing, but debugging: it is supplementing, not replacing, the Unicode encodings and OTL tables.

Part of Document Driven Type Design is having good examples to refer to; specifically for the re-implementation of http://fuelproject.org/utrrs/index (which is the result of a 24-hour overnight sprint, but the concept is valid and needed.)

Since we don't have OTL processing in <canvas>, I figure this secondary encoding would be the best way to get that done. And the good examples will be in the production Fonts API.

davelab6 commented 10 years ago

@hash3g for now, can you make this optional in the same way as fontcrunch is optional, via bakery.yml and set up page?

davelab6 commented 8 years ago

Per TypeThursday's Laura Worthington article we should consider this, perhaps only for display fonts, if it's become important for casual users of desktop fonts.

anthrotype commented 8 years ago

What's this article you are referring to?

davelab6 commented 8 years ago

Oh it's not out yet. Stay tuned.

anthrotype commented 8 years ago

ok. I hope you don't want to revive the idea of using PUA codes in released fonts...

behdad commented 8 years ago

> Oh it's not out yet. Stay tuned.

lol. Ping us when it is. That said, people have had bad ideas forever; doesn't mean we should support them. I'm more willing to implement a HarfBuzz tool to render arbitrary glyphs than to add a hack in fonttools.

davelab6 commented 8 years ago

The issue is that a lot of text environments that users are using do not support OTL; for text typefaces this isn't really an issue, but for display types then users really do want that particular glyph from that particular font, and for it to work everywhere. When composing a 3 word text, the scenarios where PUA text doesn't make sense are less important.

davelab6 commented 8 years ago

@anthrotype @behdad the article is at https://medium.com/type-thursday/casual-users-and-the-font-market-a1c5c2f19149

twardoch commented 8 years ago

I used to be anti-PUA for a long time but now I’m more inclined to support it, because the software makers have failed to support OpenType Layout for the last 15 years.

However, I’m more inclined to use the Supplementary PUA plane because it creates another “soft” hurdle.

I’m not much helped by PUA-less fonts if OpenType gets supported widely 20 years after I’m dead :)

BTW, instead of PUA, I’d actually be happier to produce “feature-frozen TTCs” that at least expose the most important 1:1 GSUB font features (small caps, stylistic sets, oldstyle numerals) via the cmap as supplementary font menu items (Source Han does this). It’s a cleaner solution than PUA, with little overhead, and it seems that most modern OSes and apps do support TTCs. CFF-based TTCs have much less backwards compatibility than TT-based TTCs, though.
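The idea of "freezing" a 1:1 feature into the cmap can be sketched on plain dictionaries. This is only an illustration under stated assumptions: the substitution table is invented, and `freeze_feature` is a hypothetical helper, not the implementation of any existing tool.

```python
def freeze_feature(cmap, substitutions):
    """Return a new cmap with a 1:1 glyph substitution baked in.

    cmap: dict mapping codepoint -> glyph name
    substitutions: dict mapping input glyph name -> output glyph name,
                   as a single-substitution GSUB lookup would define
    """
    return {cp: substitutions.get(name, name) for cp, name in cmap.items()}

cmap = {0x61: "a", 0x62: "b", 0x31: "one"}
smcp = {"a": "a.sc", "b": "b.sc"}  # hypothetical small caps lookup

frozen = freeze_feature(cmap, smcp)
print(frozen)
```

Each frozen cmap would then become one member font of the collection, sharing the glyph data with the others.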

twardoch commented 8 years ago

The problem with TTCs though is that hardly any commercial font distributor is equipped to carry/sell them.

twardoch commented 8 years ago

Of course I understand why Behdad categorically (LOL) rejects this. It essentially undermines the last ten years of his work. :) With HarfBuzz, Behdad has actually made OpenType Layout much more attainable for “normal” software developers, and has turned HB from a “hobby clone” project into something on par with the reference implementation, at times surpassing it.

The real problem is the primitive UX behind font selection and glyph input. So perhaps PUA is really just a silly little fix that offers comparatively little benefit, while the downsides of overusing it are potentially large.

davelab6 commented 8 years ago

I can see the case for pyfeatfreeze based TTFs... but will TTCs work for users of https://en.wikipedia.org/wiki/Creative_Writer or https://en.wikipedia.org/wiki/3D_Movie_Maker or whatever it is kids use these days? And what about OTFs?

twardoch commented 8 years ago

Microsoft has been shipping Cambria only as TTC since 2007, and numerous Chinese fonts well before that. All their APIs support TT-based TTCs well. Only a small fraction of very obscure apps that do their own font handling might have a problem. Cambria is a major font, default in many documents. I've never heard of any problems related to this, not in printing or any noteworthy apps. Windows APIs handle this completely transparently. CFF-based TTC support is newer on Windows, not sure how new.

OS X has handled TTCs (both TT and CFF) since at least 10.7, maybe earlier. The OS X upgrade rate is much faster given their upgrades were always cheap or, recently, free. Most Apple system fonts are TTC, some CFF-based, so Apple must be confident their APIs handle it well.

twardoch commented 8 years ago

By CFF-based TTC, I mean OTF-based. They have the same extension, .ttc.

davelab6 commented 8 years ago

I agree, although the upgrade rate is slowed because of hardware requirements; there is a minority of users stuck on 10.6 on old hardware (and I believe some are stuck out of choice thanks to allegiance to old 'trusty' FL5 versions... :)

I wonder what Laura thinks.

twardoch commented 8 years ago

@kenlunde might know what the repercussions of shipping CFF-based TTCs are, given Adobe's decision to ship Source Han as CFF-based TTCs but also in several split forms.

davelab6 commented 8 years ago

On 8 February 2016 at 15:00, Adam Twardoch notifications@github.com wrote:

> However, I’m more inclined to use the Supplementary PUA plane because it creates another “soft” hurdle.

What does this look like for the newbie customer described in the interview?

twardoch commented 8 years ago

Ps. If Lato had small caps or another big 1:1 feature, I would have shipped it as featfrozen TTCs from day 1. But the number of OT glyphs in Lato is tiny so I didn't bother. If we ever add small caps, I'll make the official release as TTCs.

kenlunde commented 8 years ago

@twardoch: The only downside to shipping OTCs is that Windows doesn't support them. Adobe apps, CS6 and higher, installed on Windows support OTCs, but the font resource needs to be installed into the app's private font folder. OS X started to support OTCs from Version 10.8, if memory serves. It's a superb delivery format for Source Han Sans.

twardoch commented 8 years ago

Laura is a somewhat special case. She makes calligraphic fonts where only 30% or less of the glyphs are encoded. People pay for her fonts, then cannot use them in some contexts.

The only opensource font of that kind that comes to my mind is Pecita, where users might want to insert glyphs individually. SPUA should not make it difficult to achieve the double-click-on-glyph-palette glyph insertion. But it'd be easier to track the SPUA character codes in documents in case someone wants to cleanse them later.

The featfrozen TTC method, on the other hand, would give millions of LibreOffice/OpenOffice users access to full-blown small caps (also Microsoft Office!), and to other major features (oldstyle figures, significant stylistic sets).

I personally consider featfrozen TTCs not a hack, because they don't "break" anything. The user just changes the font in the menu, like going to an italic.

The reason I contributed my SPUA code was mostly to make special font versions for use in webfont specimens, or glyphmaps, where you really want to "look", display non-printing glyphs, or override script-specific shaping. That's what I do myself and how I use it. I never shipped an end-user font like this.

Suggestion: the OFL version of the Noto fonts should not use SPUA, but the Apache 2 versions might, for all glyphs ;)

twardoch commented 8 years ago

Ah, so Windows still does not support CFF-based TTCs. Bugger :( (Thanks, Ken!)

davelab6 commented 8 years ago

Since this is about naive users of crappy Windows apps (you know, the ones they use to make T shirt designs and Blurb self-publishing books and so on) then the lack of Windows support makes this a non-starter. However, perhaps OTC isn't so important.

behdad commented 8 years ago

> Of course I understand why Behdad categorically (LOL) rejects this. It essentially undermines the last ten years of his work. :)

That's not the reason. It's "this is font-dependent text encoding" and hence BAD. But I see your point.

adrientetar commented 8 years ago

> it's become important for casual users of desktop fonts.

How about making existing apps (e.g. LibreOffice) support OT features?

davelab6 commented 8 years ago

This isn't about libre software, it is about proprietary software that will never improve.

adrientetar commented 8 years ago

Way to give up Dave. Make it better than proprietary as Stallman said and ppl will use it.

davelab6 commented 8 years ago

On 10 February 2016 at 04:09, Adrien Tétar notifications@github.com wrote:

> Way to give up Dave. Make it better than proprietary as Stallman said and ppl will use it

Now who is wearing the GNU T shirt? :)

davelab6 commented 8 years ago

> It's "this is font-dependent text encoding" and hence BAD

This reminds me of XHTML vs HTML5. I understand why it is offensive; it is a definitive example of a dirty hack.

But an insistence on technical purity keeps users from getting their jobs done and getting on with their lives.

And, this is only and merely a fallback for when OT isn't available.

adrientetar commented 8 years ago

> Now who is wearing the GNU T shirt? :)

I'm talking to you with your own words. Look, existing font editors didn't work for me so I made one.

davelab6 commented 8 years ago

You seem to still be missing my point: New software doesn't matter. This is about entrenched existing proprietary software, which can not be patched, or switched away from.

> Make it better than proprietary as Stallman said and ppl will use it

Stallman doesn't say this; he says even if a libre software alternative is worse, it ought to be used because restricted software is an injustice, and categorically worse than inconvenience of a poor quality program; and in fact he says that if people are switching to libre software because it is more powerful and convenient (say, as they do with VLC) without understanding the difference in justice, they have missed his point.

I am concerned with liberating typography for all people, not only those who disagree with him and prefer convenient restricted software to inconvenient libre software - such as a simple layout application that doesn't handle OpenType but allows them to mock up T shirt designs and is integrated with a T shirt printing firm's production and mailing operations, versus Inkscape or Scribus or whatever, which requires them to learn those applications' UIs, theory about Unicode and OpenType feature processing, and negotiating with the T shirt vendor to accept their EPS.

twardoch commented 8 years ago

Conceptually, I'm with Dave insofar as fonts in the SFNT container are unusually broad in terms of OS and app coverage and lifespan. Also, people are more likely to switch fonts than apps because apps are, well, more specific to their needs.

Also: text encoding is not an "easy" thing. There are no perfect solutions. Since 2001 I kept asking the industry how OT features should be marked up in "rich text" and the answer came in 2011 with CSS font-feature-settings. That's because it wasn't obvious at all. And it still isn't perfect: in order to get just one different glyph, you need to surround the character with an HTML span and apply a super-long CSS property, but this will likely segment the text run so that the isolated character stops interacting with the rest of the line OT-wise.

It's shit. :)

davelab6 commented 8 years ago

> how OT features should be marked up

This is a great point, I've added it to the wikimedia page, https://meta.wikimedia.org/w/index.php?title=Future_Global_Font_Format_Requirements&type=revision&diff=15336305&oldid=14437375

Can you tell us more about supplementary PUA?

twardoch commented 8 years ago

Supplementary Private Use Area-A is 65534 codepoints from U+F0000 to U+FFFFD, which means that practically all fonts can be encoded (as they can hold max. 64K glyphs). My idea was that each glyph ID gets assigned the codepoint F0000 + GID, so something like “harfbuzz.js” could do all the GSUB+GPOS processing even if a browser doesn't support some shaping. Such a harfbuzz.js would emit a series of GIDs; to display them, F0000 is added to each resulting GID and the font’s cmap is queried for that codepoint, which in turn lets the browser actually display these glyphs. Since the F0000+ codepoints in the font correspond exactly to the GID order, this can be stored very efficiently in the font’s cmap format 12, adding only a few bytes to the font regardless of its size.

The BMP PUA (E000-F8FF) only allows for 6400 codes so it wouldn't be enough for some fonts. Plus some fonts have a legitimate “corporate” usage of the BMP PUA (SIL, MUFI), while SPUA is really obviously “last resort”.
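The capacity figures above are easy to check with a little arithmetic. The sketch below also adds a bounds check, since "practically all" has one edge: glyph IDs 0xFFFE and 0xFFFF would land on U+FFFFE and U+FFFFF, which are noncharacters and outside SPUA-A. The helper names are hypothetical.

```python
BMP_PUA = (0xE000, 0xF8FF)
SPUA_A = (0xF0000, 0xFFFFD)

def size(code_range):
    """Number of codepoints in an inclusive (first, last) range."""
    first, last = code_range
    return last - first + 1

def gid_to_spua(gid):
    """Map a glyph ID to its SPUA-A codepoint, refusing out-of-range IDs."""
    cp = SPUA_A[0] + gid
    if gid < 0 or cp > SPUA_A[1]:
        raise ValueError("glyph ID %d does not fit into SPUA-A" % gid)
    return cp

print(size(BMP_PUA))  # 6400
print(size(SPUA_A))   # 65534
print(hex(gid_to_spua(0x761)))  # 0xf0761
```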

twardoch commented 8 years ago

Ps. In my method, ALL glyphs are encoded in the SPUA, not just the “unencoded” ones. But if the SPUA codes are used, the OS/app shaper does not apply any shaping since it knows nothing of the script. This allows me to do my own external shaping (e.g. via said harfbuzz.js or another JS shaper). So in essence, this is a simple poor man’s API for glyph access in apps that can only talk to fonts via the cmap. In a decent web implementation, those SPUA codes could live in some shadow DOM or something, while proper selectable/searchable text would live “on top”.

twardoch commented 8 years ago

...or via CSS generated content. People dealing with icon fonts (Bootstrap using Font Awesome etc.) have been using PUA in an elegant way that does not expose garbage to search. But it’s still fonts rather than SVG, so it’s fast, cacheable, works everywhere etc.
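For the CSS generated content route, a small generator might look like the sketch below. It relies on the standard CSS escape syntax (a backslash followed by up to six hex digits); the class name, font family, and glyph ID are invented for the example, and `glyph_css_rule` is a hypothetical helper.

```python
def glyph_css_rule(class_name, gid, font_family):
    """Build a CSS rule that shows one glyph via generated content.

    The glyph is addressed by its SPUA-A codepoint, 0xF0000 + gid,
    written as a CSS hex escape so no PUA text ends up in the markup.
    """
    codepoint = 0xF0000 + gid
    template = '.{cls}::before {{ font-family: "{fam}"; content: "\\{cp:X}"; }}'
    return template.format(cls=class_name, fam=font_family, cp=codepoint)

print(glyph_css_rule("glyph-swash-a", 0x1A2, "MyFont SPUA"))
# .glyph-swash-a::before { font-family: "MyFont SPUA"; content: "\F01A2"; }
```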

It's essentially a debate about treating users as smart vs. stupid. I prefer to treat users as reasonably smart. Users obviously prefer to just type their text on the keyboard, and that'll use normal Unicode. They will resort to some glyph palette insertion or some fancy PUA codes only if they're really “desperate” i.e. when they really have no other choice.

davelab6 commented 8 years ago

> They will resort to some glyph palette insertion or some fancy PUA codes only if they're really “desperate” i.e. when they really have no other choice.

Right, that is why Laura uses the BMP PUA, and I agree that this would be better for her (and generally.)

One remaining question for me is if this should be done for all fonts or only OT-intense display fonts.

twardoch commented 8 years ago

I don’t have an opinion on that. However, I want to add one thing: in early OpenType days, Adobe used a portion of the BMP PUA as a “corporate use area”, where they standardized certain codes for things like small caps, oldstyle numerals or certain ligatures. So, an oldstyle “3” or a smallcap “A” always had a certain code regardless of the Adobe font used. Now that was a bad idea because this practice created an illusion that these codes had some claim of universality, or longtime relevance. So they stopped using PUA after a few years.

But with purely font-specific encoding, I don’t see this as a problem. If you have a series of SPUA codepoints assigned to correspond to GIDs in a specific font, then everyone agrees that no machine “knows”, or is expected to know, what any of these codepoints “mean”. As long as all sides agree that no presumptions can be made, I think it’s fine.

In Adobe’s case U+F761 was semi-standardized as “smallcap A”, and all their early OTFs used U+F761 as small-cap A, so the danger was that some apps might start expecting that U+F761 just “means smallcap A”. But with the purely GID-oriented SPUA, U+F0761 will mean something else with every single font. So it really is “private”, and substituting fonts will yield unexpected results.

Which is fine because users will most likely not expect any stability of this encoding and will use it mostly as an input mechanism for specific glyphs in very specific situations, often with the goal being print, or laser cut, or automated engraving. Most of these laser cut or engraving apps have no OT features UI and never will.

So SPUA entry may be the only way for the user to get work done. If the user could find a better method, they’d already be using it.
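The font-specific nature of the GID-based encoding can be shown with two toy glyph orders (both invented here): the same SPUA codepoint resolves to a different glyph in each font.

```python
def spua_cmap(glyph_order):
    """Map 0xF0000 + glyph ID to the glyph name, for every glyph in the font."""
    return {0xF0000 + gid: name for gid, name in enumerate(glyph_order)}

# Two hypothetical fonts with different glyph orders.
font_a = [".notdef", "space", "A", "A.sc"]
font_b = [".notdef", "A.swash", "space", "A"]

cp = 0xF0002
print(spua_cmap(font_a)[cp], spua_cmap(font_b)[cp])  # A space
```

Swapping the font under SPUA-encoded text therefore changes which glyphs appear, which is exactly the "no presumptions" property described above.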

davelab6 commented 8 years ago

> As long as all sides agree that no presumptions can be made, I think it’s fine.

I worry that glyphs/fontmake might create a predictable map from common Unicodes to GID ordering...

fonttools / fontbakery

Encode unencoded glyphs as F0000 + hex(GID) #388