GSUB (and GPOS?) bloated

behdad commented 10 years ago

I see multiple copies of the same features. Eg:

  <FeatureRecord index="0"> 
    <FeatureTag value="aalt"/> 
    <Feature> 
      <!-- LookupCount=2 --> 
      <LookupListIndex index="0" value="0"/> 
      <LookupListIndex index="1" value="1"/> 
    </Feature> 
  </FeatureRecord> 
  <FeatureRecord index="1"> 
    <FeatureTag value="aalt"/> 
    <Feature> 
      <!-- LookupCount=2 --> 
      <LookupListIndex index="0" value="0"/> 
      <LookupListIndex index="1" value="1"/> 
    </Feature> 
  </FeatureRecord> 
  <FeatureRecord index="2"> 
    <FeatureTag value="aalt"/> 
    <Feature> 
      <!-- LookupCount=2 --> 
      <LookupListIndex index="0" value="0"/> 
      <LookupListIndex index="1" value="1"/> 
    </Feature> 
  </FeatureRecord> 
  <FeatureRecord index="3"> 
    <FeatureTag value="aalt"/> 
    <Feature> 
      <!-- LookupCount=2 --> 
      <LookupListIndex index="0" value="0"/> 
      <LookupListIndex index="1" value="1"/> 
    </Feature> 
  </FeatureRecord> 
  <FeatureRecord index="4"> 
    <FeatureTag value="aalt"/> 
    <Feature> 
      <!-- LookupCount=2 --> 
      <LookupListIndex index="0" value="0"/> 
      <LookupListIndex index="1" value="1"/> 
    </Feature> 
  </FeatureRecord>

It makes reading the GSUB tables unnecessarily hard. The snippet above is from NotoSansCJK-Regular, but I suppose it's the same with Source Han Sans.
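To put a number on the duplication in a dump like the one above, a minimal sketch along these lines could group identical feature records. It assumes a TTX-style XML fragment wrapped in a `FeatureList` element; the fragment, the `vert` record and its lookup index, and the helper name `group_feature_records` are illustrative and not taken from the actual font.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, abbreviated TTX-style fragment: three identical 'aalt'
# records (as in the dump above) plus one made-up 'vert' record.
TTX_FRAGMENT = """
<FeatureList>
  <FeatureRecord index="0">
    <FeatureTag value="aalt"/>
    <Feature>
      <LookupListIndex index="0" value="0"/>
      <LookupListIndex index="1" value="1"/>
    </Feature>
  </FeatureRecord>
  <FeatureRecord index="1">
    <FeatureTag value="aalt"/>
    <Feature>
      <LookupListIndex index="0" value="0"/>
      <LookupListIndex index="1" value="1"/>
    </Feature>
  </FeatureRecord>
  <FeatureRecord index="2">
    <FeatureTag value="aalt"/>
    <Feature>
      <LookupListIndex index="0" value="0"/>
      <LookupListIndex index="1" value="1"/>
    </Feature>
  </FeatureRecord>
  <FeatureRecord index="3">
    <FeatureTag value="vert"/>
    <Feature>
      <LookupListIndex index="0" value="2"/>
    </Feature>
  </FeatureRecord>
</FeatureList>
"""


def group_feature_records(xml_text):
    """Map (feature tag, lookup indices) to the record indices that share it."""
    root = ET.fromstring(xml_text.strip())
    groups = defaultdict(list)
    for record in root.iter("FeatureRecord"):
        tag = record.find("FeatureTag").get("value")
        lookups = tuple(
            int(item.get("value")) for item in record.iter("LookupListIndex")
        )
        groups[(tag, lookups)].append(int(record.get("index")))
    return dict(groups)


groups = group_feature_records(TTX_FRAGMENT)
for (tag, lookups), indices in groups.items():
    print(f"{tag} lookups={lookups}: {len(indices)} record(s) at {indices}")
```

Records that land in the same group have the same tag and the same lookup list, so in principle one record could stand in for all of them.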

kenlunde commented 10 years ago

There is no bloat in the actual binary data, because what appear to be separate instances of the same feature are pointing to the same binary data. For the particular font you referenced, the 12 apparent duplicate instances correspond to the 12 script+language declarations. Properly declaring scripts and languages is a necessary part of fonts, especially Pan-CJK ones.

behdad commented 10 years ago

Thanks Ken. I understand that these collapse in the binary. And I understand that separate language systems are useful. But there's no reason I'm aware of for not sharing features amongst multiple language systems.

As I said, it just makes reading the font tables harder. The font has over a hundred features when in reality a dozen will do.
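As a rough illustration of what sharing would look like, here is a sketch that collapses identical feature records and remaps each language system's feature indices to the shared copies. It is a toy model, not how AFDKO or any other tool builds tables; the function name `share_features`, the lookup numbers, and the particular script/language assignments are assumptions made up for the example.

```python
def share_features(feature_records, lang_systems):
    """Collapse identical feature records and remap language systems.

    feature_records: list of (tag, lookup_indices) tuples, one per record.
    lang_systems: dict mapping (script, language) to a list of record indices.
    Returns (unique_records, remapped_lang_systems).
    """
    unique = []    # deduplicated records, in first-seen order
    index_of = {}  # record -> its index in `unique`
    remap = []     # old record index -> new record index
    for record in feature_records:
        if record not in index_of:
            index_of[record] = len(unique)
            unique.append(record)
        remap.append(index_of[record])
    remapped = {
        key: [remap[i] for i in indices]
        for key, indices in lang_systems.items()
    }
    return unique, remapped


# Hypothetical input: every language system has its own copy of 'aalt',
# while 'locl' genuinely differs per language (lookup numbers are made up).
feature_records = [
    ("aalt", (0, 1)),  # copy used by hani/dflt
    ("aalt", (0, 1)),  # copy used by hani/JAN
    ("aalt", (0, 1)),  # copy used by hani/KOR
    ("locl", (2,)),    # JAN-specific
    ("locl", (3,)),    # KOR-specific
]
lang_systems = {
    ("hani", "dflt"): [0],
    ("hani", "JAN "): [1, 3],
    ("hani", "KOR "): [2, 4],
}

unique, remapped = share_features(feature_records, lang_systems)
print(unique)
print(remapped)
```

The language systems stay separate and the per-language `locl` records survive; only the records that were identical to begin with are merged.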

No action needed. Just wanted to bring it to your attention.

moyogo commented 10 years ago

@behdad This is common to many fonts and is pretty much standard practice. From what I understand, both AFDKO and VOLT do this. But I agree, it would make more sense if feature records weren't repeated for no reason.

kenlunde commented 10 years ago

Some background: different clients expect different degrees of script+language declaration. For single-language fonts, such as all of the OpenType CJK fonts we have developed to date, we declare the appropriate scripts, which is about a half-dozen (DFLT, hani, kana, hang, latn, grek, cyrl), but only the 'dflt' language. Pan-CJK fonts such as Source Han Sans and Noto Sans CJK require that non-default languages also be declared for the appropriate scripts. This is especially important for the 'locl' GSUB feature, but non-default languages are also declared elsewhere, such as in the 'vert' GSUB feature to handle language-specific vertical forms.

adobe-fonts / source-han-sans

GSUB (and GPOS?) bloated #46