subset.py: latin-ext missing Latin characters in IPA, Combining diacritics

GoogleCodeExporter commented 9 years ago

The Latin extended character subset (latin-ext) in subset.py only includes 
characters from Latin Extended-A, -B, -C, -D, and Latin Extended Additional 
(without Vietnamese characters).

But some of those characters are only one half of a bicameral pair of the same 
letter. For example, Ɛ (U+0190) is in Latin Extended-B, but its lowercase 
ɛ (U+025B) is in IPA Extensions.

Not all IPA Extensions characters are used outside of IPA, but in general a 
character that has a case variant in another Latin block is probably used in 
language orthographies.
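
As a rough way to find such split pairs, the sketch below scans the Latin 
Extended blocks and collects the IPA Extensions characters that are the 
lowercase form of one of their letters. The block ranges are taken from the 
Unicode block list, not from subset.py, and the names `LATIN_EXT_RANGES` and 
`ipa_case_partners` are made up for illustration.

```python
# Unicode block ranges (assumption: these are NOT copied from subset.py).
LATIN_EXT_RANGES = [
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]
IPA_EXTENSIONS = (0x0250, 0x02AF)


def ipa_case_partners():
    """Return IPA Extensions code points that are the lowercase form
    of a character in one of the Latin Extended blocks."""
    found = set()
    for start, end in LATIN_EXT_RANGES:
        for cp in range(start, end + 1):
            lower = chr(cp).lower()
            # Skip multi-character lowercase mappings.
            if len(lower) != 1:
                continue
            if IPA_EXTENSIONS[0] <= ord(lower) <= IPA_EXTENSIONS[1]:
                found.add(ord(lower))
    return sorted(found)


if __name__ == "__main__":
    for cp in ipa_case_partners():
        print(f"U+{cp:04X} {chr(cp)} (uppercase U+{ord(chr(cp).upper()):04X})")
```

The exact list depends on the Unicode version bundled with the Python 
interpreter, so it is only a starting point for a hand-reviewed list.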

Characters in Combining Diacritical Marks are also used in language 
orthographies in Latin script. They are used on letters without precomposed 
character forms. For example, Yoruba uses combining dot below (U+0323) or 
combining acute (U+0301), depending on how accented letters like ẹ́ are 
represented.
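
A small sketch of the Yoruba case using Python's standard `unicodedata` 
module. It shows that ẹ́ has no single precomposed code point, so at least one 
combining mark always remains after normalization. The constant and function 
names are illustrative only.

```python
import unicodedata

# Two canonically equivalent spellings of the same letter.
DOT_BELOW_BASE = "\u1EB9\u0301"  # ẹ (U+1EB9) + combining acute
ACUTE_BASE = "\u00E9\u0323"      # é (U+00E9) + combining dot below


def codepoints(text):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


nfc_a = unicodedata.normalize("NFC", DOT_BELOW_BASE)
nfc_b = unicodedata.normalize("NFC", ACUTE_BASE)
nfd = unicodedata.normalize("NFD", ACUTE_BASE)

print(codepoints(nfc_a))  # composed form still carries a combining mark
print(codepoints(nfc_b))
print(codepoints(nfd))    # fully decomposed: base letter plus both marks
```

Text that has not been normalized can contain either spelling, so a subset 
meant to cover such orthographies would need both U+0301 and U+0323 in 
addition to the precomposed base letters.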

I don't know if you want to include those whole blocks in the latin-ext subset, 
considering not all characters are actually used in language orthography. 
However this might not be much of an issue if the font doesn't have all the 
characters of those blocks.
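
To illustrate that last point, here is a minimal sketch assuming the subset is 
simply intersected with whatever the font actually maps. The `font_codepoints` 
set is invented sample data and `effective_subset` is a hypothetical helper, 
not a function from subset.py.

```python
# Unicode block ranges (end value is exclusive in range()).
IPA_EXTENSIONS = range(0x0250, 0x02B0)
COMBINING_DIACRITICAL_MARKS = range(0x0300, 0x0370)


def effective_subset(requested_blocks, font_codepoints):
    """Return the requested code points that the font really contains."""
    requested = set()
    for block in requested_blocks:
        requested.update(block)
    return requested & set(font_codepoints)


# Hypothetical font coverage: A, schwa, open e, and two combining marks.
font_codepoints = {0x0041, 0x0259, 0x025B, 0x0301, 0x0323}

kept = effective_subset(
    [IPA_EXTENSIONS, COMBINING_DIACRITICAL_MARKS], font_codepoints
)
print(sorted(f"U+{cp:04X}" for cp in kept))
```

Under that assumption, adding whole blocks only grows the output by the 
characters the font designer chose to draw.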

I can provide a list of characters I've found to be used in language 
orthographies.

Original issue reported on code.google.com by moy...@gmail.com on 7 Jan 2011 at 4:48

GoogleCodeExporter commented 9 years ago

Perhaps it needs a full-on "Africa languages alphabet" subset?

Original comment by sladen@gmail.com on 8 Jan 2011 at 3:14

GoogleCodeExporter commented 9 years ago

Sure, that's a valid subset. However, IPA characters are also used
outside of IPA and African orthographies.
For example, the letter schwa (ə U+0259 in IPA Extensions, Ə U+018F in
Latin Extended-B) is used in Azeri (spoken in Azerbaijan and Iran), and
the letter ezh (ʒ U+0292 in IPA Extensions, Ʒ U+01B7 in Latin
Extended-B) is used in Sami languages (spoken in Nordic countries and
Russia). Both are in MES-2.

It would actually make more sense to have latin-ext complete with all
Latin characters and diacritics, at least those used in language
orthographies. Having a European language subset (or renaming
latin-ext to match its intended use) would be more appropriate, and
could be similar to MES, alongside the Vietnamese and African subsets.

There are other possible regional subsets like American, Asian
(including Vietnamese, pinyin), or Australasian, and subsets by use
(much more limited) like transliteration (Latin Extended Additional is
full of those), phonetic transcription (IPA, UPA, APA), or historical.
There are many ways to organize subsets, but organizing them by region
is probably the most practical.

Original comment by moy...@gmail.com on 8 Jan 2011 at 4:16
