improve Unicode support

dpk commented 8 years ago

Use FontForge's “build accented glyph” feature to fill in gaps in the Unicode repertoire. There should be enough character+glyph combinations possible for most European languages …

igalic commented 5 years ago

coming here from https://github.com/edwardtufte/et-book/issues/22 I really badly miss my ć…

dpk commented 3 years ago

Here’s an initial assessment (based on running PyFontaine on the Roman font and reading through the first, like, 5% of its extremely voluminous output) of which characters would be easiest to work on, and which would give the most benefit in terms of number of languages (but alas, not number of speakers) supported. Where PyFontaine only picked up a capital or small letter as missing, it’s safe to say it would be a good idea to have both if we don’t already.

Edit: Superseded (but see the HTML comments if you’re interested), see below.

dpk commented 3 years ago

We can use <component /> elements in the UFO glyph outline to build accented characters out of base characters + accent glyphs, as follows:

<?xml version='1.0' encoding='UTF-8'?>
<glyph name="Adieresis" format="2">
  <advance width="414"/>
  <outline>
    <!-- base letter, placed with no offset -->
    <component base="A" />
    <!-- accent, shifted into position above the base -->
    <component base="dieresis" xOffset="180" yOffset="222" />
  </outline>
</glyph>

There’s already an Adieresis character in the fonts, of course; this is just an example. But I used it to test where the diacritics should go relative to the base character. The values of 180 and 222 for xOffset and yOffset make the dieresis appear exactly where it does above the A in the Roman font’s real, original Adieresis character (tested by layering a blue version of this character directly over a red version of the original Adieresis). Further, when I change dieresis to macron or caron or tilde, etc., the accent mark appears to land in the right place, horizontally centred over the base character (because the accent characters are all the same width).

For the lower-case adieresis, building a + dieresis with xOffset and yOffset both 0 (the default) matches the original, but this is almost certainly not true for all base characters.

dpk commented 3 years ago

With the above in mind, here are the correct xOffset and yOffset values for characters built from the following bases plus accents above the character in question, calculated using dieresis as the combining character except where noted (and therefore possibly incorrect for others, but this will eventually be checked and noted):

Edit: Superseded, see doc/accent-positions.md

I wish I had a quicker way of finding them out …

dpk commented 3 years ago

I guess the benefit is, once I have the offset values for all fonts, I can write a script that will generate the glif files for any combination on demand automatically …
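A minimal sketch of what such a generator script might look like, in Python, using only the standard library. The function name `build_glif` and its parameters are hypothetical; the only values taken from this thread are the Adieresis example (width 414, dieresis at offset 180, 222). It emits the same structure as the hand-written glif above and nothing more.

```python
import xml.etree.ElementTree as ET


def build_glif(name, width, base, accent, x_offset=0, y_offset=0):
    """Return GLIF (format 2) XML for a glyph made of a base plus one accent.

    Hypothetical helper: offsets of 0 are omitted, since 0 is the default.
    """
    glyph = ET.Element("glyph", {"name": name, "format": "2"})
    ET.SubElement(glyph, "advance", {"width": str(width)})
    outline = ET.SubElement(glyph, "outline")
    ET.SubElement(outline, "component", {"base": base})

    accent_attrs = {"base": accent}
    if x_offset:
        accent_attrs["xOffset"] = str(x_offset)
    if y_offset:
        accent_attrs["yOffset"] = str(y_offset)
    ET.SubElement(outline, "component", accent_attrs)

    # ET.indent only exists on Python 3.9+; without it the output is
    # still valid XML, just on one line.
    if hasattr(ET, "indent"):
        ET.indent(glyph)
    body = ET.tostring(glyph, encoding="unicode")
    return "<?xml version='1.0' encoding='UTF-8'?>\n" + body + "\n"


if __name__ == "__main__":
    # Values from the Adieresis test earlier in this thread.
    print(build_glif("Adieresis", 414, "A", "dieresis", 180, 222))
```

A real new glyph would presumably also need a <unicode hex="…"/> element and an entry in the UFO's contents.plist; both are left out of this sketch.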

Also need to find good placements for the accent marks over characters like æ, œ, r, and w, for which some languages need accented versions, but for which there are no accented versions in the original ET Book fonts. There are also no accented small-cap s or z characters.

The lowercase values likely work for some accents under the letter as well: one can, at least, build an extremely passable s-cedilla out of s + cedilla with the values given above. c + cedilla (offset 33, -10) doesn’t quite match ccedilla, but offset 33, 0 for c + acute looks reasonable.

a + ogonek looks okay-ish with ogonek at offset 155, -10 ish. (I’m assuming -10 as the yOffset for all accents like cedilla which hang from, and are attached to, the underside of the character in question.) I suspect we’re not going to get better than okay-ish unless someone feels like coming in and designing an original aogonek character. Also, as I don’t read any languages which use the ogonek, I’m not really qualified to judge how good it looks in practice; I’m just comparing at large scale to a couple of serif fonts I have.

dpk commented 3 years ago

As a goal for what languages to support, it would be nice to support all the Latin-scripted languages of the European Union (that is, all of them except Greek and Bulgarian).

dpk commented 3 years ago

Okay, after some moderately successful bodging of characters in FontForge today, I think I’m ready to upgrade what I hope can be achieved from ‘Latin-scripted languages of the European Union’ to also include ‘Latin-scripted languages promoted by the European Charter for Regional or Minority Languages’. Here’s a quick overview of what characters are needed, according to PyFontaine.

Support for any individual character is likely to come to the Roman font only first, then to the bold weights, then to italic, then only maybe to Display Italic. (I haven’t decided yet whether I’ll even keep maintaining Display Italic.)

Languages of the European Union

Danish, English, Estonian, Finnish, French, German, Irish, Italian, Portuguese, Spanish, Swedish

Already fully supported.

Bulgarian

Will not be supported — uses Cyrillic script.

Croatian

Č č Ć ć Đ đ

Czech

Č č Ď ď Ě ě Ň ň Ř ř Ť ť Ů ů

Dutch

Ĳ ĳ

It should be okay to specify these characters by OpenType positioning, I hope.

Greek

Will not be supported — uses Greek script.

Hungarian

Ő ő Ű ű

Latvian

Ā ā Č č Ē ē Ģ ģ Ī ī Ķ ķ ļ Ņ ņ Ū ū

Lithuanian

Ą ą Č č Ė ė Ę ę Į į Ū ū Ų ų

Maltese

Ċ ċ Ġ ġ Ħ ħ Ż ż

Polish

Ą ą Ć ć Ę ę Ń ń Ś ś Ź ź Ż ż

Romanian

Ă ă Ș ș Ț ț

Slovak

Č č Ď ď Ĺ ĺ Ľ ľ Ň ň Ŕ ŕ Ť ť

Slovene/Slovenian

Č č

European Minority/Regional Languages (which are not also EU languages)

All those not mentioned should either already be fully supported, or non-Latin, or (rarely) should be automatically covered when other, related languages are covered.

Assyrian

Will not be supported — uses Syriac script.

Arabic

Will not be supported — uses Arabic script.

Armenian

Will not be supported — uses Armenian script.

Belarusian

Will not be supported — uses Cyrillic script.

Bosnian

See Croatian.

Catalan

Also Valencian.

Ŀ ŀ

Gagauz

See Turkish.

Karaim

Most of these characters are pretty tricky. Maybe give up and don’t support this one.

Ė ė Ƣ ƣ Ꞑ ꞑ Ɵ ɵ Ś ś Ş ş Ь ь Ž ž Ź ź Ƶ ƶ

Kashubian

Ą ą Ã ã Ń ń Ż ż

Kurdish

Ş ş

Kven, Limburgish

Probably already supported.

Macedonian, Moldovan

Will not be supported — use Cyrillic script.

Romani

Č č

Russian

Will not be supported — uses Cyrillic script.

Rusyn

See Slovak.

Scandoromani languages

Probably supported when we support all the other characters.

Sami (all dialects)

Č č Đ đ Ǧ ǧ Ǥ ǥ Ǩ ǩ Ŋ ŋ Ŧ ŧ Ʒ ǯ Ǯ ʒ ʹ

Sorbian, Upper

Ć ć Č č Ě ě Ń ń Ř ř

Sorbian, Lower

Ć ć Č č Ě ě Ń ń Ŕ ŕ Ś ś Ź ź

Tatar

Looks complicated due to multiple competing orthographies with unclear legal statuses.

Turkish

Ğ ğ İ Ş ş

Ukrainian

Will not be supported — uses Cyrillic script.

Welsh

Ŵ ŵ Ẁ ẁ Ẃ ẃ Ẅ ẅ Ŷ ŷ Ỳ ỳ

Yezidi (Kurmanji)

See Kurdish.

Yiddish

Will not be supported — uses Hebrew script.
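As a rough way to sort the characters listed above into those that could be assembled from a base plus an accent component and those that would need drawing from scratch, one option is to look at Unicode canonical decomposition. This is only a sketch: the function name `split_accent` is made up for illustration, and a character decomposing cleanly says nothing about whether the assembled result looks good.

```python
import unicodedata


def split_accent(char):
    """Return (base, [combining mark names]) or None if char has no
    canonical decomposition (i.e. it cannot be built as base + accent)."""
    decomposed = unicodedata.normalize("NFD", char)
    if len(decomposed) < 2:
        return None
    base, marks = decomposed[0], decomposed[1:]
    return base, [unicodedata.name(mark) for mark in marks]


if __name__ == "__main__":
    # A few characters taken from the lists above.
    for char in "č ő ș đ ħ ŋ ĳ".split():
        print(char, split_accent(char))
```

Characters such as đ, ħ, ŋ and ĳ come back as None here, so they would need their own outlines (or, for ĳ, some other approach). Note also that ď, ť and ľ decompose to a plain caron even though they are conventionally drawn with an apostrophe-like form, so the decomposition alone is not enough for those.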

dpk commented 3 years ago

Char.	Langs
č	10
Č	10
ń	4
Ń	4
ć	4
Ć	4
ż	3
Ż	3
ě	3
Ě	3
ą	3
Ą	3
ź	2
Ź	2
ū	2
Ū	2
ť	2
Ť	2
ş	2
Ş	2
ś	2
Ś	2
ř	2
Ř	2
ŕ	2
Ŕ	2
ň	2
Ň	2
ę	2
Ę	2
đ	2
Đ	2
ď	2
Ď	2
ỳ	1
Ỳ	1
ẅ	1
Ẅ	1
ẃ	1
Ẃ	1
ẁ	1
Ẁ	1
ʹ	1
ʒ	1
ț	1
Ț	1
ș	1
Ș	1
ǯ	1
Ǯ	1
ǩ	1
Ǩ	1
ǧ	1
Ǧ	1
ǥ	1
Ǥ	1
Ʒ	1
ŷ	1
Ŷ	1
ŵ	1
Ŵ	1
ų	1
Ų	1
ű	1
Ű	1
ů	1
Ů	1
ŧ	1
Ŧ	1
ő	1
Ő	1
ŋ	1
Ŋ	1
ņ	1
Ņ	1
ŀ	1
Ŀ	1
ľ	1
Ľ	1
ļ	1
ĺ	1
Ĺ	1
ķ	1
Ķ	1
ĳ	1
Ĳ	1
İ	1
į	1
Į	1
ī	1
Ī	1
ħ	1
Ħ	1
ģ	1
Ģ	1
ġ	1
Ġ	1
ğ	1
Ğ	1
ė	1
Ė	1
ē	1
Ē	1
ċ	1
Ċ	1
ă	1
Ă	1
ā	1
Ā	1
ã	1
Ã	1

igalic commented 3 years ago

Croatian

Č č Ć ć Đ đ

you're missing

Š š Ž ž

(i think Slovene should have the same set, but i'm not familiar with it. even though it's a South Slavic language, to my ears it sounds like a West Slavic language)

Turkish

Ğ ğ İ Ş ş

you're missing

Ö ö Ü ü

this throws your count off

dpk commented 3 years ago

All those characters are already in the font — I’m only counting ones that aren’t there already. Thanks for double checking!

