LettError / glyphNameFormatter

Generate list of glyphnames from unicode names.
BSD 3-Clause "New" or "Revised" License

Broken link between upper and lower `ß` #108

Closed ryanbugden closed 10 months ago

ryanbugden commented 10 months ago

When converting between upper and lowercase, germandbls gets a bit tripped up. Here's a test script:


from glyphNameFormatter.data import lowerToUpper, upperToLower
from glyphNameFormatter.reader import u2n, u2U, U2u

Germandbls_unicode = 7838
germandbls_unicode = 223

name_upper = u2n(Germandbls_unicode)
name_lower = u2n(germandbls_unicode)

print("name_upper", name_upper)
print("name_lower", name_lower)

print("U2u(Germandbls_unicode)", U2u(Germandbls_unicode))
# Returns lowercase germandbls unicode instead of uppercase
print("u2U(germandbls_unicode)", u2U(germandbls_unicode))

print("upperToLower[Germandbls_unicode]", upperToLower[Germandbls_unicode])
# Lowercase germandbls unicode isn't in dictionary
print("lowerToUpper[germandbls_unicode]", lowerToUpper[germandbls_unicode])
name_upper Germandbls
name_lower germandbls
U2u(Germandbls_unicode) 223
u2U(germandbls_unicode) 223
upperToLower[Germandbls_unicode] 223
Traceback (most recent call last):
  File "<untitled>", line 19, in <module>
KeyError: 223
LettError commented 10 months ago

The Unicode data for LATIN SMALL LETTER SHARP S does not list LATIN CAPITAL LETTER SHARP S as its uppercase. You can check Lib/glyphNameFormatter/data/flatUnicode.txt:

# Unicode 12 records; fields are tab-separated and empty fields are kept
lowercase = "00DF\t0053 0053\t\tLl\t\tLATIN SMALL LETTER SHARP S"
uppercase = "1E9E\t\t00DF\tLu\t\tLATIN CAPITAL LETTER SHARP S"

lines = [lowercase, uppercase]
for line in lines:
    uniNumber, uniUppercase, uniLowercase, uniCategory, mathFlag, uniName = line.split("\t")
    print(f"uniNumber:{uniNumber}, uniUppercase:{uniUppercase}, uniLowercase:{uniLowercase}, uniName:{uniName}")
uniNumber:00DF, uniUppercase:0053 0053, uniLowercase:, uniName:LATIN SMALL LETTER SHARP S
uniNumber:1E9E, uniUppercase:, uniLowercase:00DF, uniName:LATIN CAPITAL LETTER SHARP S

This release is built with Unicode 12. I will have a look and see if the casing is different in a newer edition.
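
A minimal sketch of how case dictionaries could be derived from records in this layout, to show why 223 ends up missing from a lower-to-upper table. It is not the glyphNameFormatter implementation: `build_case_maps` is a hypothetical helper, and skipping multi-code-point mappings such as "0053 0053" is an assumption made for illustration.

```python
# Two records in the tab-separated layout shown above (empty fields kept).
RECORDS = [
    "00DF\t0053 0053\t\tLl\t\tLATIN SMALL LETTER SHARP S",
    "1E9E\t\t00DF\tLu\t\tLATIN CAPITAL LETTER SHARP S",
]


def build_case_maps(records):
    """Return (upper_to_lower, lower_to_upper) using single code point mappings only."""
    upper_to_lower = {}
    lower_to_upper = {}
    for record in records:
        number, upper, lower, _category, _math, _name = record.split("\t")
        code = int(number, 16)
        # Assumption: a mapping to several code points ("0053 0053") is skipped,
        # because a dictionary from code point to code point cannot hold it.
        if upper and " " not in upper:
            lower_to_upper[code] = int(upper, 16)
        if lower and " " not in lower:
            upper_to_lower[code] = int(lower, 16)
    return upper_to_lower, lower_to_upper


upper_to_lower, lower_to_upper = build_case_maps(RECORDS)
print(upper_to_lower)            # {7838: 223}
print(0x00DF in lower_to_upper)  # False
```

Under that assumption the capital sharp s maps down to 223, while 223 has no single-code-point uppercase to store, which matches the KeyError in the test script.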

LettError commented 10 months ago

Unicode 15 offers the same data, unless there is a better source for capitalisation.

00DF    0053 0053       Ll      LATIN SMALL LETTER SHARP S
1E9E        00DF    Lu      LATIN CAPITAL LETTER SHARP S
LettError commented 10 months ago

Python follows the Unicode rules.

Germandbls_unicode = 7838
germandbls_unicode = 223

print(chr(germandbls_unicode), chr(germandbls_unicode).upper())
print(chr(Germandbls_unicode), chr(Germandbls_unicode).lower())
ß SS
ẞ ß
LettError commented 10 months ago

Checked with the "big list"


<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
   <description>Unicode 15.0.0</description>

...
      <char cp="00DF"
        na="LATIN SMALL LETTER SHARP S"
        uc="0053 0053"

...

      <char cp="1E9E"
        na="LATIN CAPITAL LETTER SHARP S"
        lc="00DF"

This is confirmed by https://www.unicode.org/charts/PDF/U0080.pdf

LATIN SMALL LETTER SHARP S
• German
• not used in Swiss High German
• uppercase is “SS” (standard case mapping),
alternatively 1E9E ẞ

So according to Unicode, 00DF is the lowercase of 1E9E. But the capital version of 00DF is still 0053 0053 (SS). I don't know why Python is so confident about its casing.
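
If a project does want ß to pair with ẞ, one option is an explicit, opt-in override on top of the standard mapping. The sketch below uses only Python's built-in `str.upper()`; `to_upper_codepoint` and its `overrides` parameter are hypothetical names, not part of glyphNameFormatter.

```python
CAPITAL_SHARP_S = 0x1E9E
SMALL_SHARP_S = 0x00DF


def to_upper_codepoint(codepoint, overrides=None):
    """Return a single uppercase code point, or the input when there is none."""
    if overrides and codepoint in overrides:
        return overrides[codepoint]
    upper = chr(codepoint).upper()
    # str.upper() can return several characters (ß becomes "SS");
    # in that case there is no single code point to return, so keep the input.
    return ord(upper) if len(upper) == 1 else codepoint


# Standard behaviour: no single-code-point uppercase for ß.
print(to_upper_codepoint(SMALL_SHARP_S))  # 223

# Opt-in pairing of ß with ẞ.
print(to_upper_codepoint(SMALL_SHARP_S, {SMALL_SHARP_S: CAPITAL_SHARP_S}))  # 7838
```

Keeping the override separate leaves the default results in line with the Unicode data, and the ß to ẞ pairing applies only where the caller asks for it.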

LettError commented 10 months ago
n2N("germandbls")
> germandbls
N2n("Germandbls")
> germandbls
ryanbugden commented 10 months ago

Thanks for looking into this. Do you think this is something I should petition at the Unicode level, or have they already taken a firm stance to prioritize SS?

LettError commented 10 months ago

I understand it is not just an oversight. Unicode reflects the current use: ß is not expected to automatically capitalise to ẞ. The other way around is not problematic.