alvinlindstam / grapheme

A python package for grapheme aware string handling
MIT License

Determine visual width #16

Closed worldmaker18349276 closed 3 years ago

worldmaker18349276 commented 3 years ago

It would be great if the package could determine the visual width of a grapheme, which is useful for wrapping or masking text in a specific region. It should work like this:

>>> unicodedata.east_asian_width("A"), grapheme.width("A") # narrow
('Na', 1)
>>> unicodedata.east_asian_width("Ａ"), grapheme.width("Ａ") # fullwidth
('F', 2)
>>> unicodedata.east_asian_width("中"), grapheme.width("中") # wide
('W', 2)
>>> unicodedata.east_asian_width("ǔ"), grapheme.width("ǔ") # ambiguous
('A', 1)
>>> unicodedata.east_asian_width("🏳️"[0]), grapheme.width("🏳️")
('N', 1)
>>> grapheme.width("🏳️‍🌈") # Shows as two emoji in some terminals (related to font settings?)
1
>>> grapheme.width("🏳️‍🌈️‍🌈️‍🌈")
3
>>> grapheme.width("\u0300") # single combining character
0
>>> grapheme.width("\u200b") # zero-width space
0
>>> [grapheme.width(c) for c in "\0\t\b\r\n"] # control characters
[None, None, None, None, None]

Ref: East Asian Width (Unicode Standard Annex #11)

alvinlindstam commented 3 years ago

I'm unfamiliar with that annex, and I would probably struggle to find the time to understand how it relates to grapheme cluster boundaries. Does a grapheme (cluster) always get the width of its widest member? Is width defined for anything other than East Asian characters, and if so, where?

Are you trying to align text in terminals or text tables? I'm not confident that text has a deterministic width, even in monospaced fonts. I believe it differs by platform, rendering engine, font data, and the versions of each of those (in the gnarly edge cases), so I'm not sure it can be calculated accurately.

worldmaker18349276 commented 3 years ago

Yes, I'm trying to align text in terminals with monospaced fonts. By the visual width of a grapheme I mean the number of cells the grapheme occupies, not the visual width of its glyph. east_asian_width can determine which characters occupy two cells, but the problem is that zero-width joiners and spacing marks can increase the width a grapheme cluster occupies (such as "🏳️‍🌈️‍🌈️‍🌈"), which is hard to deal with without the Unicode database. This isn't a property defined by the standard, and there is already a widely used package, wcwidth, that solves this problem.
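
To make the question about cluster width concrete, here is a rough standard-library sketch of a "widest member" heuristic. The names `codepoint_width` and `cluster_width` are hypothetical helpers made up for illustration; they are not part of grapheme or wcwidth, and this is not the algorithm wcwidth uses. The rule that a cluster takes the width of its widest code point is only an assumption, and it may not reproduce every value in the examples above (emoji ZWJ sequences in particular).

```python
import unicodedata


def codepoint_width(ch):
    """Heuristic cell width of one code point (hypothetical helper)."""
    category = unicodedata.category(ch)
    if category == "Cc":
        # Control characters: no meaningful width, so report None.
        return None
    if category in ("Mn", "Me", "Cf"):
        # Non-spacing/enclosing marks and format characters (ZWJ, ZWSP).
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        # Wide and fullwidth characters take two cells.
        return 2
    return 1


def cluster_width(cluster):
    """Assumed rule: a cluster is as wide as its widest code point."""
    widths = [codepoint_width(ch) for ch in cluster]
    if not widths or any(w is None for w in widths):
        return None
    return max(widths)


print(cluster_width("A"))        # 1
print(cluster_width("\u4e2d"))   # 2 (wide CJK character)
print(cluster_width("e\u0301"))  # 1 (base letter plus combining accent)
print(cluster_width("\n"))       # None (control character)
```

Treating ambiguous-width characters as one cell matches the proposal above, but some East Asian terminal configurations render them in two cells, so that choice would likely need to be configurable.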