JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.8k stars 5.49k forks source link

Merge `charwidth` and `strwidth`? #20816

Closed TotalVerb closed 7 years ago

TotalVerb commented 7 years ago

Since these functions are very similar, I wonder if it is worth merging them into a single function, maybe textwidth.

Alternatively, perhaps strwidth(_) can be spelt sum(charwidth, _).

oscardssmith commented 7 years ago

I think the first suggestion could work well. The problem with the second is that in arabic, for example, final characters make it so where a letter is in a string is important. This could cause the answer to be incorrect if someone sums length of characters. Note: this is not a purely academic concern, bad code like this was responsible for a bug that allowed a text to crash iphones.

TotalVerb commented 7 years ago

Thanks, I was unaware of that. We should fix our implementation of strwidth then.

oscardssmith commented 7 years ago

A quick bit of research suggests that doing it generally, properly, and in a performant manner is impossible. We have several options: not change algorithms, produce slightly wrong results for some languages. Change algorithms: Persian, Hebrew, Arabic, and others will be ~1000x slower. Throw an error if certain characters are in the input.

TotalVerb commented 7 years ago

If someone wants a performant but incorrect algorithm, they can use sum(charwidth, _). If we are going to provide strwidth, we may as well make it correct.

JeffBezanson commented 7 years ago

strwidth(::Char) seems to make sense, since it could act as if you converted the Char to string, except possibly with a more efficient algorithm.

nalimilan commented 7 years ago

+1 for merging but not recommending sum(charwidth, s). Correctness is what matters, in particular in the present case where displaying the result is likely to be much slower than computing its width.

stevengj commented 7 years ago

It's not like charwidth is unambiguously "correct" either. utf8proc reports a reasonable value, but ultimately this depends on the font and terminal settings; see e.g. the discussion at JuliaLang/utf8proc#83.

Can you give a specific example of a Unicode string for which sum(charwidth, s) gives an unambiguously wrong result?

@oscardssmith, I'm skeptical that incorrect glyph width (i.e. computation of on-screen character size) caused the iPhone crash. I can't find a lot of technical information on the iPhone crash you seem to be referring to, but what little I can find seems to indicate that it stemmed from incorrect computation of the in-memory width (i.e. the width in bytes) of a character, which has nothing to do with the on-screen width.

JeffBezanson commented 7 years ago

My point would be that deprecating charwidth to strwidth(::Char) or textwidth(::Char) seems much safer than deprecating strwidth to sum(charwidth, s). The latter assumes much more (even if that assumption is generally valid). textwidth is also extensible to other objects that might represent text, regardless of how many characters they contain or whether they are iterable.

oscardssmith commented 7 years ago

@stevengj that may be right, I was basing it off a video by Tom Scott here https://www.youtube.com/watch?v=hJLMSllzoLA.