gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/
Other
303 stars 45 forks source link

stri_width some codepoints need fixing (grave accent, eszett, etc.) #429

Closed hughjonesd closed 3 years ago

hughjonesd commented 3 years ago

I'm not sure if this is a bug or even anyone's fault, but the recent stringi update broke my "huxtable" package on CRAN by changing the behaviour of stri_pad_right and friends - presumably because of linking to the new ICU4C library. Did anyone do a reverse dependency check? It would have been nice to get warning of this. huxtable uses stringi via stringr, so that may be relevant.

dankearney commented 3 years ago

I noticed that one of my projects which uses stringi as a dependency is also failing with 1.6.1.

hughjonesd commented 3 years ago

One underlying issue is that nchar(..., type = "width") and stringi::stri_width() now give different answers. (Maybe they always could, just that now it has bitten me.)

R 4.0.3, stringi version 1.6.1:

> nchar("`", type = "w")
[1] 1
> stringi::stri_width("`")
[1] 0

stringi 1.5.3:

> nchar("`", type = "w")
[1] 1
> stringi::stri_width("`")
[1] 1
gagolews commented 3 years ago

Yes, 1.6.1 was a huge update and unfortunately this will cause a couple of packages to break. Once all issues are identified (which will take 1-2 more weeks), I will submit an updated version to CRAN asap.

Thank you for digging into the problem, in this case it is indeed caused by the new stri_width which now has a few obvious flaws, including the grave accent being classified as of width 0 (however, it is an accent, maybe it should be of width 0 after all).

Fix is on the way.

gagolews commented 3 years ago

@dankearney "I noticed that one of my projects which uses stringi as a dependency is also failing with 1.6.1." — could you please be more specific? Which package? Which tests are failing, is it due to the issues with padding/string width? Thank you for contributing to the testing of stringi.

gagolews commented 3 years ago

[note to self] Don't set width of GC=Sk to 0

[note to self] Don't set width of U_EA_AMBIGUOUS to 2

gagolews commented 3 years ago

This is now fixed. All characters <= U+CCFF of width 0 are the same as in 1.5.3. However, v1.6.2 now more eagerly classifies characters as of width 2, including emojis and other symbols. This should not be a problem in you package, though.

hughjonesd commented 3 years ago

That's ace. Is it likely to hit CRAN before May 23? My package gets archived then, and it'll take some time for tests to start working with your update, so want to be careful.

gagolews commented 3 years ago

Absolutely; no worries.

gagolews commented 3 years ago

I've now submitted v1.6.2 to CRAN. :crossed_fingers: