AndriSignorell / DescTools

Tools for Descriptive Statistics and Exploratory Data Analysis
http://andrisignorell.github.io/DescTools/

Incorrect StrTrunc on strings with accents #154

Open mayeulk opened 1 month ago

mayeulk commented 1 month ago

Strings with accents are not handled correctly. Truncating the (French) phrase "Action à réaliser" with maxlen=14 should give "Action à ...". When the first word not to be printed (here: "réaliser") contains an accent, that word is partly printed (up to and including its last accented letter).

Below, only the output for "Action a realiser" is correct (but "Action a realiser" is not correct French).

```
DescTools::StrTrunc("Action à réaliser", maxlen=14, wbound = TRUE)
[1] "Action à ré..."
DescTools::StrTrunc("Action à realiser", maxlen=14, wbound = TRUE)
[1] "Action ..."
DescTools::StrTrunc("Action a realiser", maxlen=14, wbound = TRUE)
[1] "Action a ..."
DescTools::StrTrunc("Action réaliser", maxlen=14, wbound = TRUE)
[1] "Action ré..."
DescTools::StrTrunc("Action réalisée", maxlen=14, wbound = TRUE)
[1] "Action réalisé..."
```

Tested with Package DescTools version 0.99.54, on R 4.4.0, Kubuntu 24.04 (UTF-8)

AndriSignorell commented 1 month ago

Thanks for this! However, you are barking up the wrong tree. The culprit is base::gregexpr(), which apparently is not aware of local traditions beyond the English language... ;-)

Note the following:

gregexpr("\\b\\W+\\b", "first all next?", perl = TRUE)[[1]]
[1]  6 10
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

gregexpr("\\b\\W+\\b", "first àll next?", perl = TRUE)[[1]]
[1]  6 10
attr(,"match.length")
[1] 2 1

gregexpr("\\b\\W+\\b", "first àll nèxt?", perl = TRUE)[[1]]
[1]  6 10 12
attr(,"match.length")
[1] 2 1 1
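
To make those positions and lengths easier to read, here is a small sketch that turns them into the matched substrings with base R's regmatches(). The variable names are only illustrative, and the string is written with \u escapes purely so the snippet does not depend on the file encoding.

```r
# Same test string as in the last call above ("first àll nèxt?").
x <- "first \u00e0ll n\u00e8xt?"

# Same pattern and engine as in the calls above.
m <- gregexpr("\\b\\W+\\b", x, perl = TRUE)

# regmatches() converts start positions and match lengths into substrings.
matched <- regmatches(x, m)[[1]]
print(matched)
```

With the positions reported above (6, 10, 12 with lengths 2, 1, 1), the accented letters end up inside the "non-word" matches, i.e. they are treated like separators rather than like letters.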

As far as I can see, we cannot circumvent this behaviour. May I ask you to report this directly on the R-Bugs list?
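
One avenue that may be worth checking before filing a report: PCRE accepts a `(*UCP)` prefix at the start of a pattern, which asks it to use Unicode properties for \w, \W and \b. The following is only a sketch in base R, assuming R's PCRE build has Unicode property support; it does not touch StrTrunc itself, and the variable names are made up for illustration.

```r
# Test string from the thread ("first àll nèxt?"), written with \u escapes.
x <- "first \u00e0ll n\u00e8xt?"

# Default under perl = TRUE: only ASCII letters count as word characters.
m_default <- gregexpr("\\b\\W+\\b", x, perl = TRUE)

# With (*UCP), PCRE is asked to classify characters by Unicode properties.
m_ucp <- gregexpr("(*UCP)\\b\\W+\\b", x, perl = TRUE)

# Compare which substrings each variant treats as gaps between words.
gaps_default <- regmatches(x, m_default)[[1]]
gaps_ucp <- regmatches(x, m_ucp)[[1]]
print(gaps_default)
print(gaps_ucp)
```

If the prefix behaves as intended, the second variant should report only the two spaces as gaps. Whether a pattern change like this would be appropriate inside StrTrunc is a separate question for the package.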