dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

UTF-8/odd character handling causes header headaches #188

Open SimpleAddress4390 opened 1 year ago

SimpleAddress4390 commented 1 year ago

Hi. Love dfSummary. I am processing large numbers of dataframes where some fields have Hebrew characters. I've been able to isolate an example where the original column text causes the dfSummary headers to become RMarkdown headings.

By using stringr to remove everything but alpha/numeric and punctuation, it works, but that approach of course assumes I know which fields to process before passing to dfSummary.

Is this just a known limitation, or a bug, or ...

I've attached a reproducible example (an .Rmd file) plus HTML output showing both the failing and the working case.

dfSummary-issue-20230731.zip

Thanks for any insight.

dcomtois commented 1 year ago

Hello,

I notice you use the method = argument in the dfSummary() call directly; take a closer look at the vignette (https://cran.r-project.org/web/packages/summarytools/vignettes/rmarkdown.html), you'll see that you need to use print(), i.e.:

print(dfSummary(...), method = 'render')

You'll see a big difference in the rendering... Hope this resolves the issue, and sorry for the delay (I suggest you try StackOverflow to get a quicker response)
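For readers following along, here is a minimal sketch of the pattern described above. The data frame `df` and its column names are made up for illustration, and the chunk option `results='asis'` mentioned in the comment is what the linked vignette is understood to recommend; treat both as assumptions rather than as the exact setup from the attached example.

```r
# Hypothetical data; any data frame would do.
df <- data.frame(
  site    = c("north", "south", "north"),
  reading = c(1.5, 2.0, 3.25),
  stringsAsFactors = FALSE
)

# Only run the summarytools part when the package is installed.
if (requireNamespace("summarytools", quietly = TRUE)) {
  # Build the summary object first; no 'method' argument here.
  summary_obj <- summarytools::dfSummary(df, graph.col = FALSE)

  # Pass 'method' to print(), not to dfSummary().
  # In an R Markdown chunk, this is typically paired with results='asis'.
  print(summary_obj, method = "render")
}
```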

SimpleAddress4390 commented 1 year ago

Thanks for the reply. I'll look over at StackOverflow. I did know about print() and had tested that (but removed it to simplify the example). It fails with print() as well. I have traced the issue to the characters themselves. Applying a UTF-8 normalization routine seems to fix it, but the fix has to be applied to the data, field by field, rather than through the routine's options.

These two strings look the same but do not match using "==":

str(a)

chr "בית אל, ניידת - 7"

str(b)

chr "בית אל, ניידת - 7"

charToRaw(a)

[1] d7 91 d7 99 d7 aa c2 a0 d7 90 d7 9c 2c c2 a0 d7 a0 d7 99 d7 99 d7 93 d7 aa c2 a0 2d c2 a0 37

charToRaw(b)

[1] d7 91 d7 99 d7 aa 20 d7 90 d7 9c 2c 20 d7 a0 d7 99 d7 99 d7 93 d7 aa 20 2d 20 37
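The only difference between the two dumps is that `a` has the byte pair `c2 a0` (U+00A0, NO-BREAK SPACE) wherever `b` has `20` (an ordinary space). The base R sketch below rebuilds both strings from the hex dumps, locates the differing code points, and swaps the no-break spaces for plain ones. The helper names `from_hex` and `replace_nbsp` are illustrative, not part of any package, and whether this substitution alone is enough to fix the dfSummary rendering is not confirmed in this thread.

```r
# Hex dumps copied from the charToRaw() output above.
hex_a <- paste("d7 91 d7 99 d7 aa c2 a0 d7 90 d7 9c 2c c2 a0 d7 a0 d7 99",
               "d7 99 d7 93 d7 aa c2 a0 2d c2 a0 37")
hex_b <- paste("d7 91 d7 99 d7 aa 20 d7 90 d7 9c 2c 20 d7 a0 d7 99",
               "d7 99 d7 93 d7 aa 20 2d 20 37")

# Illustrative helper: turn a space-separated hex dump into a UTF-8 string.
from_hex <- function(hex) {
  bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))
  s <- rawToChar(bytes)
  Encoding(s) <- "UTF-8"
  s
}

a <- from_hex(hex_a)
b <- from_hex(hex_b)

# Compare code point by code point to find where the strings differ.
cp_a <- utf8ToInt(a)
cp_b <- utf8ToInt(b)
diff_pos <- which(cp_a != cp_b)
print(diff_pos)                              # positions of the mismatches
print(unique(cp_a[diff_pos]) == 0xA0L)       # 'a' holds U+00A0 there
print(unique(cp_b[diff_pos]) == 0x20L)       # 'b' holds U+0020 there

# Illustrative helper: replace every U+00A0 with an ordinary space.
replace_nbsp <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)
    cp[cp == 0xA0L] <- 0x20L
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

print(replace_nbsp(a) == b)
```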

These two DO match after normalizing with:

mutate(fixedString = utf8_normalize(badString, map_case = TRUE, map_compat = TRUE, map_quote = TRUE, remove_ignorable = TRUE))
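To avoid having to know in advance which fields need fixing, the same normalization could be applied to every character column before calling dfSummary. The sketch below assumes the `utf8` package is installed and uses a made-up data frame with a shortened version of the string above. It sets only `map_compat = TRUE`, which applies compatibility mappings and is what turns U+00A0 into a plain space; `map_case = TRUE` from the call above would additionally case-fold the text, which may not be wanted for every column.

```r
# Shortened test strings: the same Hebrew words, separated by a no-break
# space (U+00A0) in 'a' and by an ordinary space in 'b'.
a <- "\u05d1\u05d9\u05ea\u00a0\u05d0\u05dc"
b <- "\u05d1\u05d9\u05ea \u05d0\u05dc"

# Hypothetical data frame mixing character and numeric columns.
df <- data.frame(site = c(a, b), n = c(1, 2), stringsAsFactors = FALSE)

if (requireNamespace("utf8", quietly = TRUE)) {
  # Find every character column, whatever it is called...
  is_chr <- vapply(df, is.character, logical(1))

  # ...and normalize each one in place.
  df[is_chr] <- lapply(df[is_chr], utf8::utf8_normalize, map_compat = TRUE)

  # After normalization the two values should compare equal.
  print(df$site[1] == df$site[2])
}
```

The cleaned `df` can then be passed to dfSummary as usual.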

Again, thanks!

dcomtois commented 1 year ago

OK, I'll try to look into it in more detail. In the meantime, feel free to share new insights here! Thx