Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

print(DT) displays non-native-representable strings as malformed letters on Windows #4787

Closed. shrektan closed this issue 4 years ago

shrektan commented 4 years ago

On Windows, when DT contains strings that cannot be represented in the native encoding, print(DT) displays them as malformed characters, even if they are valid UTF-8 strings and display correctly when printed as a character vector.

This is sometimes inconvenient. For example, on a machine that runs Bloomberg Terminal, where the language setting is usually English, anything in DT that contains Chinese characters is impossible to read. Working in such an environment is painful, and the behavior usually confuses people who are not familiar with this issue.

The root cause is that print.data.table() calls format.data.table(), which uses format(). However, according to ?format, it returns a string that is "in the current locale's encoding". So UTF-8 strings that can't be represented in the current locale are converted into unreadable characters.

https://github.com/Rdatatable/data.table/blob/63632e6f55f1f5289c689edab37f6a69d2df25cf/R/print.data.table.R#L173

(Note: this example is only reproducible on a Windows computer with the Chinese language setting. On other Windows machines, you may try changing the locale or replacing the UTF-8 string with another one that is not representable in the native encoding.)

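library(data.table)
utf8 = "fa\u00e7ile"        # a valid UTF-8 string
text = c(utf8, "aaaaaaaa")
dt = data.table(A = text)
dt      # printed via print.data.table(): the string shows up malformed
dt$A    # printed as a character vector: the string displays correctly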

Note that this is essentially an R issue: print.data.frame() suffers from the same problem because it calls format.data.frame() internally. However, I think we may be able to handle this situation better, if it's not too difficult.
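The conversion described above can be illustrated without a Chinese-locale Windows machine. The following is only a minimal sketch, not what format() does internally: it uses base R's iconv() and assumes "ASCII" as a stand-in for a native encoding that lacks the character. is_representable is a hypothetical helper name, not part of R or data.table, and the exact result may depend on the platform's iconv implementation.

```r
# The same valid UTF-8 string used in the report.
utf8 <- "fa\u00e7ile"

# Assumption: "ASCII" stands in for a native encoding that lacks the character.
# With the default sub = NA, a string that cannot be converted becomes NA.
as_ascii <- iconv(utf8, from = "UTF-8", to = "ASCII")
is.na(as_ascii)

# With sub = "byte", each unconvertible byte is written as a <xx> hex escape,
# which is similar in spirit to the unreadable output seen when printing.
escaped <- iconv(utf8, from = "UTF-8", to = "ASCII", sub = "byte")
escaped

# Hypothetical helper: TRUE where a string can be represented in the target
# encoding. to = "" means the encoding of the current locale.
is_representable <- function(x, to = "") {
  !is.na(iconv(enc2utf8(x), from = "UTF-8", to = to))
}
is_representable(c(utf8, "aaaaaaaa"), to = "ASCII")
```

Calling is_representable(dt$A) with the default to = "" would flag, under the same assumptions, which values are at risk of being printed as malformed characters in the current session.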

jangorecki commented 4 years ago

It is always good to ask base R whether they are considering improving that; then we don't have to do anything :)

shrektan commented 4 years ago

Filed an issue on https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17960

jangorecki commented 4 years ago

@shrektan thanks! what could also be useful to add to bugzilla report is print output after applying the fix, so it is easy to see "before" vs "after"

shrektan commented 4 years ago

I'm going to close this, as Tomas Kalibera points out in https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17960#c3:

This is a known problem, non-trivial printing will not work with non-representable characters even with RGui. The code for printing is too complicated for this to be easy to fix, and more time we would spend on this issues, longer it would take to have a stable version of R working in UTF-8 on Windows.

And according to "UTF-8 build of R and CRAN packages":

Windows 10 (November 2019 release and newer) allows applications to use UTF-8 as their native encoding when interfacing both with the C library (needs to be UCRT) and with the operating system. This new Windows feature, present in Unix systems for many years, finally allows R on Windows to work reliably with all Unicode characters.

I believe R-core is starting to release (probably in the not-too-distant future) a Windows version of R that uses UTF-8 as the native encoding, which will finally resolve at the root most of the pain of using a non-UTF-8 encoding on Windows. I'm looking forward to that day so much.

This issue will be resolved automatically on that day, and I don't want to add further complications to the encoding handling unless the solution is very straightforward and easy.
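Whether a given R session already uses UTF-8 as its native encoding can be checked with base R's l10n_info(). A minimal sketch, where session_is_utf8 is a hypothetical helper name and not an existing R or data.table function:

```r
# Hypothetical helper: TRUE if the native encoding of the current session
# is UTF-8, according to base R's l10n_info().
session_is_utf8 <- function() {
  isTRUE(l10n_info()[["UTF-8"]])
}

session_is_utf8()
```

When this returns TRUE, every valid UTF-8 string is representable in the native encoding, so the conversion that causes the malformed output should not lose any characters.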

So I'm closing this issue for now.

Thanks.