Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

`fread` doesn't accept `encoding = "Windows-1252"` #5179

Open Kodiologist opened 2 years ago

Kodiologist commented 2 years ago

"Latin-1" is accepted (as documented), but presumably "Latin-1" here actually means Windows-1252, as it does in many cases, not ISO-8859-1. The correct name ought to be recognized.
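To illustrate why the two names are not interchangeable, here is a minimal Python sketch. It relies on Python's `latin-1` codec implementing ISO-8859-1 and its `cp1252` codec implementing Windows-1252. The byte string `raw` is a made-up example, not data from this report.

```python
# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252,
# but C1 control characters in ISO-8859-1.
raw = b"\x93hello\x94"  # hypothetical sample bytes

as_cp1252 = raw.decode("cp1252")   # Windows-1252 interpretation
as_latin1 = raw.decode("latin-1")  # ISO-8859-1 interpretation

# repr() makes the invisible control characters visible.
print(repr(as_cp1252))
print(repr(as_latin1))
```

The two decodings differ only in the first and last character, which is the kind of mismatch that goes unnoticed until a file contains punctuation from the 0x80 to 0x9F range.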

MichaelChirico commented 2 years ago

I believe we're just following the read.csv behavior here:

https://rdrr.io/r/utils/read.table.html

This also follows the advice in, e.g., Writing R Extensions (WRE), which specifically calls out only latin1:

https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Character-encoding-issues

If you could point to any documentation backing you up here, and especially if you could show that we are inconsistent w.r.t. base behavior, that would be much appreciated.

Kodiologist commented 2 years ago

I didn't realize this was a problem throughout R core, to the extent that the internal enum value is CE_LATIN1. I shouldn't be that surprised, though.

> If you could point to any documentation backing you up here

Are you asking for documentation that "Windows-1252" is a name for this encoding, that it's the correct name, or that "Latin-1" is ambiguous?

> and especially if you could show that we are inconsistent w.r.t. base behavior

I can confirm that the name "Windows-1252" is recognized by, e.g., `file`:

$ python3 -c 'with open("/tmp/foo.txt", "w", encoding = "Windows-1252") as o: o.write("“hello”\n")'
$ R --vanilla --slave -e 'readLines(file("/tmp/foo.txt", encoding = "Windows-1252"))'
[1] "“hello”"

You can tell that R (and Python) aren't just using a funny name for ISO-8859-1 here because the characters “ and ” don't exist in ISO-8859-1.
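The claim about the curly quotes can be checked from the Python side with a short sketch. The helper `can_encode` is a hypothetical name introduced here for illustration. The encoding names are ones Python's codec registry accepts, with `Windows-1252` being the same spelling used in the one-liner above.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The same string written to /tmp/foo.txt above: U+201C, "hello", U+201D.
quoted = "\u201chello\u201d\n"

print(can_encode(quoted, "Windows-1252"))  # curly quotes map to 0x93 / 0x94
print(can_encode(quoted, "ISO-8859-1"))    # no code points for curly quotes
print(quoted.encode("Windows-1252"))
```

If Python were treating "Windows-1252" as an alias for ISO-8859-1, the first call would fail in the same way as the second.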