Open Kodiologist opened 2 years ago
I believe we're just following the read.csv behavior here:
https://rdrr.io/r/utils/read.table.html
This also follows advice in, e.g., WRE, that only specifically calls out latin1:
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Character-encoding-issues
If you could point to any documentation backing you up here, and especially if you could show that we are inconsistent w.r.t. base behavior, that would be much appreciated.
I didn't realize this was a problem throughout R core, to the extent that the internal enum value is CE_LATIN1. I shouldn't be that surprised, though.
> If you could point to any documentation backing you up here
Are you asking for documentation that "Windows-1252" is a name for this encoding, that it's the correct name, or that "Latin-1" is ambiguous?
> and especially if you could show that we are inconsistent w.r.t. base behavior
I can confirm that the name "Windows-1252" is recognized by, e.g., file:
$ python3 -c 'with open("/tmp/foo.txt", "w", encoding = "Windows-1252") as o: o.write("“hello”\n")'
$ R --vanilla --slave -e 'readLines(file("/tmp/foo.txt", encoding = "Windows-1252"))'
[1] "“hello”"
You can tell that R (and Python) aren't just using a funny name for ISO-8859-1 here because the characters “ and ” don't exist in ISO-8859-1.
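For anyone who wants to check the difference between the two encodings without touching the filesystem, here is a minimal Python sketch. It uses only the standard codecs, the variable names are just for illustration, and it says nothing about what R does internally. It only shows that the curly quotes have byte values in Windows-1252 and no representation at all in ISO-8859-1.

```python
# Curly quotes: U+201C (left) and U+201D (right).
text = "\u201chello\u201d\n"

# Windows-1252 (Python's "cp1252" codec) assigns them the bytes 0x93 and 0x94.
cp1252_bytes = text.encode("cp1252")

# ISO-8859-1 has no code points for these characters, so encoding fails.
try:
    text.encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# Reading the same bytes back as ISO-8859-1 does not raise, but it yields
# the C1 control characters U+0093 and U+0094 instead of quotation marks.
misread = cp1252_bytes.decode("iso-8859-1")

print(cp1252_bytes)
print(latin1_can_encode)
print(repr(misread))
```

The last step is the practical hazard: a reader that really treats the data as ISO-8859-1 silently produces control characters rather than an error.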
"Latin-1" is accepted (as documented), but presumably "Latin-1" here actually means Windows-1252, as it does in many cases, not ISO-8859-1. The correct name ought to be recognized.