Closed: MichaelChirico closed this issue 4 years ago
I think the bug is that this works at all, even for scan().
An actual reproducible example would have helped: I presume you meant
read.csv("test1.csv")
and with that I do not get what you show. Also, the file you attach and the listing you give are different.
AFAICS the issue is embedded nuls, and not "\x00" values. Embedded nuls have not been supported in R for many years. I have added a warning for this case.
Two further comments:
1) this file has a BOM, so should have been read with
fileEncoding = "UTF-8-BOM"
unless in a UTF-8 locale on a platform which skips BOMs (e.g. OS X).
2) The 'skipNul' argument available in R-devel (3.1.0-to-be) will help in this case.
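A minimal sketch of both suggestions together. The file-construction step below is a hypothetical stand-in for the attached test1.csv (a UTF-8 file with a BOM and embedded NUL bytes); the real attachment may differ:

```r
# Reconstruct a small CSV with a UTF-8 BOM and an embedded NUL byte in
# column B of every row (hypothetical stand-in for the attachment).
con <- file("test1.csv", "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)        # UTF-8 byte-order mark
writeBin(charToRaw("ColA,ColB,ColC\n"), con)
for (x in c("a", "b", "c", "d", "e")) {
  writeBin(c(charToRaw(paste0(x, ",")), as.raw(0), charToRaw(",1\n")), con)
}
close(con)

# fileEncoding = "UTF-8-BOM" strips the BOM so the first header name is
# not mangled; skipNul = TRUE (R >= 3.1.0) drops the embedded NULs, so
# the file parses as three columns with an empty ColB.
df <- read.csv("test1.csv", fileEncoding = "UTF-8-BOM", skipNul = TRUE)
```

With the NULs skipped, ColB is entirely empty and is read as an all-NA column.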
Hello Brian
Thank you for looking into it.
I have tried with fileEncoding = "UTF-8-BOM" and still get the issue, but the headers are correct:
  ColA ColB ColC
1    a   NA   NA
2    b   NA   NA
3    c   NA   NA
4    d   NA   NA
5    e   NA    1
6    f   NA    1
Sorry for not being crystal clear; I was trying to present the problem with as much information as I could. When opening the test1.csv attached to my first message in Notepad++, the values in ColB appear as "NUL", but when searching for the \x00 value, the Notepad++ search selects those "NUL" values, which is what led me to believe they were indeed \x00. I am no expert in encoding, so I might be completely wrong.
If you do not see this issue when using read.csv("test1.csv") on your side, I'm thinking this could be linked to my local setup as well.
Anyway, thank you for your time. I will be looking for the skipNul parameter in the next version to see if it helps with my issue.
Best regards
Created attachment 1546 [details] Test csv file
Happy new year everyone,
I would like to report what seems to me like a bug in the read.table function, which in turn affects the read.csv function as well.
I have a csv file which unfortunately contains a column full of \x00 values. While the scan function seems to handle that case correctly, i.e. returning an empty value for those entries, it seems that the C routine in read.table used to determine the column widths over the first 5 lines reads them incorrectly and interprets them as an end-of-line marker.
The result is that the first 5 lines (if there are no headers) are cut at the position of the \x00, while the subsequent ones, which are read using scan directly, are not.
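Since R character strings cannot themselves contain a NUL byte, reproducing such a file requires writing raw bytes directly. A hedged sketch (the file name nul_test.csv and the exact layout are illustrative assumptions, not the actual attachment):

```r
# Assemble a CSV whose second field is a single NUL byte on every row.
# NULs cannot appear inside R string literals, so each row is built
# from raw vectors and written in binary mode.
rows <- lapply(c("a", "b", "c", "d", "e"), function(x) {
  c(charToRaw(paste0(x, ",")), as.raw(0), charToRaw(",1\n"))
})
con <- file("nul_test.csv", "wb")
writeBin(charToRaw("ColA,ColB,ColC\n"), con)
writeBin(unlist(rows), con)
close(con)

# Confirm the NUL bytes really are in the file: one per data row.
bytes <- readBin("nul_test.csv", "raw", file.info("nul_test.csv")$size)
sum(bytes == as.raw(0))  # 5
```

Reading this file back with read.table/read.csv should then exercise the behaviour described above (the exact result depends on the R version and locale).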
Here's a quick example:
My csv file, test1.csv:
ColA,ColB,ColC
a,\x00,1
b,\x00,1
c,\x00,1
d,\x00,1
e,\x00,1
read.csv(test.csv) returns:
  ï..ColA ColB ColC
1       a   NA   NA
2       b   NA   NA
3       c   NA   NA
4       d   NA   NA
5       e   NA    1
6       f   NA    1
This is probably very dependent on the encoding of the \x00 value, which I unfortunately cannot guarantee.
I have already tried about every combination of the quote, fileEncoding and separator parameters.
I doubt that changing the na.strings parameter would change the behaviour, since the C function C_readtablehead doesn't take na.strings as an argument.
Any help is welcome,
Best regards
Matthieu Petiteville