MichaelChirico / r-bugs

A ⚠️ read-only ⚠️ mirror of https://bugs.r-project.org/

[BUGZILLA #15625] read.table: Incorrect handling of character nuls over the first 5 lines #5186

Closed MichaelChirico closed 4 years ago

MichaelChirico commented 4 years ago

Created attachment 1546 [details] Test csv file

Happy new year everyone,

I would like to report what seems to me like a bug in the read.table function, which in turn affects the read.csv function as well.

I have a csv file which unfortunately contains a column full of \x00 values. While the scan function seems to handle that case correctly, i.e. it returns an empty value for those entries, the C code that read.table uses to check column widths over the first 5 lines reads them incorrectly and interprets them as an end-of-line marker.

The result is that the first 5 lines (if there is no header) are cut at the position of the \x00, while the following lines, which are read by scan directly, are not.
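The behavior can be sketched with a throwaway file, since an R string literal cannot itself hold a nul; the file and column names below are illustrative, not the attached test1.csv:

```r
# Build a small CSV whose second column is a literal NUL byte (0x00).
# R string literals cannot contain nuls, so write the bytes directly.
tf <- tempfile(fileext = ".csv")
con <- file(tf, "wb")
writeBin(charToRaw("ColA,ColB,ColC\n"), con)
for (ch in c("a", "b", "c", "d", "e", "f")) {
  writeBin(c(charToRaw(paste0(ch, ",")), as.raw(0), charToRaw(",1\n")), con)
}
close(con)

# Depending on the R version and locale, this may truncate the first
# lines at the nul, warn about embedded nuls, or both.
read.csv(tf)
```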

Here's a quick example:

My csv file: test1.csv

ColA,ColB,ColC
a,\x00,1
b,\x00,1
c,\x00,1
d,\x00,1
e,\x00,1

read.csv(test.csv) returns:

ï..ColA ColB ColC
1 a NA NA
2 b NA NA
3 c NA NA
4 d NA NA
5 e NA 1
6 f NA 1

This is probably very dependent on the encoding of the \x00 value, which I unfortunately cannot guarantee.

I have already tried about every combination of quote, fileEncoding and separator settings.

I doubt that changing the na.strings parameter would change the behavior, since the C function C_readtablehead does not take na.strings as an argument.

Any help is welcome,

Best regards

Matthieu Petiteville



MichaelChirico commented 4 years ago

I think the bug is that this works at all, even for scan().

An actual reproducible example would have helped: I presume you meant

read.csv("test1.csv")

and with that I do not get what you show. Also, the file you attach and the listing you give are different.

AFAICS the issue is embedded nuls, and not "\x00" values. Embedded nuls have not been supported in R for many years. I have added a warning for this case.



MichaelChirico commented 4 years ago

Two further comments:

1) this file has a BOM, so should have been read with

fileEncoding = "UTF-8-BOM"

unless in a UTF-8 locale on a platform which skips BOMs (e.g. OS X).

2) The 'skipNul' argument available in R-devel (3.1.0-to-be) will help in this case.
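For later readers, a sketch of how the two suggestions combine, assuming R >= 3.1.0 (the filename is the attachment from the first message):

```r
# Check whether this R has the argument (it was added in R 3.1.0):
"skipNul" %in% names(formals(read.table))

# If TRUE, the BOM and the embedded nuls can both be handled at read time:
# read.csv("test1.csv", fileEncoding = "UTF-8-BOM", skipNul = TRUE)
```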



MichaelChirico commented 4 years ago

Hello Brian,

Thank you for looking into it.

I have tried with fileEncoding = "UTF-8-BOM" and still get the issue, but the headers are now correct:

ColA ColB ColC
1 a NA NA
2 b NA NA
3 c NA NA
4 d NA NA
5 e NA 1
6 f NA 1

Sorry for not being crystal clear; I was trying to present the problem with as much information as I could. When opening the test1.csv attached to my first message in Notepad++, the values in ColB appear as "NUL", but when searching for the \x00 value, the Notepad++ search selects those "NUL" values, which is what led me to believe they were indeed \x00. I am no expert in encoding, so I might be completely wrong.
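One way to check the bytes directly from R, rather than through the Notepad++ display, is to read the file as raw. A sketch on a throwaway one-row file (not the attached test1.csv), where a nul shows up as the raw byte 00:

```r
# Write "a,<NUL>,1\n" byte by byte, then look for 0x00 in the raw contents.
tf <- tempfile()
writeBin(c(charToRaw("a,"), as.raw(0), charToRaw(",1\n")), tf)
bytes <- readBin(tf, what = "raw", n = file.size(tf))
which(bytes == as.raw(0))   # the embedded nul sits at byte 3 here
```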

If you do not see this issue when using read.csv("test1.csv") on your side, I'm thinking this could be linked to my local setup as well.

Anyway, thank you for your time. I will look into the skipNul parameter in the next version to see whether it resolves my issue.

Best regards

