MichaelChirico / r-bugs

A ⚠️read-only⚠️ mirror of https://bugs.r-project.org/

[BUGZILLA #15971] Inconsistent treatment of character vectors with read.table or read.csv #5440

Open MichaelChirico opened 4 years ago

MichaelChirico commented 4 years ago

Created attachment 1657: tiny csv file 1

I attach a tiny .csv file, na1.csv. I created na2.csv by editing out the first column of na1.csv. (I can only attach one file, but I have pasted the contents below.)

na1.csv
====================
a, b, c
1, "b", 1
2, "", 2
, "b", 3
4, , 4
5, "NA", 5
===========================

na2.csv
===================
b, c
"b", 1
"", 2
"b", 3
, 4
"NA", 5
==========================

Here is what I get when I read them into dataframes:

df1 <- read.csv("na1.csv")
df1

   a   b c
1  1   b 1
2  2     2
3 NA   b 3
4  4     4
5  5  NA 5

df2 <- read.csv("na2.csv")
df2

     b c
1    b 1
2      2
3    b 3
4      4
5 <NA> 5

df1$b==df2$b

Error in Ops.factor(df1$b, df2$b) : level sets of factors are different

levels(df1$b)

[1] " " " " " b" " NA"

levels(df2$b)

[1] "" " " "b"

If I read them with as.is=TRUE, I again get the extra spaces in df1$b. Also, again, df1$b[5] is " NA" rather than NA.

I can't see why this would be "correct" behavior. I apologize if I've missed something here.

Thanks for your great work on R!

Best regards,

Joe Ritter



MichaelChirico commented 4 years ago

I think this one can be closed...

The help page for read.csv references RFC 4180,

https://tools.ietf.org/html/rfc4180

which states that "Spaces are considered part of a field and should not be ignored."

In going from na1.csv to na2.csv, you didn't just remove the first column; you also removed the leading space from each field of the second column. As far as I can tell, the behavior matches the documented specification.
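For anyone who wants to check this, here is a minimal sketch, assuming the file contents are exactly as pasted in the report (the original whitespace may differ). It reads the same text twice, once with the defaults and once with `strip.white = TRUE`, which `read.csv` passes through to `read.table`. The variable names `csv_text`, `kept`, and `stripped` are only for illustration.

```r
# Contents of na1.csv as pasted in the report (an assumption: the
# attachment itself may contain different whitespace).
csv_text <- 'a, b, c
1, "b", 1
2, "", 2
, "b", 3
4, , 4
5, "NA", 5'

# Default behaviour: the space after each comma is kept as part of the field.
kept <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# strip.white = TRUE removes leading and trailing white space from
# character fields before they are converted.
stripped <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                     strip.white = TRUE)

# Compare the second column under both settings.
print(kept$b)
print(stripped$b)
```

With the space stripped, the fifth value of `b` is the string "NA" before conversion, so it is treated as a missing value, the same as when na2.csv is read.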

