Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fread not handling NAs written by fwrite #3439

Open sz-cgt opened 5 years ago

sz-cgt commented 5 years ago

The documentation for fread() says the following about the na.strings parameter:

A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type 'character' is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when na.strings="NA".

However, the actual default behaviour does not match this description.

library(data.table)
dt <-
  data.table(a = c(NA, letters[2:5]),
             c = c(letters[1:2], NA, letters[4:5]),
             e = c(letters[1:4], NA))
fwrite(dt, "dt.csv")
readLines("dt.csv")
#> [1] "a,c,e" ",a,a"  "b,b,b" "c,,c"  "d,d,d" "e,e,"

all.equal(dt, fread("dt.csv"))
#> [1] "Column 'a': 'is.NA' value mismatch: 0 in current 1 in target"

all.equal(dt, fread("dt.csv", na.strings = ""))
#> [1] TRUE

Created on 2019-03-02 by the reprex package (v0.2.1)

As you can see, the file contains consecutive delimiters with no characters between them, but fread() does not return those fields as NA values. Explicitly setting na.strings = "" produces the expected behaviour, but this also contradicts the documentation, which says this should produce the blank-string behaviour instead (third line in the quote above).
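One way to sidestep the ambiguity of empty fields is to make the NA marker explicit on both the write and the read side. The following is only a sketch: it assumes fwrite()'s na argument and fread()'s na.strings argument behave as their help pages describe, and the names csv_path and round_trip are illustrative, not part of the original report.

```r
library(data.table)

# Same data as in the reprex above
dt <- data.table(a = c(NA, letters[2:5]),
                 c = c(letters[1:2], NA, letters[4:5]),
                 e = c(letters[1:4], NA))

# Illustrative temporary file path (not from the original report)
csv_path <- tempfile(fileext = ".csv")

# Write NA as the literal marker NA instead of an empty field ...
fwrite(dt, csv_path, na = "NA")

# ... and tell fread() to treat exactly that marker as NA
round_trip <- fread(csv_path, na.strings = "NA")

all.equal(dt, round_trip)
```

The trade-off is that a genuine character value "NA" in the data could then be confused with a missing value, so this only suits data where that string does not occur.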

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0      ps_1.3.0        packrat_0.5.0   digest_0.6.18   rprojroot_1.3-2
 [6] R6_2.4.0        backports_1.1.3 reprex_0.2.1    evaluate_0.13   rlang_0.3.1    
[11] fs_1.2.6        callr_3.1.1     whisker_0.3-2   rmarkdown_1.11  tools_3.5.2    
[16] xfun_0.5        compiler_3.5.2  processx_3.2.1  clipr_0.5.0     htmltools_0.3.6
[21] knitr_1.21     

ysaidani commented 2 years ago

To rephrase the issue: The fread() help isn't consistent with the function's actual default behaviour.

Possible solutions:

  1. Adjust the default behaviour to na.strings=getOption("datatable.na.strings",""), see #4288. That pull request currently appears to fail a number of checks, so in the meantime one could...
  2. Adjust the documentation to describe the default behaviour correctly.
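
Until either solution lands, a user-side workaround might be to set the option that solution 1 refers to. This sketch assumes that the installed version of fread() takes its na.strings default from getOption("datatable.na.strings", ...); check the signature in ?fread before relying on it. The names csv_path, old_opts and round_trip are illustrative.

```r
library(data.table)

# Same data as in the original report
dt <- data.table(a = c(NA, letters[2:5]),
                 c = c(letters[1:2], NA, letters[4:5]),
                 e = c(letters[1:4], NA))

# Illustrative temporary file path
csv_path <- tempfile(fileext = ".csv")
fwrite(dt, csv_path)

# Assumption: fread() reads its na.strings default from this option.
# options() returns the previous values so they can be restored afterwards.
old_opts <- options(datatable.na.strings = "")

round_trip <- fread(csv_path)   # no explicit na.strings argument

# Restore the previous option state
options(old_opts)

all.equal(dt, round_trip)
```

Setting the option once per session or project keeps individual fread() calls unchanged, at the cost of a global setting that other code in the same session will also see.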