Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

`fread` and `fwrite`: character CSV file whose last column contains NA #5080

Open XilinMao opened 3 years ago

XilinMao commented 3 years ago

# Minimal reproducible example

I write a character data.table to a CSV file. The last column contains an NA:

library(data.table)

test <- matrix(letters[1:12], 3, 4)
test[2, 4] <- NA
test <- data.table(test)

test
#    V1 V2 V3   V4
# 1:  a  d  g    j
# 2:  b  e  h <NA>
# 3:  c  f  i    l

fwrite(test, "~/tmp/test.csv")

The CSV file looks like this:

V1,V2,V3,V4
a,d,g,j
b,e,h,
c,f,i,l

When I read it back, the NA becomes an empty string instead of NA:

test_2 <- fread("~/tmp/test.csv")

test_2
#    V1 V2 V3 V4
# 1:  a  d  g  j
# 2:  b  e  h   
# 3:  c  f  i  l

is.na(test_2[2, 4])
#         V4
# [1,] FALSE

test_2[2, 4] == ""
#        V4
# [1,] TRUE

I find that this does not happen when the NA is not in the last column, or when the data is numeric.

I know I can solve it by setting na.strings="", but I want to know whether this inconsistency (NA in the last column or not) is a bug. And how does fread process this situation?
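The workaround and the numeric case mentioned above can be sketched as follows. This is a minimal illustration, not taken from the report: the table `dt`, its column names, and the temporary file path are assumptions made for the example. It writes one character and one numeric column, each with an NA, and reads the file back twice, once with the default settings and once with `na.strings = ""`.

```r
library(data.table)

# Hypothetical table for this sketch: one character and one numeric column,
# each with a missing value in row 2.
dt <- data.table(id = 1:3, chr = c("a", NA, "c"), num = c(1.5, NA, 3))

tmp <- tempfile(fileext = ".csv")  # temporary path chosen for the example
fwrite(dt, tmp)                    # both NAs are written as empty fields

# Default read: the empty numeric field is NA; the report above observes
# that the empty character field comes back as "" instead.
default_read <- fread(tmp)

# Workaround from the report: treat empty fields as missing.
na_read <- fread(tmp, na.strings = "")

is.na(default_read$num[2])
is.na(default_read$chr[2])
is.na(na_read$chr[2])
is.na(na_read$num[2])

unlink(tmp)
```

If the data can also contain genuine empty strings, it may be worth checking how they survive the round trip before relying on `na.strings = ""`; the `fread` documentation describes `,,` and `,"",` as distinct cases.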

# Output of sessionInfo()

# > sessionInfo()
# R version 3.6.3 (2020-02-29)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 20.04.2 LTS
# 
# Matrix products: default
# BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_rt.so
# 
# locale:
#   [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
# [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.utf8        LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
# [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
# 
# attached base packages:
#   [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] GRS.test_1.1       Rmosek_8.1.82      Matrix_1.2-18      rmgarch_1.3-7      rugarch_1.4-4      forecast_8.15     
# [7] vars_1.5-3         lmtest_0.9-38      urca_1.3-0         strucchange_1.5-2  sandwich_3.0-1     MASS_7.3-51.5     
# [13] RMariaDB_1.1.1     knitr_1.33         rmarkdown_2.9      ggthemes_4.2.4     ggsci_2.9          scales_1.1.1      
# [19] ggplot2_3.3.5      RColorBrewer_1.1-2 corrplot_0.90      mongolite_1.6      RJDBC_0.2-8        rJava_1.0-4       
# [25] RMySQL_0.10.22     DBI_1.1.1          dplyr_1.0.7        writexl_1.4.0      readxl_1.3.1       zoo_1.8-9         
# [31] reshape2_1.4.4     data.table_1.14.0  matrixStats_0.59.0
# 
# loaded via a namespace (and not attached):
#   [1] nlme_3.1-144                spd_2.0-1                   xts_0.12.1                  bit64_4.0.5                
# [5] numDeriv_2016.8-1.1         tools_3.6.3                 utf8_1.2.1                  R6_2.5.0                   
# [9] KernSmooth_2.23-16          colorspace_2.0-2            nnet_7.3-13                 withr_2.4.2                
# [13] tidyselect_1.1.1            bit_4.0.4                   curl_4.3.2                  compiler_3.6.3             
# [17] tseries_0.10-48             fracdiff_1.5-1              mvtnorm_1.1-2               quadprog_1.5-8             
# [21] stringr_1.4.0               digest_0.6.27               pkgconfig_2.0.3             htmltools_0.5.1.1          
# [25] rlang_0.4.11                TTR_0.24.2                  quantmod_0.4.18             generics_0.1.0             
# [29] jsonlite_1.7.2              mclust_5.4.7                magrittr_2.0.1              Rcpp_1.0.7                 
# [33] munsell_0.5.0               fansi_0.5.0                 lifecycle_1.0.0             stringi_1.7.3              
# [37] plyr_1.8.6                  grid_3.6.3                  crayon_1.4.1                lattice_0.20-40            
# [41] SkewHyperbolic_0.4-0        hms_1.1.0                   pillar_1.6.1                corpcor_1.6.9              
# [45] glue_1.4.2                  evaluate_0.14               DistributionUtils_0.6-0     vctrs_0.3.8                
# [49] nloptr_1.2.2.2              cellranger_1.1.0            gtable_0.3.0                purrr_0.3.4                
# [53] ks_1.13.2                   xfun_0.24                   ff_4.0.4                    Rmpfr_0.8-4                
# [57] pracma_2.3.3                GeneralizedHyperbolic_0.8-4 Rsolnp_1.16                 pcaPP_1.9-74               
# [61] timeDate_3043.102           truncnorm_1.0-8             tibble_3.1.2                Bessel_0.6-0               
# [65] gmp_0.6-2                   ellipsis_0.3.2

ben-schwen commented 3 years ago

I cannot reproduce the inconsistency: with the current dev version 1.14.3, putting the NA in another column produces the analogous result.

Last column

test <- matrix(letters[1:12], 3, 4)
test[2, 4] <- NA
test <- data.table(test)
test
#>    V1 V2 V3   V4
#> 1:  a  d  g    j
#> 2:  b  e  h <NA>
#> 3:  c  f  i    l
tmp <- tempfile()
fwrite(test, tmp)

test2 <- fread(tmp)
test2
#>    V1 V2 V3 V4
#> 1:  a  d  g  j
#> 2:  b  e  h   
#> 3:  c  f  i  l

Not last column

test <- matrix(letters[1:12], 3, 4)
test[2, 3] <- NA
test <- data.table(test)
test
#>    V1 V2   V3 V4
#> 1:  a  d    g  j
#> 2:  b  e <NA>  k
#> 3:  c  f    i  l
tmp <- tempfile()
fwrite(test, tmp)

test2 <- fread(tmp)
test2
#>    V1 V2 V3 V4
#> 1:  a  d  g  j
#> 2:  b  e     k
#> 3:  c  f  i  l
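
The comparison above can be repeated for every column position with a small loop. This is only a sketch: `roundtrip_cell` is a hypothetical helper written for this example, not part of data.table, and the result depends on the installed version. It places the NA in row 2 of each column in turn, round-trips the table through `fwrite` and `fread`, and collects the value that comes back.

```r
library(data.table)

# Hypothetical helper for this sketch: put an NA in row 2 of column `col`,
# write the table, read it back, and return the cell that fread produced.
roundtrip_cell <- function(col) {
  m <- matrix(letters[1:12], 3, 4)
  m[2, col] <- NA
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))  # remove the temporary file when the function exits
  fwrite(data.table(m), tmp)
  fread(tmp)[[col]][2]
}

# One value per column position.
results <- vapply(1:4, roundtrip_cell, character(1))
results

# TRUE means the column position of the NA made no difference.
length(unique(results)) == 1L
```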