Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 985 forks source link

Read space delimited file where missing values are blank "". #3658

Open zx8754 opened 5 years ago

zx8754 commented 5 years ago

Example input text file - fileTest.txt:

c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4

Adding a screenshot to show there are 3 spaces on row r2 between 0 and 3, i.e.: values are missing for c3 and c4.

image

Using R version 3.4.0 and data.table 1.10.4 (Session info 1), below works as expected:

> fread("fileTest.txt", strip.white = FALSE)
   c1 c2 c3 c4 c5 c6
1: r1  0  1  2  3  4
2: r2  0 NA NA  3  4
3: r3  0  1  2  3  4

But fails with R version 3.5.2 and data.table_1.12.2. (Session info 2).

> fread("fileTest.txt", strip.white = FALSE)
   c1 c2 c3 c4 c5 c6
1: r1  0  1  2  3  4
Warning message:
In fread("fileTest.txt", strip.white = FALSE) :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>

Other attempts, all failed:

fread("fileTest.txt")
fread("fileTest.txt", sep = " ")
fread("fileTest.txt", strip.white = FALSE)
fread("fileTest.txt", sep = " ", strip.white = FALSE, na.strings = c("NA", ""))
fread("fileTest.txt", sep = " ", strip.white = FALSE, quote = "")

Note

Session info 1

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /opt/gridware/apps/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] compiler_3.4.0

Session info 2

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.2

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2   
amarron314 commented 5 years ago

Is there any news on this / any workaround without resorting to the slow read.table? I am also trying to read a large space separated file in R and this problem is so frustrating (especially since your SO post describes how they had a solution then made it stop working). What makes it worse is that the file is gzipped, and even read.table seems to have trouble with the double-space containing line.

mehrlander commented 3 years ago

I just found this, with the same problem. It's interesting as I saw noted that two consecutive commas are correctly interpreted to imply a blank value between, but two consecutive spaces are not.

dvg-p4 commented 2 years ago

This is still a problem in the latest version, 1.14.2

privefl commented 9 months ago

Having the same issue with v1.14.8

tdhock commented 9 months ago

confirming, this is still an issue with current R-devel and data.table-1.15.0. Here is R code to reproduce

text <- "c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4"
read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
data.table::fread(text=text, strip.white=FALSE)

here are the results on my system:

> text <- "c1 c2 c3 c4 c5 c6
+ r1 0 1 2 3 4
+ r2 0   3 4
+ r3 0 1 2 3 4"
> read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
  V1 V2   V3   V4 V5 V6
1 c1 c2   c3   c4 c5 c6
2 r1  0    1    2  3  4
3 r2  0 <NA> <NA>  3  4
4 r3  0    1    2  3  4
> data.table::fread(text=text, strip.white=FALSE)
       c1    c2    c3    c4    c5    c6
   <char> <int> <int> <int> <int> <int>
1:     r1     0     1     2     3     4
Warning message:
In data.table::fread(text = text, strip.white = FALSE) :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>
> sessionInfo()
R Under development (unstable) (2024-01-23 r85822 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Phoenix
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.4.0    tools_4.4.0       data.table_1.15.0
> 
MichaelChirico commented 9 months ago

confirming, this is still an issue with current R-devel and data.table-1.15.0. Here is R code to reproduce

text <- "c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4"
read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
data.table::fread(text=text, strip.white=FALSE)

here are the results on my system:

> text <- "c1 c2 c3 c4 c5 c6
+ r1 0 1 2 3 4
+ r2 0   3 4
+ r3 0 1 2 3 4"
> read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
  V1 V2   V3   V4 V5 V6
1 c1 c2   c3   c4 c5 c6
2 r1  0    1    2  3  4
3 r2  0 <NA> <NA>  3  4
4 r3  0    1    2  3  4
> data.table::fread(text=text, strip.white=FALSE)
       c1    c2    c3    c4    c5    c6
   <char> <int> <int> <int> <int> <int>
1:     r1     0     1     2     3     4
Warning message:
In data.table::fread(text = text, strip.white = FALSE) :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>
> sessionInfo()
R Under development (unstable) (2024-01-23 r85822 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Phoenix
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.4.0    tools_4.4.0       data.table_1.15.0
> 

what about na.strings=" "?

dvg-p4 commented 9 months ago

@MichaelChirico

what about na.strings=" "?

That doesn't correspond to the desired result here. Spaces should be delimiters, not NAs--the lack of a value between spaces should be interpreted as an NA.

And in any case, trying it regardless just yields different errors:

> data.table::fread(text=text, strip.white=FALSE, na.strings = " ")
Error in data.table::fread(text = text, strip.white = FALSE, na.strings = " ") : 
  na.strings[1]==" " consists only of whitespace, ignoring. But strip.white=FALSE. Use strip.white=TRUE (default) together with na.strings="" to turn any number of spaces in string columns into <NA>
> data.table::fread(text=text, strip.white=TRUE, na.strings = " ")
   c1 c2 c3 c4 c5 c6
1: r1  0  1  2  3  4
Warning messages:
1: In data.table::fread(text = text, strip.white = TRUE, na.strings = " ") :
  na.strings[1]==" " consists only of whitespace, ignoring. Since strip.white=TRUE (default), use na.strings="" to specify that any number of spaces in a string column should be read as <NA>.
2: In data.table::fread(text = text, strip.white = TRUE, na.strings = " ") :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>
dvg-p4 commented 9 months ago

Oh I see, you probably meant na.strings = "" which was missing from tdhock's example. Still doesn't work:

> packageVersion("data.table")
[1] ‘1.15.0’
> text <- "c1 c2 c3 c4 c5 c6
+ r1 0 1 2 3 4
+ r2 0   3 4
+ r3 0 1 2 3 4"
> data.table::fread(text=text, strip.white=FALSE, na.strings = "")
       c1    c2    c3    c4    c5    c6
   <char> <int> <int> <int> <int> <int>
1:     r1     0     1     2     3     4
Warning message:
In data.table::fread(text = text, strip.white = FALSE, na.strings = "") :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>

Pasteable:

text <- "c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4"

data.table::fread(text=text, strip.white=FALSE, na.strings = "")