Open zx8754 opened 5 years ago
Is there any news on this / any workaround without resorting to the slow read.table? I am also trying to read a large space separated file in R and this problem is so frustrating (especially since your SO post describes how they had a solution then made it stop working). What makes it worse is that the file is gzipped, and even read.table seems to have trouble with the double-space containing line.
I just found this, with the same problem. Interestingly, as noted elsewhere, two consecutive commas are correctly interpreted as a blank value between them, but two consecutive spaces are not.
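Since consecutive commas are parsed as empty fields, one possible workaround is to rewrite each space as a comma before calling fread. The following is only a sketch, not a fix endorsed in this thread: the variable names are illustrative, and it assumes no field contains a comma or a quoted space.

```r
library(data.table)

# Same sample as in this thread: row r2 has three spaces between 0 and 3.
lines <- c(
  "c1 c2 c3 c4 c5 c6",
  "r1 0 1 2 3 4",
  "r2 0   3 4",
  "r3 0 1 2 3 4"
)

# Turn every single space into a comma, so empty fields become ",,".
# Assumption: no field contains a comma or a quoted space.
csv_lines <- gsub(" ", ",", lines, fixed = TRUE)

# With sep = ",", the empty fields on row r2 are read as NA.
dt <- fread(text = csv_lines, sep = ",")
print(dt)
```

For a file on disk, the lines could come from readLines() (on a gzfile() connection for gzipped input), at the cost of holding the raw text in memory and losing some of fread's speed.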
This is still a problem in the latest version, 1.14.2
Having the same issue with v1.14.8
confirming, this is still an issue with current R-devel and data.table-1.15.0. Here is R code to reproduce
text <- "c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4"
read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
data.table::fread(text=text, strip.white=FALSE)
here are the results on my system:
> text <- "c1 c2 c3 c4 c5 c6
+ r1 0 1 2 3 4
+ r2 0   3 4
+ r3 0 1 2 3 4"
> read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
  V1 V2   V3   V4 V5 V6
1 c1 c2   c3   c4 c5 c6
2 r1  0    1    2  3  4
3 r2  0 <NA> <NA>  3  4
4 r3  0    1    2  3  4
> data.table::fread(text=text, strip.white=FALSE)
       c1    c2    c3    c4    c5    c6
   <char> <int> <int> <int> <int> <int>
1:     r1     0     1     2     3     4
Warning message:
In data.table::fread(text = text, strip.white = FALSE) :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>
> sessionInfo()
R Under development (unstable) (2024-01-23 r85822 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Phoenix
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.4.0 tools_4.4.0 data.table_1.15.0
>
what about na.strings=" "?
@MichaelChirico
what about na.strings=" "?
That doesn't correspond to the desired result here. Spaces should be delimiters, not NAs--the lack of a value between spaces should be interpreted as an NA.
And in any case, trying it regardless just yields different errors:
> data.table::fread(text=text, strip.white=FALSE, na.strings = " ")
Error in data.table::fread(text = text, strip.white = FALSE, na.strings = " ") :
na.strings[1]==" " consists only of whitespace, ignoring. But strip.white=FALSE. Use strip.white=TRUE (default) together with na.strings="" to turn any number of spaces in string columns into <NA>
> data.table::fread(text=text, strip.white=TRUE, na.strings = " ")
   c1 c2 c3 c4 c5 c6
1: r1  0  1  2  3  4
Warning messages:
1: In data.table::fread(text = text, strip.white = TRUE, na.strings = " ") :
  na.strings[1]==" " consists only of whitespace, ignoring. Since strip.white=TRUE (default), use na.strings="" to specify that any number of spaces in a string column should be read as <NA>.
2: In data.table::fread(text = text, strip.white = TRUE, na.strings = " ") :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>
Oh I see, you probably meant na.strings = "", which was missing from tdhock's example. Still doesn't work:
> packageVersion("data.table")
[1] ‘1.15.0’
> text <- "c1 c2 c3 c4 c5 c6
+ r1 0 1 2 3 4
+ r2 0   3 4
+ r3 0 1 2 3 4"
> data.table::fread(text=text, strip.white=FALSE, na.strings = "")
       c1    c2    c3    c4    c5    c6
   <char> <int> <int> <int> <int> <int>
1:     r1     0     1     2     3     4
Warning message:
In data.table::fread(text = text, strip.white = FALSE, na.strings = "") :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>
Pasteable:
text <- "c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4"
data.table::fread(text=text, strip.white=FALSE, na.strings = "")
Example input text file - fileTest.txt:
Adding a screenshot to show there are 3 spaces on row r2 between 0 and 3, i.e., the values for c3 and c4 are missing.
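To see why three spaces imply two missing values, the row can be split on every single space. This is a minimal base R sketch; the variable names are illustrative and not part of the original report.

```r
# Row r2 from the example: three spaces between 0 and 3.
row <- "r2 0   3 4"

# Split on every single space; consecutive spaces produce empty strings.
fields <- strsplit(row, " ", fixed = TRUE)[[1]]

# Treat empty fields as missing values.
fields[fields == ""] <- NA_character_

print(fields)
print(length(fields))  # 6, matching the six header columns
```

Under this reading the row has six fields, with the third and fourth (c3 and c4) missing, which is what read.table reports with sep = " ".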
Using R version 3.4.0 and data.table 1.10.4 (Session info 1), below works as expected:
But fails with R version 3.5.2 and data.table_1.12.2. (Session info 2).
Other attempts, all failed:
Note
Session info 1
Session info 2