Line numbers in error messages should refer to the raw file

etienne-s commented 7 years ago

Consider the following example:

src1 <- 'col1,col2
a,b
c,d,e'

fread(src1)
# Error in fread(src1) : 
#   Line 3 from sampling jump 0 starting <<c,d,e>> has more than the expected 2 fields. Separator ',' occurs at position 4 which is character 2 of the last field: <<d,e>>. Consider setting 'comment.char=' if there is a trailing comment to be ignored.

The error message reports a problem on line 3, which is fine.

Now suppose b contains a line break:

src2 <- 'col1,col2
a,"b1 
b2"
c,d,e'

fread(src2)
# Error in fread(src2) : 
#   Line 3 from sampling jump 0 starting <<c,d,e>> has more than the expected 2 fields. Separator ',' occurs at position 4 which is character 2 of the last field: <<d,e>>. Consider setting 'comment.char=' if there is a trailing comment to be ignored.

The problem is reported to be on line 3, which is wrong.

This could seem harmless, but when reading a very large CSV with lots of records containing line breaks, the line number in the message is totally wrong, and we end up with no clue to fix the misformatted file.

sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2

st-pasha commented 7 years ago

Unfortunately it gets worse with the large file: during multi-chunk reading each chunk knows how to find the beginning of a "next good line", but has no idea what line number that is. As a result, it may report errors like "Expecting 3 cols but row 12 contains only 2 cols ...", where that 12 is relative to the current chunk not to the whole file (and therefore is less than useful).

mattdowle commented 6 years ago

Thanks @etienne-s for the great report. This has been recently fixed in dev. Here's new behaviour which looks good to me now.

> src1 <- 'col1,col2
+ a,b
+ c,d,e'
> fread(src1)
   col1 col2
1:    a    b
Warning message:
In fread(src1) : Discarded single-line footer: <<c,d,e>>

> src2 <- 'col1,col2
+ a,"b1 
+ b2"
+ c,d,e'
> fread(src2)
   col1    col2
1:    a b1 \nb2
Warning message:
In fread(src2) : Discarded single-line footer: <<c,d,e>>
>

Also, the line numbers on errors in large files, which @st-pasha pointed out was less than useful before, I've now fixed as well. There's now a concept of a thread being at the head position, which corresponds to the last row written to the result, so that thread knows how many rows have passed. Only the thread at the head position now stops the team and throws the error, including an accurate row number. But, it is a row number and not a line number. If there are no embedded newlines, then they are the same and the line number reported will be correct (even when multi-chunked). But, in the event of embedded newlines, the error won't ever report a line number that could accurately be used with head -n. The error does include quite a lot of the line in error so you greping for a substring from it can often work. Having said this i) if chunk boundaries fall within fields with many embedded newlines, it's much better handled now in dev. It does it's best and then ultimately the head position is now able to sweep up such awkward jump landings unambiguously since it has full lineage from the start of file. i.e. errors should not happen now due to jumping, so long as the file is valid when read from the start. and ii) on the todo list somewhere is, because I'd prefer not to try and track the number of newlines inside quoted fields (and therefore the true line number of error in that case), instead, fread could output the 2 or 3 lines either side of the line in error, automatically for you. Like a grep -A 2 -B 2 but automatically.

Please test latest dev again and open a new issue if any problems. I'm pretty sure it should be all fine now in this area. Thanks again.

st-pasha commented 6 years ago

I believe this is still a useful feature to have, and not very hard to implement after your work in #2627 . All that is needed is to increment the thread-local line counter in the string field parser when we check if (*ch=='\n') { ... }. These line counters would then propagate from thread to thread in the same way as myDTi counter does.

The @etienne-s example can better be given as

> fread('C1,C2\na,"b1\nb2"\nc,d,e\nf,g\nh,k\n')
   C1     C2
1:  a b1\nb2
Warning message:
In fread("C1,C2\na,\"b1\nb2\"\nc,d,e\nf,g\nh,k\n") :
  Stopped early on line 3. Expected 2 fields but found 3. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<c,d,e>>

Here the warning message says "stopped early on line 3 <<c,d,e>>", however this is the 4th line in the file. Of course, this example is trivial, but in a multi-GB file the discrepancies between lines/rows may reach tens of thousands making it hard to find the error.

etienne-s commented 6 years ago

Thanks @mattdowle!

I've tested again and my particular problem is solved in the new version, since the new warning message includes a large part of the faulty line. (The old message included only the first 10 characters, and it was not enough to locate the problem with grep.)

In some other cases it might be useful to have an exact line number reported in the message instead of a row number.

Rdatatable / data.table

Line numbers in error messages should refer to the raw file #2428