Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com

fread from v.1.11.0+ no longer reads the .csv correctly, which was read perfectly in v.1.10.4-3 #2857

mihcis commented 6 years ago

I've encountered a possible bug in the newer versions of data.table. I have a 2GB .csv file with c. 3 million rows and 67 columns. fread() from data.table v.1.10.4-3 reads it all fine, but v.1.11.0+ stops early at a row somewhere in the middle.

The base read.csv() also hits the same problem, while fread() from v.1.10.4-3 worked perfectly.

I really like data.table and want to help troubleshoot this, but obviously I can't upload the 2GB data file anywhere as it is too big and private.

I could upload a few rows, but I don't know a way of splicing the .csv without reading it into R. I am keen to get any help, though, in order to create a reproducible example.

The verbose output is here.

The sessionInfo() is here.

st-pasha commented 6 years ago

The warning message says

Stopped early on line 138986. Expected 67 fields but found 22. Consider fill=TRUE.
First discarded non-empty line: 
...

The first thing I'd suggest is to open your file in a text editor and look at line 138986 (the counter may be slightly off, so it might be better to search for the line shown). If you don't have a way to open a 2GB text file, try fread-ing it with sep=NULL (this is equivalent to readLines()). Then extract the relevant lines (e.g. DT[138980:138990]) and see what the problem is with line 138986.
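
A sketch of that suggestion in R (the file name is taken from the PowerShell command below; adjust to yours):

library(data.table)
# sep=NULL reads each line whole, as a single character column (like readLines)
DT <- fread("Data/FP17s with TCD in March 2018.csv", sep = NULL, header = FALSE)
DT[138980:138990]   # inspect the lines around the reported row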

msummersgill commented 6 years ago

Do you have access to Windows Powershell? If so, the command below will extract the header and a handful of lines where the verbose output reported the error.

Get-Content '.\Data\FP17s with TCD in March 2018.csv' | Select -Index (0,138985,138986,138987,138988,138989) > 'FileSubset.csv'
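
An equivalent extraction in R (a sketch; same file, with 1-based indices where PowerShell's Select -Index is 0-based):

library(data.table)
lines <- fread("Data/FP17s with TCD in March 2018.csv", sep = NULL, header = FALSE)[[1]]
writeLines(lines[c(1, 138986:138990)], "FileSubset.csv")

Note that Windows PowerShell's > redirection writes UTF-16LE by default, so the R route may sidestep the re-encoding issue that comes up later in this thread.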

mihcis commented 6 years ago

Thanks for your suggestions. I tried using Notepad++ 64bit to look at the problematic place. The line that is supposed to be one row is broken into 3 lines.

However, there is still the question of why the older data.table version was able to handle this in a smart way and read the data correctly, while the newer version cannot.

I still cannot provide a reproducible example without sharing the 2GB data file (it is also private data). But I can provide this video to show what really happens.

I tried @msummersgill's solution to splice out the header and a few rows around the problematic place. Trying to read this file now, both data.table versions give an error, but each a different one.

v.1.10.4-3 gives:

Error in fread("spliced.csv", na.strings = c("#EMPTY", "")) : embedded nul in string: 'ÿþC\0o\0n\0t\0r\0a\0c\0t\0 \0N\0u\0m\0b\0e\0r\0 \0a\0n\0d\0 \0T\0a\0g\0'

v.1.11.2 gives:

Error in fread("spliced.csv", na.strings = c("#EMPTY", "")) : File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.
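
(For reference, a sketch of recoding UTF-16 to UTF-8 from within R, using the file names from this thread:)

con <- file("spliced.csv", encoding = "UTF-16")   # the BOM determines byte order
txt <- readLines(con)
close(con)
writeLines(txt, file("spliced2.csv", encoding = "UTF-8"))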

If I recode to UTF-8 as suggested (whatever the encoding was, the original worked, and it must have been changed when the spliced file was created via Windows Powershell; EDIT: the original file was encoded in UTF-8!), the two versions again give two different errors:

v.1.10.4-3 gives:

Expected sep (',') but new line, EOF (or other non printing character) ends field 21 when detecting types from point 0: 916439/0001,Q69,GDS Contract,MR A SYED,916439,Mr,SYED A Mr,A,SYED,58955,3718.00,Nine Mile Ride Dental Practice,Dental Surgery,193 Nine Mile Ride,Finchampstead,WOKINGHAM,RG40 4JD,2181233168.00,TORIN,FASTNEDGE,1 ANCHORITE CLOSE,

v.1.11.2 gives:

In fread("spliced2.csv", na.strings = c("#EMPTY", "")) : Stopped early on line 7. Expected 67 fields but found 22. Consider fill=TRUE and comment.char=. First >discarded non-empty line: <<916439/0001,Q69,GDS Contract,MR A SYED,916439,Mr,SYED A >Mr,A,SYED,58955,3718.00,Nine Mile Ride Dental Practice,Dental Surgery,193 Nine Mile >Ride,Finchampstead,WOKINGHAM,RG40 4JD,2181233168.00,TORIN,FASTNEDGE,1 ANCHORITE >CLOSE,>>

In summary, I still cannot provide a reproducible example. The fact remains that the old fread from v.1.10.4-3 can read the file perfectly fine, despite the problematic point in the file where one line is split into three, whereas the newer v.1.11.0+ can no longer do it.

mattdowle commented 6 years ago

This has been raised on Stack Overflow, simultaneously. I had answered there, wasting time. https://stackoverflow.com/a/50260429/403310

Why are you still not passing fill=TRUE and investigating from there, as the warning message clearly suggested when it was presented 3 hours ago?

mihcis commented 6 years ago

@mattdowle I have tried using fill=TRUE unsuccessfully.

With fill=TRUE, v.1.11.0+ reads all 3+ million rows instead of the 138985 read with fill=FALSE. But all the headers and rows are read incorrectly and shifted - screenshot.

This is still a different behaviour from v.1.10.4-3, which reads all 3+ million rows correctly (with fill=FALSE). This is how it looks when read correctly.
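
A toy illustration (made-up data) of why fill=TRUE appears to shift everything: the fragments of a split row become short rows padded with NA, rather than being rejoined into one row.

library(data.table)
fread("A,B,C\n1,2,3\n4,\n5,6\n7,8,9\n", fill = TRUE)
# roughly:
#        A  B  C
# 1:     1  2  3
# 2:     4 NA NA   <- first fragment of the split row, padded
# 3:     5  6 NA   <- second fragment, misaligned
# 4:     7  8  9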

mattdowle commented 6 years ago

I'm annoyed because you had a really good warning message but did not even put it in the original Stack Overflow question. It would have been better if you had started off on the right foot by saying something like 'I really appreciate the nice warning message that you guys carefully wrote, but I think 1.10.4-3 was actually handling this too-short line, whereas 1.11.0 does not'. Instead, you didn't even mention there was a warning message, so we have duplicated and wasted effort in responding to you about it. It came across as if fread were quite unhelpful and you had no idea where the problem was, whereas actually it is quite helpful, with that warning message.

Thanks for stating that the warning's suggestion of fill=TRUE does not work.

It looks from the line in the warning message that it ends with a field containing a postal address. The last characters in that field are a comma followed by the newline: 1 ANCHORITE CLOSE,. That newline is probably supposed to be part of the address? If fields contain newlines, they are supposed to be quoted. fread is supposed to switch to an auto-healing rule in this case, with a warning that you have an invalid file. Perhaps there is a bug in that it is no longer switching quoteRule out-of-sample. Does it sound plausible that it's an address field containing newlines that has not been quoted?
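
For contrast, a minimal sketch (made-up data echoing the address above) of what a valid file looks like: when the field containing the newline is quoted, fread reads it back as a single row.

library(data.table)
fread('A,B,C\n1,"1 ANCHORITE CLOSE,\nTWYFORD",3\n4,5,6\n')
# two rows; column B of row 1 retains the embedded newline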

Now that you've opened the file in Notepad++ 64bit, are you able to trim it down to create a reproducible example? Include 3 lines before the bad one plus the column names. Then go through the 5-10 lines and change the data randomly so as to obfuscate it. Change letters to other letters. Change numbers to other numbers. Drop columns until it starts working and then put back that column. Then you have a small reproducible example we can look at. It could be quotes, embedded NUL, many things really. My guess is that this will take 5 - 30 mins.

However, it may start working if your small file is under 100 lines, because the first 100 lines are used to detect the quote rule in a different part of the code. It should not matter - fread should be able to switch quote rule out-of-sample - but maybe that is where the bug is. If that's the case, we'll need a file with more than 100 lines to reproduce it, as sketched below.
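
Something like this would generate a >100-line file with the suspected defect (a synthetic sketch, not your data):

library(data.table)
lines <- c("A,B,C", rep("1,2,3", 120))
lines[110] <- "4,5,1 ANCHORITE CLOSE\nTWYFORD"   # unquoted embedded newline
writeLines(lines, "repro.csv")   # the \n splits this row into two physical lines
fread("repro.csv")   # the bad row is past line 100, so quote-rule detection won't sample it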

If at this point it's too hard for you to create and obfuscate the data, we could create the reproducible example ourselves. We would just need you to confirm that you have an address field containing newlines and that the field has not been quoted, please.

If this is the case - a comma-separated file (not an unusual sep) containing an unquoted field with newlines in it (i.e. an invalid file) - what software wrote this file in the first place ... can you fix the writer? Because if so, that is where the root problem is.

st-pasha commented 6 years ago

@mihcis My hypothesis is that you have a case of a \r\n-terminated file where one of the fields accidentally contains a single \n which is not quoted. For example:

> require(data.table)
data.table 1.10.4.3
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> fread("A,B,C\r\n1,2,3\r\n4,\n5,6\r\n7,8,9\r\n")
       A     B     C
   <int> <int> <int>
1:     1     2     3
2:     4     5     6
3:     7     8     9

(restart R and reinstall data.table)

> require(data.table)
Loading required package: data.table
data.table 1.11.1 IN DEVELOPMENT built 2018-05-09 22:33:37 UTC; pasha
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> fread("A,B,C\r\n1,2,3\r\n4,\n5,6\r\n7,8,9\r\n")
       A     B     C
   <int> <int> <int>
1:     1     2     3
Warning message:
In fread("A,B,C\r\n1,2,3\r\n4,\n5,6\r\n7,8,9\r\n") :
  Stopped early on line 3. Expected 3 fields but found 2. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<4,>>

You can confirm whether this is indeed the root of your problem by checking the verbose output of fread in data.table 1.10.4.3 (as far as I can see, you only posted verbose output for 1.11.1).
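
(For anyone following along: that output is produced by passing verbose=TRUE, e.g. fread("yourfile.csv", verbose = TRUE), where the file name here is a placeholder.)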

mattdowle commented 6 years ago

@st-pasha Is your hypothesis different to mine? Here's what I wrote :

It looks from the line in the warning message that it ends with a field containing a postal address. The last characters in that field are a comma followed by the newline: 1 ANCHORITE CLOSE,. That newline is probably supposed to be part of the address? If fields contain newlines, they are supposed to be quoted.

st-pasha commented 6 years ago

@mattdowle Yes.

As you surely noticed, the OP has tried many different things to approach this problem, and he is certainly aware from his screenshot that the line in question is broken by stray newlines. I surmise that he was perfectly able to delete those newlines in the text editor and then read the file correctly. Still, he continues to engage with this issue because

I really like data.table and want to help troubleshoot this.

The real question is therefore this: just how was data.table 1.10.4.3 able to read the file correctly (despite not having any special line-healing abilities), while data.table 1.11.1 cannot? Is this a regression, or was it somehow a bug in 1.10.4.3?

If you re-read my message above, you'll see that:

mihcis commented 6 years ago

Thank you both for your replies, @mattdowle and @st-pasha. I was not able to reply earlier as I guess we are in different time zones.

@st-pasha here are the verbose outputs from 1.10.4-3 and 1.11.2 for comparison.

Indeed, I find data.table incredibly useful in my work, but I was alarmed when I updated it to the latest version and could no longer read data that I had previously worked with just fine.

When this happened, I tried rolling back the versions one by one until it worked, in 1.10.4-3. So the issue affects 1.11.0, 1.11.1 and 1.11.2 (or v.1.11.0+, as I wrote originally).

@st-pasha I am not an expert, but your example certainly seems to mimic what happens with my 2GB data.

I am not sure why that one line is broken in the CSV. The CSV was created by exporting the data from an ancient piece of software we use at work (SAP Business Objects), which is a client for an SQL-based data warehouse. There might well have been rogue "newline" symbols introduced at the end of some cells and then exported incorrectly.

If I had used read.csv(), I would still have encountered a problem. This is what the data looks like when read with read.csv() - screenshot. The last good row is the one where the practice name is Murgelas Practice Management Limited. The case after that is read in as 3 rows - exactly as seen when opening the data in Notepad++. It is surprising that fread() from 1.10.4-3 was able to read that case correctly as one row.

Yesterday I tried splicing ~10 rows around the problematic place using Windows Powershell as suggested above, and the old 1.10.4-3 could no longer read it (although with a different error message from 1.11.2). Today I tried the same using Notepad++. This actually worked, and I managed to trim the file down to 12 cases that reproduce the issue! I have kept: (1) the header, (2) the problematic row split into 3, plus a few rows on either side, AND (3) the last two rows in the file plus the ending. I noticed that the file ends with an empty line, so I wanted to preserve it in case it is important.

Please find the file here. (Use the Download button at the top right corner.) And let me know what you find.

jangorecki commented 6 years ago

ancient piece of software we use at work (SAP Business Objects), which is a client for an SQL-based data warehouse.

I know this combination well; I remember it was very annoying to see meaningless Java error messages thrown randomly when using reports. Did you try to see how this cell (which has a newline in it) is displayed in the report?

There might well have been rogue "newline" symbols introduced at the end of some cells and then exported incorrectly.

What I think is more likely is that the user input form captured the newline properly - when it was entered by someone filling in the address field - while the SAP CSV extract mishandled the newline in that cell: the newline is still part of the data, so the cell should have been quoted.

mihcis commented 6 years ago

Did you try to see how this cell (which has a newline in it) is displayed in the report?

No, I try to avoid using it as much as possible. After I ran the query, I exported it as CSV straight away to work with in R, and then purged the data from SAP.

What I think is more likely is that the user input form captured the newline properly - when it was entered by someone filling in the address field - while the SAP CSV extract mishandled the newline in that cell: the newline is still part of the data, so the cell should have been quoted.

I agree that the root problem is in the way SAP created the CSV. The point is, I was blissfully unaware of any problems when I was using the older data.table. And then when I realized there was a problem, I thought, "Wow! How smart is data.table if it can handle crap CSVs made by junk software", and then... "Why was this smart capability lost from the newer version?"

st-pasha commented 6 years ago

Thanks @mihcis, based on your sample I can see that the hexdump of the problematic place is exactly as I suspected:

31 20 41 4E 43 48 4F 52 49 54 45 20 43 4C 4F 53 45 2C 0A 54 57 59 46 4F 52 44 2C 0A 42 45 52 4B 53 2C
1  _  A  N  C  H  O  R  I  T  E  _  C  L  O  S  E  ,  \n T  W  Y  F  O  R  D  ,  \n B  E  R  K  S  ,

So you can see there are single \n characters here, whereas the rest of your file uses \r\n (Windows default) separators.
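
A dump like the one above can be produced from R (a sketch; the file name and byte offset are hypothetical and would come from locating the bad row first):

con <- file("sample.csv", "rb")
seek(con, where = 1000)                      # assumed offset near the bad row
chunk <- readBin(con, what = "raw", n = 34)
close(con)
paste(toupper(as.character(chunk)), collapse = " ")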

The reason it used to work before is that fread 1.10.4.3 had a "detect newline" step, which found that the newline consists of \r + \n, and therefore looked for a \r character to locate the beginning of each newline. It was somewhat buggy in places too (for example, in some places \r + any character was considered a newline). This is why, when it encountered a single \n, it treated it as a regular character.

The new version of fread is more versatile: it no longer has a newline-detection step, and instead considers a newline to be anything that looks like a newline to a regular text editor (there is, however, a special exception for standalone \r characters). This helps to parse files where newlines got accidentally mixed up, but unfortunately it means that some invalid files with mixed newlines that used to be parseable now throw errors...
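
In the meantime, one possible workaround sketch (not an official fread option): pre-heal the lone \n characters before parsing. This assumes the file is \r\n-terminated, fits within R's ~2GB single-string limit, and that a lone \n should simply be deleted; the file name is a placeholder.

library(data.table)
raw <- readChar("data.csv", file.size("data.csv"), useBytes = TRUE)
healed <- gsub("(?<!\r)\n", "", raw, perl = TRUE)   # drop any \n not preceded by \r
DT <- fread(healed)   # fread parses a string containing \n as literal data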

mihcis commented 6 years ago

@st-pasha Thank you for the explanation. It looks like I'm stuck with having to revert to 1.10.4.3 whenever I hit this problem. As I've been using 1.10.4.3 or older most of the time, I don't know if this is a one-off occurrence in the way SAP created the CSV, or a regular one. I'll have to test more files.

Do you think it would be a good idea though to keep the old behaviour as an optional argument? E.g.

fread('data.csv', newline="\r\n") and fread('data.csv', newline="\n") (default)

miderxi commented 5 years ago

This has been raised on Stack Overflow, simultaneously. I had answered there, wasting time. https://stackoverflow.com/a/50260429/403310

Why are you still not passing fill=TRUE and investigating from there, as the warning message clearly suggested when it was presented 3 hours ago?

Before you say this to me: I tried the way fread suggested, and it worked.