Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.59k stars 978 forks source link

fread issue with quoted fields that contain line breaks #6275

Open StefanSchellerer opened 2 months ago

StefanSchellerer commented 2 months ago

When reading a csv-file, fread gives an unexpected output.

# example (";"-separated csv)
csv <- "first;second;third;fourth;fith;sixth;seventh;eighth;ninth\r\ntxt_r1c1;txt_r1_c2;\"txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3\";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9\r\ntxt_r2_c1;\"txt_r2_c2\";\"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3\";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9\r\n"
data <- data.table::fread(csv)

gives unexpected output: column 3, which contains text (including ";" characters), is split up instead of read as a single text field

data
#>        first    second      third    fourth      fith     sixth   seventh eighth  ninth
#>       <char>    <char>     <char>    <char>    <char>    <char>    <char> <lgcl> <lgcl>
#> 1:   txt_r1c1 txt_r1_c2 "txt_r1_c3 txt_r1_c3 txt_r1_c3 txt_r1_c3 txt_r1_c3     NA     NA
#> 2: txt_r1_c3" txt_r1_c4  txt_r1_c5 txt_r1_c6 txt_r1_c7 txt_r1_c8 txt_r1_c9     NA     NA
#> 3:  txt_r2_c1 txt_r2_c2 "txt_r2_c3 txt_r2_c3 txt_r2_c3 txt_r2_c3 txt_r2_c3     NA     NA
#> 4: txt_r2_c3" txt_r2_c4  txt_r2_c5 txt_r2_c6 txt_r2_c7 txt_r2_c8 txt_r2_c9     NA     NA

When removing an entire column, the outcome is as expected:

# removing col 1
csv <- "second;third;fourth;fith;sixth;seventh;eighth;ninth\r\ntxt_r1_c2;\"txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3\";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9\r\n\"txt_r2_c2\";\"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3\";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9\r\n"
data <- data.table::fread(csv)

data
#>      second                                                              third    fourth      fith     sixth   seventh    eighth     ninth
#>      <char>                                                             <char>    <char>    <char>    <char>    <char>    <char>    <char>
#> 1: txt_r1_c2 txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3 txt_r1_c4 txt_r1_c5 txt_r1_c6 txt_r1_c7 txt_r1_c8 #> txt_r1_c9
#> 2: txt_r2_c2 txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3 txt_r2_c4 txt_r2_c5 txt_r2_c6 txt_r2_c7 txt_r2_c8 txt_r2_c9

When removing a single ";" in the third column (text field), the outcome is as expected:

# removing ";" in column 3
csv <- "first;second;third;fourth;fith;sixth;seventh;eighth;ninth\r\ntxt_r1c1;txt_r1_c2;\"txt_r1_c3 txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3\";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9\r\ntxt_r2_c1;\"txt_r2_c2\";\"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3\";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9\r\n"
data <- data.table::fread(csv)

data
#>       first    second                                                              third    fourth      fith     sixth   seventh    eighth     ninth
 #>     <char>    <char>                                                             <char>    <char>    <char>    <char>    <char>    <char>    <char>
#> 1:  txt_r1c1 txt_r1_c2  txt_r1_c3 txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3 txt_r1_c4 txt_r1_c5 txt_r1_c6 txt_r1_c7 txt_r1_c8 txt_r1_c9
#> 2: txt_r2_c1 txt_r2_c2 txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3 txt_r2_c4 txt_r2_c5 txt_r2_c6 txt_r2_c7 txt_r2_c8 txt_r2_c9

setting r verbose = TRUE in the first example, this is reported:

data <- data.table::fread(csv, verbose = T)
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            16
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          16
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 8 threads with throttle==1024. See ?setDTthreads.
freadR.c has been passed the data as text input (not a filename)
[01] Check arguments
  Using 8 threads (omp_get_max_threads()=16, nth=8)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  `input` argument is provided rather than a file name, interpreting as raw text to read
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<first;second;third;fourth;fith>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=';'  with 3 lines of 9 fields using quote rule 0
  sep=';'  with 4 lines of 7 fields using quote rule 2
  Detected 7 columns on line 1. This line is either column names or first data row. Line starts as: <<txt_r1c1;txt_r1_c2;"txt_r1_c3;>>
  Quote rule picked = 2
  fill=false and the most number of columns found is 7
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 1 because (297 bytes from row 1 to eof) / (2 * 297 jump0size) == 0
  Type codes (jump 000)    : DDDDDDD  Quote rule 2
Types in 1st data row match types in 2nd data row but previous row has 9 fields. Taking previous row as column names.  All rows were sampled since file is small so we know nrow=3 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : DDDDDDD11
[10] Allocate memory for the datatable
  Allocating 9 column slots (9 - 0 dropped) with 3 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=297
  Too few rows allocated. Allocating additional 1024 rows (now nrows=1027) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=297
Read 4 rows x 9 columns from 356 bytes file in 00:00.001 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : bool8     '1'
         7 : string    'D'
=============================
   0.000s ( 20%) Memory map 0.000GB file
   0.001s ( 61%) sep=';' ncol=9 and header detection
   0.000s (  4%) Column type detection using 3 sample rows
   0.000s (  4%) Allocation of 1027 rows x 9 cols (0.000GB) of which 4 (  0%) rows used
   0.000s ( 11%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 4 rows) using 1 threads
   +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s ( 10%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.001s        Total
Warnmeldungen:
1: In data.table::fread(csv, verbose = T) :
  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In data.table::fread(csv, verbose = T) :
  Detected 9 column names but the data has 7 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
tdhock commented 2 months ago

I was unable to reproduce, can you please post your code and output in a markdown triple backtick block?

```r
your code
#> your output


Please also set `verbose=TRUE` to get more insight into how `fread()` is behaving.
StefanSchellerer commented 2 months ago

Thank you so much! I have edited the post. Can you reproduce the example now?

Additional info: I expect the third column to be read as a single field as it is given within \" \"txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3\", however, in the example given above, the field is split across multiple columns. Using R's read.csv2 gives the expected output btw.

tdhock commented 2 months ago

I am able to reproduce

> csv_issue <- "first;second;third;fourth;fith;sixth;seventh;eighth;ninth\r\ntxt_r1c1;txt_r1_c2;\"txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3\";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9\r\ntxt_r2_c1;\"txt_r2_c2\";\"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3\";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9\r\n"
> (dt_issue <- data.table::fread(csv_issue))
        first    second      third    fourth      fith     sixth   seventh
       <char>    <char>     <char>    <char>    <char>    <char>    <char>
1:   txt_r1c1 txt_r1_c2 "txt_r1_c3 txt_r1_c3 txt_r1_c3 txt_r1_c3 txt_r1_c3
2: txt_r1_c3" txt_r1_c4  txt_r1_c5 txt_r1_c6 txt_r1_c7 txt_r1_c8 txt_r1_c9
3:  txt_r2_c1 txt_r2_c2 "txt_r2_c3 txt_r2_c3 txt_r2_c3 txt_r2_c3 txt_r2_c3
4: txt_r2_c3" txt_r2_c4  txt_r2_c5 txt_r2_c6 txt_r2_c7 txt_r2_c8 txt_r2_c9
   eighth  ninth
   <lgcl> <lgcl>
1:     NA     NA
2:     NA     NA
3:     NA     NA
4:     NA     NA
Warning messages:
1: In data.table::fread(csv_issue) :
  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In data.table::fread(csv_issue) :
  Detected 9 column names but the data has 7 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.

> csv_rm1 <- "second;third;fourth;fith;sixth;seventh;eighth;ninth\r\ntxt_r1_c2;\"txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3\";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9\r\n\"txt_r2_c2\";\"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3\";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9\r\n"
> (dt_rm1 <- data.table::fread(csv_rm1))
      second                                                              third
      <char>                                                             <char>
1: txt_r1_c2 txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3
2: txt_r2_c2 txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3
      fourth      fith     sixth   seventh    eighth     ninth
      <char>    <char>    <char>    <char>    <char>    <char>
1: txt_r1_c4 txt_r1_c5 txt_r1_c6 txt_r1_c7 txt_r1_c8 txt_r1_c9
2: txt_r2_c4 txt_r2_c5 txt_r2_c6 txt_r2_c7 txt_r2_c8 txt_r2_c9

> csv_rm3 <- "first;second;third;fourth;fith;sixth;seventh;eighth;ninth\r\ntxt_r1c1;txt_r1_c2;\"txt_r1_c3 txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3\";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9\r\ntxt_r2_c1;\"txt_r2_c2\";\"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3\";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9\r\n"
> (dt_rm3 <- data.table::fread(csv_rm3))
       first    second
      <char>    <char>
1:  txt_r1c1 txt_r1_c2
2: txt_r2_c1 txt_r2_c2
                                                                third    fourth
                                                               <char>    <char>
1:  txt_r1_c3 txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3\r\ntxt_r1_c3 txt_r1_c4
2: txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3\r\ntxt_r2_c3 txt_r2_c4
        fith     sixth   seventh    eighth     ninth
      <char>    <char>    <char>    <char>    <char>
1: txt_r1_c5 txt_r1_c6 txt_r1_c7 txt_r1_c8 txt_r1_c9
2: txt_r2_c5 txt_r2_c6 txt_r2_c7 txt_r2_c8 txt_r2_c9

please avoid repeating the same variable names with different values (csv, data), and instead to avoid confusion, please use different names for different values (csv_rm3, csv_rm1, etc).

Below we see the data

> cat(csv_issue)
first;second;third;fourth;fith;sixth;seventh;eighth;ninth
txt_r1c1;txt_r1_c2;"txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3
txt_r1_c3";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9
txt_r2_c1;"txt_r2_c2";"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3
txt_r2_c3";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9
> cat(csv_rm1)
second;third;fourth;fith;sixth;seventh;eighth;ninth
txt_r1_c2;"txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3
txt_r1_c3";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9
"txt_r2_c2";"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3
txt_r2_c3";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9
> cat(csv_rm3)
first;second;third;fourth;fith;sixth;seventh;eighth;ninth
txt_r1c1;txt_r1_c2;"txt_r1_c3 txt_r1_c3; txt_r1_c3; txt_r1_c3; txt_r1_c3
txt_r1_c3";txt_r1_c4;txt_r1_c5;txt_r1_c6;txt_r1_c7;txt_r1_c8;txt_r1_c9
txt_r2_c1;"txt_r2_c2";"txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3; txt_r2_c3
txt_r2_c3";txt_r2_c4;txt_r2_c5;txt_r2_c6;txt_r2_c7;txt_r2_c8;txt_r2_c9

The issue title could be more specific. This looks like an issue with quoted fields that contain line breaks.

StefanSchellerer commented 2 months ago

Thanks! I'll keep it in mind. What do you suggest? Is any further input from my side necessary?