Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 987 forks source link

fread error on specific CSV #5110

Open kmichelson opened 3 years ago

kmichelson commented 3 years ago

# [Minimal reproducible example]

I first downloaded a list of federal grants: https://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2016.zip The path to the extracted file was placed in the variable csvfile

  csvdata<-fread(csvfile,colClasses="character")

I then get the following warning that stops further reading:

1: In fread(csv, colClasses = "character") :
  Stopped early on line 42369. Expected 46 fields but found 38. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"100239","","AF","","","","09/30/2016","09/29/2021","n/a","","N/A","","HHSP233201500059I/HHSP23337002T","ACF:5356081\","","2016","Administration for Children and Families","","NEW YORK","UNITED STATES","N/A","12","075258780","US","1633601","MDRC","NY","100164328","",";  (contact); ","BURCHINAL, MARGARET; HSUEH, JOANN  (contact);","MARTINEZ-BECK, IVELISSE","09/30/2016","09/29/2021","early care and education quality improvement; quality of ECE ; child outcomes; preschool-aged children; implementat>>

When I opened the CSV in Excel, it had no trouble loading, and I could not detect any issues with the line. Even if there is a problem with the CSV line itself, fread should at least gracefully handle the problem and move on. Thanks for considering the issue.

# Output of sessionInfo() R version 4.1.0 (2021-05-18) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 10.16

Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] lubridate_1.7.10 bit64_4.0.5 bit_4.0.4 stringr_1.4.0 data.table_1.14.0

loaded via a namespace (and not attached): [1] compiler_4.1.0 magrittr_2.0.1 generics_0.1.0 tools_4.1.0 Rcpp_1.0.6 tinytex_0.31
[7] stringi_1.6.2 xfun_0.23

avimallu commented 3 years ago

The error is accurate, because if you open the file in a text editor, the line ends without the closing quote at

implementation of quality improvement;

and starts with:

; ","VARIATIONS IN IMPLEMENTATION OF QUALITY INTERVENTIONS:

where the third character i.e. the quote should have ideally been in the previous line. fread is telling you that the file has been incorrectly created, and there's no non-human solution to it that a computer can follow through without ambiguity to resolve. fread is handling it gracefully - telling you where the problem is and stopping, hoping you'll be able to fix it.

Excel also reads the file incorrectly for me:

image

I used Data > From Text/CSV and rechecked the PowerQuery configuration to ensure that the correct separator i.e. comma is used. Could you recheck how Excel handles the file in your specific system?

kmichelson commented 3 years ago

Interesting. My Excel appears to handle it a little better. I see why the line is ambiguous as written - I am proposing that fread could gracefully skip reading the line, throw a warning, and read the rest of the file.

Screen Shot 2021-08-23 at 12 08 00 PM
avimallu commented 3 years ago

Wouldn't skipping require forgoing two lines in this case (as the file is written)? It would be difficult for any system to correctly classify to ignore the second line as well as the first, and do so correctly in every situation. Just my two cents on that specific feature request.

I'm sure you are aware, but for the benefit of others coming across this thread - you can use fill = TRUE, and manually filter out these two lines. Not the best solution, but accessible in case you can't edit the file.

kmichelson commented 3 years ago

I see your point and suppose there is not a generalizable solution that won't cause other problems. For what it's worth, fill=TRUE caused R to crash for me (see below), so I ended up reading the file by saving from Excel and reading the resultant file, which worked without issue.

Screen Shot 2021-08-23 at 12 17 50 PM
avimallu commented 3 years ago

Now that appears to be a bug. This is the verbose output (didn't cause a crash for me):

Click to view ``` > dat <- fread("RePORTER_PRJ_C_FY2016_new.csv", fill = TRUE, verbose = TRUE) OpenMP version (_OPENMP) 201511 omp_get_num_procs() 12 R_DATATABLE_NUM_PROCS_PERCENT unset (default 50) R_DATATABLE_NUM_THREADS unset R_DATATABLE_THROTTLE unset (default 1024) omp_get_thread_limit() 2147483647 omp_get_max_threads() 12 OMP_THREAD_LIMIT unset OMP_NUM_THREADS unset RestoreAfterFork true data.table is using 6 threads with throttle==1024. See ?setDTthreads. freadR.c has been passed a filename: RePORTER_PRJ_C_FY2016_new.csv [01] Check arguments Using 6 threads (omp_get_max_threads()=12, nth=6) NAstrings = [<>] None of the NAstrings look like numbers. show progress = 1 0/1 column will be read as integer [02] Opening the file Opening file RePORTER_PRJ_C_FY2016_new.csv File opened, size = 161.1MB (168917200 bytes). Memory mapped ok [03] Detect and skip BOM [04] Arrange mmap to be \0 terminated \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal. [05] Skipping initial rows if needed Positioned on line 1 starting: <<"APPLICATION_ID","ACTIVITY","A>> [06] Detect separator, quoting rule, and ncolumns Detecting sep automatically ... sep=',' with 46 fields using quote rule 0 Detected 46 columns on line 1. This line is either column names or first data row. Line starts as: <<"APPLICATION_ID","ACTIVITY","A>> Quote rule picked = 0 fill=true and the most number of columns found is 46 [07] Detect column types, good nrow estimate and whether first row is column names Number of sampling jump points = 100 because (168917198 bytes from row 1 to eof) / (2 * 245154 jump0size) == 344 Type codes (jump 000) : 5CC5CCCC5CCCCCC5CCCCC55C5CC5CCCCCCCC5CC2C55552 Quote rule 0 Type codes (jump 001) : 5CC5CCCC5CCCCCC5CCCCC5CC5CC5CCCCCCCC5CC2C55552 Quote rule 0 Type codes (jump 002) : 5CC5CCCC5CCCCCC5CCCCC5CC5CCCCCCCCCCC5CC5C55555 Quote rule 0 Type codes (jump 025) : 5CC5CCCC7CCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555 Quote rule 0 Type codes (jump 036) : 5CC5CCCCCCCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555 Quote rule 0 Type codes (jump 100) : 5CC5CCCCCCCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555 Quote rule 0 'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10052 sample rows ===== Sampled 10052 rows (handled \n inside quoted fields) at 101 jump points Bytes from first data row on line 2 to the end of last row: 168916517 Line length: mean=2321.10 sd=713.89 min=325 max=11325 Estimated number of rows: 168916517 / 2321.10 = 72775 Initial alloc = 145550 rows (72775 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn] ===== [08] Assign column names [09] Apply user overrides on column types After 0 type and 0 drop user overrides : 5CC5CCCCCCCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555 [10] Allocate memory for the datatable Allocating 46 column slots (46 - 0 dropped) with 145550 rows [11] Read the data jumps=[0..72), chunk_size=2346062, total_size=168916517 Restarting team from jump 27. nSwept==0 quoteRule==1 jumps=[27..72), chunk_size=2346062, total_size=168916517 Restarting team from jump 27. nSwept==0 quoteRule==2 jumps=[27..72), chunk_size=2346062, total_size=168916517 4 out-of-sample type bumps: CCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5C55555 jumps=[0..72), chunk_size=2346062, total_size=168916517 |--------------------------------------------------| |=================================================Error in fread("RePORTER_PRJ_C_FY2016_new.csv", fill = TRUE, verbose = TRUE) : attempt to set index 72493/72493 in SET_STRING_ELT ```
kmichelson commented 3 years ago

Hmm interesting. This is my console output:

This installation of data.table has not been compiled with OpenMP support.
  omp_get_num_procs()            1
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         1
  omp_get_max_threads()          1
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 1 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 1 threads (omp_get_max_threads()=1, nth=1)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file /Users/kenmichelson/Desktop/RePORTER_PRJ_C_FY2016_new.csv
  File opened, size = 161.1MB (168917200 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"APPLICATION_ID","ACTIVITY","A>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 46 fields using quote rule 0
  Detected 46 columns on line 1. This line is either column names or first data row. Line starts as: <<"APPLICATION_ID","ACTIVITY","A>>
  Quote rule picked = 0
  fill=true and the most number of columns found is 46
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (168917198 bytes from row 1 to eof) / (2 * 245154 jump0size) == 344
  Type codes (jump 000)    : 5CC5CCCC5CCCCCC5CCCCC55C5CC5CCCCCCCC5CC2C55552  Quote rule 0
  Type codes (jump 001)    : 5CC5CCCC5CCCCCC5CCCCC5CC5CC5CCCCCCCC5CC2C55552  Quote rule 0
  Type codes (jump 002)    : 5CC5CCCC5CCCCCC5CCCCC5CC5CCCCCCCCCCC5CC5C55555  Quote rule 0
  Type codes (jump 025)    : 5CC5CCCC7CCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555  Quote rule 0
  Type codes (jump 036)    : 5CC5CCCCCCCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555  Quote rule 0
  Type codes (jump 100)    : 5CC5CCCCCCCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555  Quote rule 0
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10052 sample rows
  =====
  Sampled 10052 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 168916517
  Line length: mean=2321.10 sd=713.89 min=325 max=11325
  Estimated number of rows: 168916517 / 2321.10 = 72775
  Initial alloc = 145550 rows (72775 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 11 type and 0 drop user overrides : CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
[10] Allocate memory for the datatable
  Allocating 46 column slots (46 - 0 dropped) with 145550 rows
[11] Read the data
  jumps=[0..72), chunk_size=2346062, total_size=168916517
  Restarting team from jump 27. nSwept==0 quoteRule==1
  jumps=[27..72), chunk_size=2346062, total_size=168916517
  Restarting team from jump 27. nSwept==0 quoteRule==2
  jumps=[27..72), chunk_size=2346062, total_size=168916517