Open kmichelson opened 3 years ago
The error is accurate: if you open the file in a text editor, the line ends without the closing quote at
implementation of quality improvement;
and the next line starts with:
; ","VARIATIONS IN IMPLEMENTATION OF QUALITY INTERVENTIONS:
where the third character, the quote, should ideally have been on the previous line. fread is telling you that the file was created incorrectly, and there is no rule a computer could follow to resolve that without ambiguity. fread is handling it gracefully: it tells you where the problem is and stops, so that you can fix the file.
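For anyone who wants to find such lines without scrolling through a large file, here is a minimal base R sketch. The sample lines are hypothetical and only mimic the pattern above. An odd number of quote characters on a line means a quoted field is opened or closed there without its partner. This also flags legitimate quoted fields that contain a newline, so treat the result as a list of candidates to inspect.

```r
# Hypothetical sample: the second record's quoted field is left open at the
# end of its line and is only closed at the start of the following line.
lines <- c(
  '"ID","TERMS","TITLE"',
  '"1","implementation of quality improvement;',
  '; ","VARIATIONS IN IMPLEMENTATION"',
  '"2","health services;","ANOTHER TITLE"'
)

# Count the double-quote characters on each line.
count_quotes <- function(x) {
  lengths(regmatches(x, gregexpr('"', x, fixed = TRUE)))
}

n_quotes <- count_quotes(lines)

# Lines with an odd count have an unmatched quote and deserve a closer look.
odd_lines <- which(n_quotes %% 2 == 1)
print(odd_lines)
```

For a real file, lines could come from readLines() on the CSV path instead of the inline vector.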
Excel also reads the file incorrectly for me:
I used Data > From Text/CSV and rechecked the Power Query configuration to ensure that the correct separator (comma) is used. Could you recheck how Excel handles the file on your system?
Interesting. My Excel appears to handle it a little better. I see why the line is ambiguous as written - I am proposing that fread could gracefully skip reading the line, throw a warning, and read the rest of the file.
Wouldn't skipping require forgoing two lines in this case (as the file is written)? It would be difficult for any system to work out that the second line has to be ignored along with the first, and to get that right in every situation. Just my two cents on that specific feature request.
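To illustrate why two physical lines are involved, here is a small base R sketch on hypothetical sample lines (not the real file). It assumes the broken record has already been located by hand, drops both of its lines, and parses what is left.

```r
# Hypothetical sample mimicking the file: one record is split over lines 2 and 3.
lines <- c(
  '"ID","TERMS","TITLE"',
  '"1","implementation of quality improvement;',
  '; ","VARIATIONS IN IMPLEMENTATION"',
  '"2","health services;","ANOTHER TITLE"'
)

# Assumption: these line numbers were identified by inspecting the file.
bad <- c(2L, 3L)

# Drop both halves of the broken record; dropping only one would leave a
# dangling quote behind and shift the remaining fields.
kept <- lines[-bad]

# Parse the remaining, well-formed lines.
df <- read.csv(text = paste(kept, collapse = "\n"), stringsAsFactors = FALSE)
print(df)
```

The same kept lines could presumably be handed to fread through its text argument. The point is only that the choice of which lines to drop is made by a person, not by the parser.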
I'm sure you are aware, but for the benefit of others coming across this thread - you can use fill = TRUE, and manually filter out these two lines. Not the best solution, but accessible in case you can't edit the file.
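A rough sketch of that workaround, using a hypothetical ragged input rather than the NIH file. The column names and the choice of filter column are assumptions for illustration. How fill = TRUE splits the actual broken record may differ, so check the affected rows before filtering.

```r
library(data.table)

# Hypothetical ragged input: the third line is one field short.
csv_text <- c(
  "id,title,year",
  "1,alpha,2016",
  "2,beta",
  "3,gamma,2016"
)

# fill = TRUE pads short rows with NA instead of stopping at them.
dt <- fread(text = csv_text, fill = TRUE, header = TRUE)

# Manually filter out the padded row, here by testing a column that
# should never be missing in a well-formed record.
clean <- dt[!is.na(year)]
print(clean)
```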
I see your point and suppose there is not a generalizable solution that won't cause other problems. For what it's worth, fill=TRUE caused R to crash for me (see below), so I ended up reading the file by saving from Excel and reading the resultant file, which worked without issue.
Now that appears to be a bug. This is the verbose output (didn't cause a crash for me):
Hmm interesting. This is my console output:
This installation of data.table has not been compiled with OpenMP support.
omp_get_num_procs() 1
R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
R_DATATABLE_NUM_THREADS unset
R_DATATABLE_THROTTLE unset (default 1024)
omp_get_thread_limit() 1
omp_get_max_threads() 1
OMP_THREAD_LIMIT unset
OMP_NUM_THREADS unset
RestoreAfterFork true
data.table is using 1 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 1 threads (omp_get_max_threads()=1, nth=1)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file /Users/kenmichelson/Desktop/RePORTER_PRJ_C_FY2016_new.csv
File opened, size = 161.1MB (168917200 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<"APPLICATION_ID","ACTIVITY","A>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 46 fields using quote rule 0
Detected 46 columns on line 1. This line is either column names or first data row. Line starts as: <<"APPLICATION_ID","ACTIVITY","A>>
Quote rule picked = 0
fill=true and the most number of columns found is 46
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because (168917198 bytes from row 1 to eof) / (2 * 245154 jump0size) == 344
Type codes (jump 000) : 5CC5CCCC5CCCCCC5CCCCC55C5CC5CCCCCCCC5CC2C55552 Quote rule 0
Type codes (jump 001) : 5CC5CCCC5CCCCCC5CCCCC5CC5CC5CCCCCCCC5CC2C55552 Quote rule 0
Type codes (jump 002) : 5CC5CCCC5CCCCCC5CCCCC5CC5CCCCCCCCCCC5CC5C55555 Quote rule 0
Type codes (jump 025) : 5CC5CCCC7CCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555 Quote rule 0
Type codes (jump 036) : 5CC5CCCCCCCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555 Quote rule 0
Type codes (jump 100) : 5CC5CCCCCCCCCCC5CCCCC5CC5CCCCCCCCCCCCCC5C55555 Quote rule 0
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10052 sample rows
=====
Sampled 10052 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 2 to the end of last row: 168916517
Line length: mean=2321.10 sd=713.89 min=325 max=11325
Estimated number of rows: 168916517 / 2321.10 = 72775
Initial alloc = 145550 rows (72775 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 11 type and 0 drop user overrides : CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
[10] Allocate memory for the datatable
Allocating 46 column slots (46 - 0 dropped) with 145550 rows
[11] Read the data
jumps=[0..72), chunk_size=2346062, total_size=168916517
Restarting team from jump 27. nSwept==0 quoteRule==1
jumps=[27..72), chunk_size=2346062, total_size=168916517
Restarting team from jump 27. nSwept==0 quoteRule==2
jumps=[27..72), chunk_size=2346062, total_size=168916517
# Minimal reproducible example
I first downloaded a list of federal grants: https://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2016.zip
The path to the extracted file was placed in the variable csvfile.
I then get the following warning that stops further reading:
When I opened the CSV in Excel, it had no trouble loading, and I could not detect any issues with the line. Even if there is a problem with the CSV line itself, fread should at least gracefully handle the problem and move on. Thanks for considering the issue.
# Output of sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lubridate_1.7.10 bit64_4.0.5 bit_4.0.4 stringr_1.4.0 data.table_1.14.0

loaded via a namespace (and not attached):
[1] compiler_4.1.0 magrittr_2.0.1 generics_0.1.0 tools_4.1.0 Rcpp_1.0.6 tinytex_0.31
[7] stringi_1.6.2 xfun_0.23