JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/

Error reading CSV - missing lines #1096

Open Thiago-Simoes opened 1 year ago

Thiago-Simoes commented 1 year ago

I'm trying to read a CSV and always assumed everything was fine, but I recently noticed some lines were missing. I don't understand why. I wrote a function to parse the CSV manually and it worked, but when using the package some lines are lost.

The file has 34034 lines. Reading it with my function returns a 34034-row dataframe, but with the package the dataframe has only 33704 rows. Almost 1% of the lines have problems.

The file is attached, hope someone can help. File: fi.csv

jeremiedb commented 1 year ago

Loading the file with both the latest CSV release (v0.10.11) and an earlier one (v0.10.8) results in the same 33704 rows:

using CSV

path = joinpath(@__DIR__, "fi.csv")
# ntasks=1 disables multithreaded parsing; rows_to_check widens type inference
file = CSV.File(path; ntasks=1, rows_to_check=30000);

I noted, however, that the header row as well as several other rows had 33 ";" delimiters, while others, such as the second row, had only 22. This points to an inconsistency in the CSV data itself.
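Inconsistent delimiter counts like this can be spotted without parsing the file at all. A quick shell sketch (the file name fi.csv comes from the report above) tallies how many ";" delimiters each line contains:

```shell
# For each line, print the number of ";" delimiters (fields minus one),
# then count how many lines share each delimiter count.
awk -F';' '{ print NF - 1 }' fi.csv | sort -n | uniq -c
```

A consistent file produces a single output row; each additional row is a group of lines with a different delimiter count worth inspecting.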

Liozou commented 1 year ago

The issue is the single quote mark at line 5741 of the csv (column 288). The next quote mark is at line 6071, so everything in between is treated as one quoted string, which accounts for a single field value... And for some reason it is silently converted to a missing value (which may be fixable here, if it actually is an issue?). So 6071 - 5741 == 330 lines are lost, which accounts for your missing lines.

To get the correct file, you can use the quoted=false option (e.g. file = CSV.File(path; quoted=false)) to simply ignore quote marks, or you can remove the offending quote mark at line 5741. I would also suggest removing the other single quote marks at lines 3701, 24956, and 32356.
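To locate every stray quote mark before re-parsing, a grep sketch (again assuming the file name fi.csv from the report) prints each offending line together with its line number:

```shell
# Print line numbers of every line containing a double-quote character;
# in a file that uses no quoting, each hit is a candidate for removal.
grep -n '"' fi.csv
```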

It's obviously bad that lines can silently be "lost" by the parser, but of course it's very difficult to correctly handle malformed data files... I don't know what should be done here.