JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/

Cannot round-trip a file (read, write, read) in some circumstances #1140

Open TimG1964 opened 1 week ago

TimG1964 commented 1 week ago

Refer to this discussion on the JuliaLang Discourse:

Can you file an issue against CSV.jl on GitHub? There’s probably a bug when the cut point to attribute parts of the file to tasks is in a particular position.

The error described there is

┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
ERROR: LoadError: TaskFailedException

    nested task error: CSV.Error("thread = 2 fatal error, encountered an invalidly quoted field while parsing around row = 175539, col = 3: \"\"I will undertake a research trip hosted by Michele Bryd-McPhee curator of ‘Ladies of Hip-Hop Festival’ in New York City in March and July 2018 with 3 fundamental areas of enquiry; \n\", error=INVALID: OK | QUOTED | EOF | INVALID_QUOTED_FIELD , check your `quotechar` arguments or manually fix the field in the file itself")
    Stacktrace:
     [1] fatalerror(buf::Vector{UInt8}, pos::Int64, len::Int64, code::Int16, row::Int64, col::Int64)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:590
     [2] parsevalue!(::Type{String}, buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:798
     [3] parserow
       @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:640 [inlined]
     [4] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, ::Type{Tuple{}})
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:550
     [5] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{Vector{CSV.Column}}, rowchunkguess::Int64, i::Int64, rows::Vector{Int64}, wholecolumnslock::ReentrantLock)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:360
     [6] (::CSV.var"#34#39"{CSV.Context, Vector{Vector{CSV.Column}}, Int64, Int64, Vector{Int64}, ReentrantLock})()
       @ CSV C:\Users\TGebbels\.julia\packages\WorkerUtilities\ey0fP\src\WorkerUtilities.jl:384
Stacktrace:
  [1] sync_end(c::Channel{Any})
    @ Base .\task.jl:448
  [2] macro expansion
    @ .\task.jl:480 [inlined]
  [3] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:240
  [4] File
    @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:227 [inlined]
  [5] #File#32
    @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:223 [inlined]
  [6] CSV.File(source::String)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:162
  [7] read(source::String, sink::Type; copycols::Bool, kwargs::@Kwargs{})
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\CSV.jl:117
  [8] read(source::String, sink::Type)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\CSV.jl:113
  [9] top-level scope
    @ c:\Users\TGebbels...\Documents\DCMS Database\CompareCsv.jl:361
 [10] include(fname::String)
    @ Base.MainInclude .\client.jl:489
 [11] run(debug_session::VSCodeDebugger.DebugAdapter.DebugSession, error_handler::VSCodeDebugger.var"#3#4"{String})
    @ VSCodeDebugger.DebugAdapter c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\packages\DebugAdapter\src\packagedef.jl:126
 [12] startdebugger()
    @ VSCodeDebugger c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\packages\VSCodeDebugger\src\VSCodeDebugger.jl:45
 [13] top-level scope
    @ c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\debugger\run_debugger.jl:12
 [14] include(mod::Module, _path::String)
    @ Base .\Base.jl:495
 [15] exec_options(opts::Base.JLOptions)
    @ Base .\client.jl:318
 [16] _start()
    @ Base .\client.jl:552
in expression starting at c:\Users\TGebbels\...\Documents\DCMS Database\CompareCsv.jl:361

nalimilan commented 1 week ago

@quinnj What's interesting is that the error doesn't happen when passing ntasks=1 to CSV.read.
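
For anyone wanting to try the workaround, here is a minimal sketch of a round trip with `ntasks=1`. The table and its contents are made up for illustration and are far too small to trigger multithreaded parsing, so this only shows the call shape, not a reproduction of the bug.

```julia
using CSV, DataFrames

# Made-up data with a multi-line text field, loosely modelled on the
# Description column discussed in this issue.
df = DataFrame(id = 1:3,
               Description = ["plain text", "line one\nline two", "text with, a comma"])

path = tempname() * ".csv"
CSV.write(path, df)   # fields containing newlines or delimiters are quoted on write

# Default read: CSV.jl decides how many tasks to use (large files may be split).
default_read = CSV.read(path, DataFrame)

# ntasks=1 forces single-task parsing, so the file is never split into chunks.
single_read = CSV.read(path, DataFrame; ntasks = 1)

println(isequal(default_read, single_read))
```

On a file the size of the one reported here, the `ntasks = 1` read will likely be slower, but it avoids whatever goes wrong when the file is divided between tasks.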

TimG1964 commented 1 week ago

Is this the same as #1139?

nalimilan commented 1 week ago

Possibly, but hard to tell without having seen the files and/or identified the root cause.

TimG1964 commented 1 week ago

The files are public, from the UK Department for Culture, Media and Sport, here, or via an HTTP.get call to https://nationallottery.dcms.gov.uk/api/v1/grants/csv-export/. The file is typically just over 300MB, but growing; updates are relatively frequent as new grant records are added. At least one field, Description, is a quoted text field that sometimes contains newlines and can be quite lengthy. Unlike the file in #1139, though, only a small proportion of the 700,000 records contain newlines. This may be why the problem is intermittent and depends on sort order.
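
To illustrate why newlines inside quoted fields could matter when a file is divided into chunks, here is a small sketch in plain Julia. It is an illustration only, not CSV.jl's actual chunking code; the function names and the sample text are hypothetical. It compares a naive scan that treats every newline as a record boundary with one that tracks whether the scan is currently inside a quoted field.

```julia
# Illustration only: this is NOT how CSV.jl locates chunk boundaries.

# Naive scan: every newline is assumed to end a record.
naive_boundaries(text::AbstractString) =
    [i for (i, c) in pairs(text) if c == '\n']

# Quote-aware scan: a newline ends a record only when it lies outside a
# quoted field. Escaped quotes are doubled (""), so toggling on every quote
# character leaves the in/out state correct after the pair.
function record_boundaries(text::AbstractString; quotechar::Char = '"')
    boundaries = Int[]
    inquotes = false
    for (i, c) in pairs(text)
        if c == quotechar
            inquotes = !inquotes
        elseif c == '\n' && !inquotes
            push!(boundaries, i)
        end
    end
    return boundaries
end

# Made-up sample: record 1 has a Description that spans two lines.
sample = "id,Description\n1,\"first line\nsecond line\"\n2,plain\n"

println(naive_boundaries(sample))
println(record_boundaries(sample))
```

The naive scan reports one extra boundary, in the middle of the quoted Description. A parser that starts from a cut point without knowing the quote state has to guess which kind of newline it has found, which fits the observation that only some row orderings trigger the error.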

nalimilan commented 1 week ago

Ah, sorry, I hadn't noticed that #1139 includes code to generate the file.