JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/

Undefined result when parsing tsv containing double quotes inside a field #1002

Open alchemyst opened 2 years ago

alchemyst commented 2 years ago

I came across this issue when I tried to analyse the IMDB dataset available here. I was seeing #undef in my dataframe after using CSV.jl to read it.

I have narrowed the problem down from the original 8 million lines and 9 columns to this roughly 500-line, one-column file, which triggers the issue. The file is produced from the original IMDB file by the following command, which shows exactly which lines and fields of the original were used.

sed -n -e 1p -e 32035,32537p title.basics.tsv | cut -d $'\t' -f 3 > test.tsv

Deleting any line from this file causes a call to CSV.File("test.tsv") to fail with ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type String. With the file as it is, the call succeeds, but the last row contains an undefined value.

The code required to trigger this problem:

using CSV

titles = CSV.File("test.tsv");

titles[end] 

This results in

CSV.Row: Error showing value of type CSV.Row:
ERROR: UndefRefError: access to undefined reference

Full stacktrace shown at the end of this post

I've included the ; in case you want to run this in the REPL: it suppresses the preview of the result, which shows that the read itself succeeds. Without it, the REPL previews the last row and triggers the same error.

I noticed that the line in question starts with a double quote (and is the first one which does that in this file), which led me to work around this issue by passing quoted=false to CSV.File which allowed me to read the file correctly.
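A minimal sketch of that workaround follows. It does not use the attached file: the file contents and the column name primaryTitle are stand-ins made up for illustration, and the snippet only shows how quoted=false makes CSV.jl treat a leading double quote as ordinary text. It is not claimed to reproduce the bug itself.

```julia
using CSV

# Hypothetical stand-in for test.tsv: one column, with one title that
# starts with a double quote (contents invented for illustration).
path = tempname() * ".tsv"
write(path, "primaryTitle\nPlain title\n\"Quoted\" start of a title\n")

# quoted=false disables quote handling, so the double quote is read
# as a literal character instead of opening a quoted field.
titles = CSV.File(path; delim='\t', quoted=false)

last_title = titles[end].primaryTitle
println(last_title)
```

The trade-off is that fields which really are quoted (for example, to protect an embedded tab) are no longer unwrapped, so this only suits files that never use quoting.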

This feels like a parse error to me and I think it should be reported as such while reading the file instead of silently succeeding and passing through undefined values. This is especially problematic because if you pass this through to DataFrame, you don't get any sense that there is something wrong until you try to do something with those particular rows.
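Until the reader reports this itself, one way to catch the problem early is to scan a column for unassigned entries before using it. The sketch below is only an illustration: undefined_rows is a hypothetical helper, not part of CSV.jl, and the column is a hand-built Vector{String} standing in for an affected column (the stacktrace shows the failing access is on a Vector{String}).

```julia
# Hypothetical helper: indices of a column that hold no assigned value.
undefined_rows(col::AbstractVector) =
    [i for i in eachindex(col) if !isassigned(col, i)]

# Stand-in for an affected column: the last slot is never filled,
# so reading col[3] would throw UndefRefError.
col = Vector{String}(undef, 3)
col[1] = "first title"
col[2] = "second title"

bad = undefined_rows(col)
println("rows with undefined values: ", bad)
```

With a real file, the same check could be applied to a column taken from the CSV.File, for example titles.primaryTitle, assuming the column has that name.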

Weirdly, when I tried to read the .gz that I had to create for this upload directly with CSV.File("test.tsv.gz"), I saw lots of warnings, but these do not appear when reading the tsv itself.

Versions:

Stacktrace promised earlier:

julia> titles[end]
CSV.Row: Error showing value of type CSV.Row:
ERROR: UndefRefError: access to undefined reference
Stacktrace:
  [1] getindex(A::Vector{String}, i1::Int64)
    @ Base ./array.jl:861
  [2] getcolumn
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:24 [inlined]
  [3] (::Tables.var"#1#2"{CSV.Row})(nm::Symbol)
    @ Tables ./none:0
  [4] iterate
    @ ./generator.jl:47 [inlined]
  [5] collect(itr::Base.Generator{Vector{Symbol}, Tables.var"#1#2"{CSV.Row}})
    @ Base ./array.jl:724
  [6] _totuple
    @ ./tuple.jl:349 [inlined]
  [7] Tuple
    @ ./tuple.jl:317 [inlined]
  [8] NamedTuple(r::CSV.Row)
    @ Tables ~/.julia/packages/Tables/PxO1m/src/Tables.jl:195
  [9] show(io::IOContext{Base.TTY}, x::CSV.Row)
    @ Tables ~/.julia/packages/Tables/PxO1m/src/Tables.jl:201
 [10] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, x::CSV.Row)
    @ Base.Multimedia ./multimedia.jl:47
 [11] (::REPL.var"#43#44"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:266
 [12] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:510
 [13] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:259
 [14] display(d::REPL.REPLDisplay, x::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:271
 [15] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:328
 [16] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [17] invokelatest
    @ ./essentials.jl:714 [inlined]
 [18] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:293
 [19] (::REPL.var"#45#46"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:277
 [20] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:510
 [21] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:275
 [22] (::REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:846
 [23] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [24] invokelatest
    @ ./essentials.jl:714 [inlined]
 [25] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/LineEdit.jl:2493
 [26] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:1232
 [27] (::REPL.var"#49#54"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ./task.jl:423

test.tsv.gz

cocoa1231 commented 10 months ago

+1, I am also facing the same issue. My tsv is around 250 MB so I can't upload it, but it can be downloaded from athena.ohdsi.org (SNOMED dataset). Julia 1.10.0 on Debian 12 with CSV v0.10.12 and DataFrames v1.6.1.