Undefined result when parsing tsv containing double quotes inside a field

I came across this issue while analysing the IMDB dataset available here. After reading it with CSV.jl, I was seeing #undef in my dataframe.

I have narrowed the problem down from the original 8 million lines and 9 columns to this roughly 500-line, one-column file, which still triggers the issue. The file is produced from the original IMDB file by the following command, which also shows exactly which lines and which field of the original were used.

sed -n -e 1p -e 32035,32537p title.basics.tsv | cut -d $'\t' -f 3 > test.tsv
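
For anyone unfamiliar with this kind of pipeline, here is a minimal sketch of the same extraction on a made-up three-column file. The file names sample.tsv and sample_col.tsv and their contents are hypothetical and are not taken from the IMDB data. sed keeps the header plus a line range, and cut keeps the third tab-separated column.

```shell
# Build a tiny tab-separated file; the contents are invented for illustration.
printf 'id\ttype\ttitle\n' > sample.tsv
printf '1\tshort\tAlpha\n' >> sample.tsv
printf '2\tshort\tBeta\n' >> sample.tsv
printf '3\tmovie\t"Gamma\n' >> sample.tsv
printf '4\tmovie\tDelta\n' >> sample.tsv

# Keep the header (line 1) and lines 3-4, then only the third column.
# cut splits on tabs by default, so -d is not needed here.
sed -n -e 1p -e 3,4p sample.tsv | cut -f 3 > sample_col.tsv
cat sample_col.tsv
```

The resulting one-column file keeps any leading double quote from the original field untouched, since neither sed nor cut interprets quotes.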

Deleting any line from this file causes the call CSV.File("test.tsv") to fail with ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type String. With the file as it is, the call succeeds, but the last row contains an undefined value.

The code required to trigger this problem:

using CSV

titles = CSV.File("test.tsv");

titles[end]

This results in

CSV.Row: Error showing value of type CSV.Row:
ERROR: UndefRefError: access to undefined reference

The full stacktrace is shown at the end of this post.

I've included the ; so that the REPL does not display the result of the read. This shows that the read itself succeeds. Without the ;, the REPL previews the file, including the last row, and that preview triggers the same error.

I noticed that the line in question starts with a double quote (and is the first line in this file to do so). That led me to work around the issue by passing quoted=false to CSV.File, which allowed me to read the file correctly.
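
A minimal sketch of how one might locate such lines from the shell, assuming a one-column file like the one above. The file quote_sample.tsv and its contents are hypothetical; on the real data the same grep would be run against test.tsv.

```shell
# Hypothetical one-column sample; in the report the real file is test.tsv.
printf 'title\nAlpha\n"Beta\nGamma "quoted" word\n' > quote_sample.tsv

# Print, with line numbers, every line that begins with a double quote.
grep -n '^"' quote_sample.tsv

# Extract the line number of the first such line.
first_quoted=$(grep -n '^"' quote_sample.tsv | head -n 1 | cut -d : -f 1)
echo "first line starting with a quote: $first_quoted"
```

Only a quote at the very start of the line is matched here. Quotes in the middle of a field, as in the last sample line, are not reported.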

This feels like a parse error to me and I think it should be reported as such while reading the file instead of silently succeeding and passing through undefined values. This is especially problematic because if you pass this through to DataFrame, you don't get any sense that there is something wrong until you try to do something with those particular rows.

Weirdly, when I tried to read the .gz (which I had to create in order to upload the file here) directly with CSV.File("test.tsv.gz"), I saw lots of warnings, but these do not appear when reading the tsv itself.

Versions:

Julia 1.7.2 on macOS Monterey 12.3.1, installed via Homebrew
CSV.jl v0.10.4

Stacktrace promised earlier:

julia> titles[end]
CSV.Row: Error showing value of type CSV.Row:
ERROR: UndefRefError: access to undefined reference
Stacktrace:
  [1] getindex(A::Vector{String}, i1::Int64)
    @ Base ./array.jl:861
  [2] getcolumn
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:24 [inlined]
  [3] (::Tables.var"#1#2"{CSV.Row})(nm::Symbol)
    @ Tables ./none:0
  [4] iterate
    @ ./generator.jl:47 [inlined]
  [5] collect(itr::Base.Generator{Vector{Symbol}, Tables.var"#1#2"{CSV.Row}})
    @ Base ./array.jl:724
  [6] _totuple
    @ ./tuple.jl:349 [inlined]
  [7] Tuple
    @ ./tuple.jl:317 [inlined]
  [8] NamedTuple(r::CSV.Row)
    @ Tables ~/.julia/packages/Tables/PxO1m/src/Tables.jl:195
  [9] show(io::IOContext{Base.TTY}, x::CSV.Row)
    @ Tables ~/.julia/packages/Tables/PxO1m/src/Tables.jl:201
 [10] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, x::CSV.Row)
    @ Base.Multimedia ./multimedia.jl:47
 [11] (::REPL.var"#43#44"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:266
 [12] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:510
 [13] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:259
 [14] display(d::REPL.REPLDisplay, x::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:271
 [15] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:328
 [16] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [17] invokelatest
    @ ./essentials.jl:714 [inlined]
 [18] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:293
 [19] (::REPL.var"#45#46"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:277
 [20] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:510
 [21] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:275
 [22] (::REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:846
 [23] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [24] invokelatest
    @ ./essentials.jl:714 [inlined]
 [25] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/LineEdit.jl:2493
 [26] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:1232
 [27] (::REPL.var"#49#54"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ./task.jl:423

test.tsv.gz
