I came across this issue when I tried to analyse the IMDB dataset available here. I was seeing #undef in my dataframe after using CSV.jl to read it.
I have narrowed down the problem from the original 8 million lines and 9 columns to this 500-line, one-column file which triggers the issue. The file is produced from the original IMDB file by the following command, which shows exactly which lines and fields from the original were used.
Deleting any line from this file causes a call to CSV.File("test.tsv") to fail with ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type String. With the file as it is, the call succeeds, but the last row contains an undefined value.
The code required to trigger this problem:
using CSV
titles = CSV.File("test.tsv");
titles[end]
This results in
CSV.Row: Error showing value of type CSV.Row:
ERROR: UndefRefError: access to undefined reference
Full stacktrace shown at the end of this post
I've included the ; in case you want to run this in the REPL: it suppresses the preview of the result, which shows that the read itself succeeds. The titles[end] call then makes the REPL display the last row, and that is what triggers the error.
I noticed that the line in question starts with a double quote (and is the first line in this file to do so). That led me to a workaround: passing quoted=false to CSV.File lets me read the file correctly.
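To check whether other rows might be affected in the same way, one option is to scan the file for lines that begin with a double quote. The following is only a minimal sketch using Base Julia; the helper name quote_leading_lines and the sample titles are made up for illustration and are not part of CSV.jl or the attached file.

```julia
# Hypothetical helper (not part of CSV.jl): collect the 1-based numbers of
# all lines whose first character is a double quote.
function quote_leading_lines(io::IO)
    hits = Int[]
    for (n, line) in enumerate(eachline(io))
        # startswith is false for empty lines, so they are skipped safely
        startswith(line, "\"") && push!(hits, n)
    end
    return hits
end

# Made-up sample data standing in for the real file
sample = IOBuffer("title\nCarmencita\n\"Quoted\" title\nMiss Jerry\n")
println(quote_leading_lines(sample))
```

On the real file this could be called as open(quote_leading_lines, "test.tsv"), assuming the file is in the current directory.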
This feels like a parse error to me, and I think it should be reported as such while reading the file instead of silently succeeding and passing undefined values through. This is especially problematic because if you pass the result on to DataFrame, nothing indicates that something is wrong until you try to do something with those particular rows.
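Until the parser reports this itself, a column can be checked for undefined entries before it is used. The sketch below uses only Base Julia's isassigned; the helper name undef_indices and the hand-built vector are illustrative assumptions, and I have not verified how every CSV.jl column type behaves with it.

```julia
# Hypothetical helper: indices of entries that are unassigned (shown as #undef).
undef_indices(col::AbstractVector) =
    [i for i in eachindex(col) if !isassigned(col, i)]

# Hand-built stand-in for a column whose last entry was never set
col = Vector{String}(undef, 3)
col[1] = "Carmencita"
col[2] = "Miss Jerry"

println(undef_indices(col))
```

An empty result means every entry can be accessed safely; any index it returns would raise UndefRefError when read.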
Weirdly, when I tried to read the .gz that I had to create for the upload directly with CSV.File("test.tsv.gz"), I saw lots of warnings, but these do not appear when reading the tsv itself.
Versions:
Julia 1.7.2 on macOS Monterey 12.3.1, installed via Homebrew
+1 I am also facing the same issue. My tsv is around 250 MB, so I can't upload it, but it can be downloaded from athena.ohdsi.org (SNOMED dataset). Julia 1.10.0 on Debian 12 with CSV v0.10.12 and DataFrames v1.6.1.
Stacktrace promised earlier:
test.tsv.gz