Closed alyst closed 6 years ago
Thanks @alyst. When this is ready for merging, let me know
@quinnj I've switched from readtable()
to CSV.read()
, but now I've got errors like:
DataStreams.Data.NullException("encountered a null value for a non-null column type on row = 6, col = 1")
Stacktrace:
[1] (::CSV.##4#5)(::Int64, ::Int64) at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:124
[2] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:233 [inlined]
[3] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:315 [inlined]
[4] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:130 [inlined] (repeats 2 times)
[5] streamfrom at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:205 [inlined]
[6] macro expansion at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:272 [inlined]
[7] stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrameStream{Tuple{CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}},WeakRefStrings.WeakRefStringArray{Union{Nulls.Null, WeakRefString{UInt8}},1}}}, ::DataStreams.Data.Schema{true,Tuple{CategoricalArrays.CategoricalValue{String,UInt32},Union{Nulls.Null, WeakRefString{UInt8}}}}, ::Int64, ::Tuple{Base.#identity,Base.#identity}, ::DataStreams.Data.##15#16, ::Array{Any,1}) at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:344
[8] #stream!#17(::Bool, ::Dict{Int64,Function}, ::Function, ::Array{Any,1}, ::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataFrames.DataFrame}) at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:222
[9] (::DataStreams.Data.#kw##stream!)(::Array{Any,1}, ::DataStreams.Data.#stream!, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataFrames.DataFrame}) at ./<missing>:0
[10] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:311
[11] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
[12] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...
or
CSV.ParsingException("Unexpected start of quote (34), use \"9234\" to type \"34\"")
Stacktrace:
[1] readsplitline!(::Array{CSV.RawField,1}, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::UInt8, ::UInt8, ::UInt8, ::Base.AbstractIOBuffer{Array{UInt8,1}}) at /fs/home/<>/.julia/v0.6/CSV/src/io.jl:98
[2] #Source#22(::String, ::CSV.Options{Nulls.Null}, ::Int64, ::Int64, ::Array{Type,1}, ::Nulls.Null, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:99
[3] (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
[4] #Source#21(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{Type,1}, ::Nulls.Null, ::Nulls.Null, ::UInt8, ::String, ::String, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T, ::String) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:27
[5] CSV.Source(::String) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:24
[6] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:310
[7] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
[8] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...
or
ArgumentError: Symbol name may not contain \0
Stacktrace:
[1] macro expansion at ./broadcast.jl:153 [inlined]
[2] macro expansion at ./simdloop.jl:73 [inlined]
[3] macro expansion at ./broadcast.jl:147 [inlined]
[4] _broadcast! at ./broadcast.jl:139 [inlined]
[5] broadcast_t at ./broadcast.jl:268 [inlined]
[6] broadcast_c at ./broadcast.jl:314 [inlined]
[7] broadcast at ./broadcast.jl:434 [inlined]
[8] close!(::DataFrames.DataFrameStream{Tuple{CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}}}}) at /fs/home/<>/.julia/v0.6/DataFrames/src/abstractdataframe/io.jl:302
[9] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:312
[10] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
[11] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...
Maybe you have an easy fix/explanations to those.
I've failed to resist the temptation of caching the datasets()
and packages()
tables in global variables. I'm not sure, though, the way I implemented it follows the recommended code pattern.
@alyst, can you list the individual files you saw the errors on? Happy to take a look.
The csv.gz
issues listed above are resolved. By default I'm using 200 rows for type detection now, and there are some datasets that need more rows to properly infer types.
12 datasets are failing, but mostly because null are encountered beyond 200-th row.
So far the only dataset that requires fixing JuliaData/CSV.jl#64 is datasets/attenu.csv.gz
, the inferred type of Station
column is CategoricalArrays.CategoricalValue{String,UInt32}
:
DataStreams.Data.NullException("encountered a null value for a non-null column type on row = 79, col = 3")
Stacktrace:
[1] (::CSV.##4#5)(::Int64, ::Int64) at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:124
[2] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:235 [inlined]
[3] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:317 [inlined]
[4] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:131 [inlined] (repeats 2 times)
[5] streamfrom at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:216 [inlined]
(the CSV.jl line numbers could be off, because I've made some changes to support all-nulls columns):
Some datasets need JuliaData/CSV.jl#115 to load properly
Thanks for working on this. I believe RDatasets is only package that needs to be updated before we can patch up Gadfly to work on top of Dataframes v0.11
I'm waiting for JuliaData/CSV.jl#115 to remove [WIP] and consider merging. I hope that PR would be reviewed in a few days.
Just ping me when this is ready for merging
@alyst JuliaData/CSV.jl#115 has been merged.
Even with the PR being merged, it still needs to be tagged. And that tag should be the minimum version in REQUIRES here.
@quinnj could you tag a minor release? It would help with updating RDatasets and Gadfly.
@Nosferican Thanks! Actually, I haven't forgotten this PR at all :) Besides JuliaData/CSV.jl#115 there are a few other outstanding issues (notably JuliaData/CSV.jl#132), which we are trying to resolve. Once it's done and CI is green for DataFrames and CSV, I'll update this PR and it's ok to merge.
A new tag of CSV was just merged.
@alyst Travis fails overall because it still mentions 0.4 and 0.5, but 0.6 passes. If you could please modify .travis.yml to just test 0.6 and master, then we should be able to tag a new version of RDatasets
@randyzwitch I hope to have a look tomorrow.
Thanks to @StoneCypher, this is now passing on 0.6 as well. Up to you @alyst if there are more updates you want to do, or if I should merge this in and tag a new release.
It would be great to have this merged and a new release tagged as soon as possible. Thanks everyone for getting this up and running. (esp @alyst!)
I've updated REQUIRE, removed enclosing modules from tests (now all test.jl files are executed in their own testset) and finally removed [WIP].
A part of JuliaStats/DataFrames.jl#1232. Still have to update
.yml
so that CI loads master versions of DataFrames, CSV and RData.