JuliaStats / RDatasets.jl

Julia package for loading many of the data sets available in R
GNU General Public License v3.0
159 stars 56 forks source link

Update to DataFrames v0.11, CSV 0.2.1, RData 0.3 #48

Closed alyst closed 6 years ago

alyst commented 6 years ago

A part of JuliaStats/DataFrames.jl#1232. Still have to update .yml so that CI loads master versions of DataFrames, CSV and RData.

randyzwitch commented 6 years ago

Thanks @alyst. When this is ready for merging, let me know

alyst commented 6 years ago

@quinnj I've switched from readtable() to CSV.read(), but now I've got errors like:

DataStreams.Data.NullException("encountered a null value for a non-null column type on row = 6, col = 1")
 Stacktrace:
   [1] (::CSV.##4#5)(::Int64, ::Int64) at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:124
   [2] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:233 [inlined]
   [3] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:315 [inlined]
   [4] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:130 [inlined] (repeats 2 times)
   [5] streamfrom at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:205 [inlined]
   [6] macro expansion at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:272 [inlined]
   [7] stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrameStream{Tuple{CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}},WeakRefStrings.WeakRefStringArray{Union{Nulls.Null, WeakRefString{UInt8}},1}}}, ::DataStreams.Data.Schema{true,Tuple{CategoricalArrays.CategoricalValue{String,UInt32},Union{Nulls.Null, WeakRefString{UInt8}}}}, ::Int64, ::Tuple{Base.#identity,Base.#identity}, ::DataStreams.Data.##15#16, ::Array{Any,1}) at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:344
   [8] #stream!#17(::Bool, ::Dict{Int64,Function}, ::Function, ::Array{Any,1}, ::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataFrames.DataFrame}) at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:222
   [9] (::DataStreams.Data.#kw##stream!)(::Array{Any,1}, ::DataStreams.Data.#stream!, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataFrames.DataFrame}) at ./<missing>:0
   [10] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:311
   [11] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
   [12] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...

or

CSV.ParsingException("Unexpected start of quote (34), use \"9234\" to type \"34\"")
  Stacktrace:
   [1] readsplitline!(::Array{CSV.RawField,1}, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::UInt8, ::UInt8, ::UInt8, ::Base.AbstractIOBuffer{Array{UInt8,1}}) at /fs/home/<>/.julia/v0.6/CSV/src/io.jl:98
   [2] #Source#22(::String, ::CSV.Options{Nulls.Null}, ::Int64, ::Int64, ::Array{Type,1}, ::Nulls.Null, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:99
   [3] (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
   [4] #Source#21(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{Type,1}, ::Nulls.Null, ::Nulls.Null, ::UInt8, ::String, ::String, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T, ::String) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:27
   [5] CSV.Source(::String) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:24
   [6] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:310
   [7] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
   [8] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...

or

ArgumentError: Symbol name may not contain \0
 Stacktrace:
   [1] macro expansion at ./broadcast.jl:153 [inlined]
   [2] macro expansion at ./simdloop.jl:73 [inlined]
   [3] macro expansion at ./broadcast.jl:147 [inlined]
   [4] _broadcast! at ./broadcast.jl:139 [inlined]
   [5] broadcast_t at ./broadcast.jl:268 [inlined]
   [6] broadcast_c at ./broadcast.jl:314 [inlined]
   [7] broadcast at ./broadcast.jl:434 [inlined]
   [8] close!(::DataFrames.DataFrameStream{Tuple{CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}}}}) at /fs/home/<>/.julia/v0.6/DataFrames/src/abstractdataframe/io.jl:302
   [9] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:312
   [10] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
   [11] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...

Maybe you have an easy fix/explanations to those.

alyst commented 6 years ago

I've failed to resist the temptation of caching the datasets() and packages() tables in global variables. I'm not sure, though, the way I implemented it follows the recommended code pattern.

quinnj commented 6 years ago

@alyst, can you list the individual files you saw the errors on? Happy to take a look.

alyst commented 6 years ago

The csv.gz issues listed above are resolved. By default I'm using 200 rows for type detection now, and there are some datasets that need more rows to properly infer types. 12 datasets are failing, but mostly because null are encountered beyond 200-th row. So far the only dataset that requires fixing JuliaData/CSV.jl#64 is datasets/attenu.csv.gz, the inferred type of Station column is CategoricalArrays.CategoricalValue{String,UInt32}:

DataStreams.Data.NullException("encountered a null value for a non-null column type on row = 79, col = 3")
  Stacktrace:
   [1] (::CSV.##4#5)(::Int64, ::Int64) at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:124
   [2] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:235 [inlined]
   [3] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:317 [inlined]
   [4] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:131 [inlined] (repeats 2 times)
   [5] streamfrom at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:216 [inlined]

(the CSV.jl line numbers could be off, because I've made some changes to support all-nulls columns):

alyst commented 6 years ago

Some datasets need JuliaData/CSV.jl#115 to load properly

tlnagy commented 6 years ago

Thanks for working on this. I believe RDatasets is only package that needs to be updated before we can patch up Gadfly to work on top of Dataframes v0.11

alyst commented 6 years ago

I'm waiting for JuliaData/CSV.jl#115 to remove [WIP] and consider merging. I hope that PR would be reviewed in a few days.

randyzwitch commented 6 years ago

Just ping me when this is ready for merging

Nosferican commented 6 years ago

@alyst JuliaData/CSV.jl#115 has been merged.

randyzwitch commented 6 years ago

Even with the PR being merged, it still needs to be tagged. And that tag should be the minimum version in REQUIRES here.

Nosferican commented 6 years ago

@quinnj could you tag a minor release? It would help with updating RDatasets and Gadfly.

alyst commented 6 years ago

@Nosferican Thanks! Actually, I haven't forgotten this PR at all :) Besides JuliaData/CSV.jl#115 there are a few other outstanding issues (notably JuliaData/CSV.jl#132), which we are trying to resolve. Once it's done and CI is green for DataFrames and CSV, I'll update this PR and it's ok to merge.

quinnj commented 6 years ago

A new tag of CSV was just merged.

randyzwitch commented 6 years ago

@alyst Travis fails overall because it still mentions 0.4 and 0.5, but 0.6 passes. If you could please modify .travis.yml to just test 0.6 and master, then we should be able to tag a new version of RDatasets

alyst commented 6 years ago

@randyzwitch I hope to have a look tomorrow.

randyzwitch commented 6 years ago

Thanks to @StoneCypher, this is now passing on 0.6 as well. Up to you @alyst if there are more updates you want to do, or if I should merge this in and tag a new release.

tlnagy commented 6 years ago

It would be great to have this merged and a new release tagged as soon as possible. Thanks everyone for getting this up and running. (esp @alyst!)

alyst commented 6 years ago

I've updated REQUIRE, removed enclosing modules from tests (now all test.jl files are executed in their own testset) and finally removed [WIP].

randyzwitch commented 6 years ago

https://github.com/JuliaLang/METADATA.jl/pull/12836