JuliaStats / RDatasets.jl

Julia package for loading many of the data sets available in R
GNU General Public License v3.0

Can't open Gzipped dataset #61

Closed by ohadle 5 years ago

ohadle commented 5 years ago
julia> dataset("MASS", "Boston")
ERROR: MethodError: no method matching position(::TranscodingStream{CodecZlib.GzipDecompressor,IOStream})
Closest candidates are:
  position(::IOStream) at iostream.jl:188
  position(::Base.Libc.FILE) at libc.jl:101
  position(::Base.Filesystem.File) at filesystem.jl:225
  ...
Stacktrace:
 [1] consumeBOM!(::TranscodingStream{CodecZlib.GzipDecompressor,IOStream}) at C:\Users\ohadl\.julia\packages\CSV\uLyo0\src\CSV.jl:209
 [2] #File#1(::Int64, ::Bool, ::Int64, ::Nothing, ::Int64, ::Nothing, ::Bool, ::Nothing, ::Bool, ::Array{String,1}, ::String, ::Char, ::Bool, ::Char, ::Nothing, ::Nothing, ::Char, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Dict{Type,Type}, ::Symbol, ::Bool, ::Bool, ::Bool, ::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:rows_for_type_detect,),Tuple{Int64}}}, ::Type, ::TranscodingStream{CodecZlib.GzipDecompressor,IOStream}) at C:\Users\ohadl\.julia\packages\CSV\uLyo0\src\CSV.jl:142
 [3] (::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:delim, :quotechar, :missingstring, :rows_for_type_detect),Tuple{Char,Char,String,Int64}}, ::Type{CSV.File}, ::TranscodingStream{CodecZlib.GzipDecompressor,IOStream}) at .\none:0
 [4] #read#101(::Bool, ::Dict{Int64,Function}, ::Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:delim, :quotechar, :missingstring, :rows_for_type_detect),Tuple{Char,Char,String,Int64}}}, ::Function, ::TranscodingStream{CodecZlib.GzipDecompressor,IOStream}, ::Type) at C:\Users\ohadl\.julia\packages\CSV\uLyo0\src\CSV.jl:304
 [5] (::getfield(CSV, Symbol("#kw##read")))(::NamedTuple{(:delim, :quotechar, :missingstring, :rows_for_type_detect),Tuple{Char,Char,String,Int64}}, ::typeof(CSV.read), ::TranscodingStream{CodecZlib.GzipDecompressor,IOStream}, ::Type) at .\none:0 (repeats 2 times)
 [6] (::getfield(RDatasets, Symbol("##1#2")){String,String})(::TranscodingStream{CodecZlib.GzipDecompressor,IOStream}) at C:\Users\ohadl\.julia\packages\RDatasets\mvYPU\src\dataset.jl:27
 [7] open(::getfield(RDatasets, Symbol("##1#2")){String,String}, ::Type{TranscodingStream{CodecZlib.GzipDecompressor,S} where S<:IO}, ::String, ::String) at C:\Users\ohadl\.julia\packages\TranscodingStreams\SaPZ8\src\stream.jl:157
 [8] dataset(::String, ::String) at C:\Users\ohadl\.julia\packages\RDatasets\mvYPU\src\dataset.jl:26
 [9] top-level scope at none:0

Adding various packages didn't help. Maybe an API changed? This looks similar.

laborg commented 5 years ago

I've looked into the source code and the problem boils down to TranscodingStreams.jl not supporting position and seek (other methods are available though: seekstart, mark, reset).

So either position and seek get implemented there (corresponding issue: https://github.com/bicycle1885/TranscodingStreams.jl/issues/62), or Parsers.jl and CSV.jl change their implementations to use only what is available on all IO objects. I guess the former would be better.
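
For anyone who wants to check this against their own installed versions, here is a minimal sketch. It assumes only that CodecZlib is installed; the CSV payload is made up for illustration. It asks Julia whether position and seek have a method for a gzip decompressor stream and, for comparison, for a plain IOBuffer. No expected output is shown for the stream, because the answer depends on the TranscodingStreams.jl version.

```julia
using CodecZlib  # provides GzipCompressor and GzipDecompressorStream

# Gzip a tiny CSV payload in memory (made-up data, for illustration only).
gz = transcode(GzipCompressor, Vector{UInt8}("a,b\n1,2\n"))

# A decompressing stream, like the one RDatasets hands to CSV in the trace above.
stream = GzipDecompressorStream(IOBuffer(gz))

# The same decompressed bytes, fully read into an in-memory buffer.
buffer = IOBuffer(read(GzipDecompressorStream(IOBuffer(gz))))

# applicable reports whether a method exists for these argument types.
println("position(stream): ", applicable(position, stream))
println("seek(stream, 0):  ", applicable(seek, stream, 0))
println("position(buffer): ", applicable(position, buffer))
println("seek(buffer, 0):  ", applicable(seek, buffer, 0))
```

If the first two lines print false, the installed TranscodingStreams.jl has the gap described above.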

alejandromerchan commented 5 years ago

I had a similar problem with some files after the CSV update. My solution, which feels hacky but works, was to wrap the io in an in-memory buffer: CSV.read(IOBuffer(read(io))). Interestingly, after this I no longer seem to get the "type-detect" issues that were very common with my data. I don't know whether that works here, because I was working with zipped files, but someone may want to try it. I might test it in a local branch later today.
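
A rough sketch of that workaround applied to a gzipped file, in case it helps someone try it. It assumes only CodecZlib; the temporary file and its contents are invented for the example, and the CSV call is left as a comment because its keyword arguments differ between CSV.jl versions. The idea is to decompress everything into memory first, so that the parser receives an IOBuffer, which supports position and seek.

```julia
using CodecZlib  # provides GzipCompressorStream and GzipDecompressorStream

# Write a small gzipped CSV to a temporary file (made-up data, illustration only).
path = tempname() * ".csv.gz"
open(GzipCompressorStream, path, "w") do out
    write(out, "a,b\n1,2\n")
end

# Decompress the whole file into memory and wrap the bytes in an IOBuffer.
buf = open(GzipDecompressorStream, path) do stream
    IOBuffer(read(stream))
end

# Unlike the decompressor stream in the trace above, the buffer can report
# and change its position.
start = position(buf)
skip(buf, 4)   # move past the header line "a,b\n"
seek(buf, start)

# buf can now be passed to the parser in place of the stream, for example
# CSV.read(buf, ...) with whatever arguments your CSV.jl version expects.
```

The cost is that the whole decompressed file sits in memory, which should be fine for datasets the size of those in RDatasets but may not suit very large files.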