Closed laborg closed 5 years ago
Might this fail for some future dataset whose strings can't all fit into memory? I'm not sure, but the current code seems to read it in lazily.
Yes of course. The question is how likely this is going to be:
Telling the future is not possible, but my guess is that not many huge CSV datasets will be added to RDatasets in the next couple of years. From my point of view, this negligible risk is far preferable to having a myriad of examples in published tutorials/books/packages/etc. fail.
I agree that worrying about out-of-memory errors in this package is unlikely. As it is, the package works well as a collection of well-known datasets, but I don't see it growing in size as it's not intended as a dumping ground for all future datasets.
I'll leave this open for comments for another few days, then merge if there are no other objections.
This should be merged despite the mentioned risk. I personally have been somewhat stalled by this and would like a quick fix. I also realize that once the TranscodingStreams.jl/CSV.jl issue is solved, you could change this back to the old behavior.
Hey @randyzwitch, any chance we can get this merged and tagged? It's been breaking Gadfly's tests for weeks now.
By completely decompressing gzipped datasets into a string, we bypass the current problem, caused by TranscodingStreams not having all IOStream functions defined (e.g. `position` or `seek`). Until this is implemented (see https://github.com/bicycle1885/TranscodingStreams.jl/issues/62), this PR fixes:
and helps Gadfly as mentioned here:
Additionally, I've fixed the Ecdat::Mofa dataset, so all tests are passing again on Julia 1.0.1.
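Not the PR's actual code (RDatasets.jl is Julia), but a minimal Python sketch of the same idea, with hypothetical sample data: decompress the gzip fully into a string so the parser sees a seekable in-memory buffer, rather than a streaming decompressor that lacks position/seek support.

```python
import gzip
import io

# Hypothetical stand-in for a gzipped dataset file in the package.
csv_text = "SepalLength,Species\n5.1,setosa\n4.9,setosa\n"
compressed = gzip.compress(csv_text.encode("utf-8"))

# The workaround: decompress everything up front into a string, then wrap
# it in an in-memory buffer that fully supports seek()/tell() -- the
# IOStream-like operations a streaming gzip reader may not provide.
decompressed = gzip.decompress(compressed).decode("utf-8")
buffer = io.StringIO(decompressed)

pos = buffer.tell()            # "position" works on the in-memory buffer
header = buffer.readline()
buffer.seek(pos)               # "seek" works too; a parser can rewind freely
assert buffer.readline() == header
```

The trade-off discussed above applies directly: the whole dataset is held in memory at once, which is fine for small, well-known datasets but would not scale to files larger than RAM.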