Closed laborg closed 5 years ago
Might this fail for some future dataset whose strings can't all fit into memory? I'm not sure, but the current code seems to read it in lazily.
Yes of course. The question is how likely this is going to be:
Telling the future is not possible, but my guess is that not many huge CSV datasets will be added to RDatasets in the next couple of years. From my point of view, this negligible risk is far preferable to having a myriad of examples in published tutorials/books/packages/etc. fail.
I agree that worrying about out-of-memory errors in this package is unlikely. As it is, the package works well as a collection of well-known datasets, but I don't see it growing in size as it's not intended as a dumping ground for all future datasets.
I'll leave this open for comments for another few days, then merge if there are no other objections.
This should be merged despite the mentioned risk. I personally have been somewhat stalled by this and would like a quick fix. I also realize that once the TranscodingStreams.jl/CSV.jl issue is solved, you could change this back to the old behavior.
Hey @randyzwitch, any chance we can get this merged and tagged? It's been breaking Gadfly's tests for weeks now.
By completely decompressing gzipped datasets into a string, we bypass the current problem, caused by TranscodingStreams not having all IOStream functions defined (e.g. `position` or `seek`). Until this is implemented (see https://github.com/bicycle1885/TranscodingStreams.jl/issues/62), this PR fixes:
and helps Gadfly as mentioned here:
Additionally, I've fixed the Ecdat::Mofa dataset, so all tests are passing again on Julia 1.0.1.
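Not the PR's actual code (RDatasets.jl is Julia), but a minimal Python sketch of the same idea, with hypothetical sample data: decompress the gzip fully into a string so the parser sees a seekable in-memory buffer, rather than a streaming decompressor that lacks position/seek support.

```python
import gzip
import io

# Hypothetical stand-in for a gzipped dataset file in the package.
csv_text = "SepalLength,Species\n5.1,setosa\n4.9,setosa\n"
compressed = gzip.compress(csv_text.encode("utf-8"))

# The workaround: decompress everything up front into a string, then wrap
# it in an in-memory buffer that fully supports seek()/tell() -- the
# IOStream-like operations a streaming gzip reader may not provide.
decompressed = gzip.decompress(compressed).decode("utf-8")
buffer = io.StringIO(decompressed)

pos = buffer.tell()            # "position" works on the in-memory buffer
header = buffer.readline()
buffer.seek(pos)               # "seek" works too; a parser can rewind freely
assert buffer.readline() == header
```

The trade-off discussed above applies directly: the whole dataset is held in memory at once, which is fine for small, well-known datasets but would not scale to files larger than RAM.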