Slight modifications to original datasets

JuliaStats / RDatasets.jl

Julia package for loading many of the data sets available in R

GNU General Public License v3.0

Slight modifications to original datasets #22

Closed nignatiadis closed 10 years ago

nignatiadis commented 10 years ago

Hi! I'd like to ask whether there are general guidelines regarding the datasets that can be added to this repository. In particular:

1) A package implements its own class. An object of this class basically consists of some metadata and a dataframe. Of the included example datasets, I just want to add the corresponding dataframes (and not the metadata) to RDatasets.jl.

2) Calling data("dataname") returns a list of 3 similar dataframes. Instead of keeping the list, I would vertically merge those 3 dataframes and add an extra column to distinguish them.

Would such datasets be welcome, or should I refrain from adding them in such a form? And if I add them, how should the "modifications" be annotated?

(The package in question is adehabitatLT.)

johnmyleswhite commented 10 years ago

What's the use case? Why not just add 3 new datasets?

nignatiadis commented 10 years ago

It would simply feel more natural, since the 3 datasets capture exactly the same information (this is why the authors also combine them in a list). The same measurements are taken for 3 animals of the same species using the same methods. Here is an example of such a list with 4 datasets:

> data(porpoise)
> sapply(porpoise, function(x) attr(x,"id"))
[1] "GUS"      "David"    "Mitchell" "Eric"

Thus it would feel more natural to append an :id column and vcat the dataframes, rather than create 4 dataframes (porpoise_GUS, porpoise_David,...). The use case is that you would almost always want to analyze those datasets together. Loading them all together as a vector/list of DataFrames, as done in the R package, would not be possible with the current RDatasets API.

Then again, decreasing the consistency across the RDatasets package for such a limited use case would also be very bad, which is why I am asking.
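The merge described above can be sketched as follows. This is a minimal illustration, not code from the thread or from RDatasets: the table contents, the column names `x` and `y`, and the helper `with_id` are hypothetical, and only the ids "GUS" and "David" come from the R output shown earlier.

```julia
using DataFrames

# Hypothetical stand-ins for the per-animal tables; the real porpoise data
# has different columns and values.
tracks = [
    "GUS"   => DataFrame(x = [1.0, 2.0], y = [0.5, 0.7]),
    "David" => DataFrame(x = [3.0], y = [1.2]),
]

# Hypothetical helper: return a copy of `df` with an `id` column that
# repeats the animal's identifier on every row.
function with_id(id, df)
    out = copy(df)
    out.id = fill(id, nrow(out))
    return out
end

# Tag each table, then stack all of them vertically into one DataFrame.
merged = reduce(vcat, [with_id(id, df) for (id, df) in tracks])
```

Grouping `merged` by `:id` recovers the per-animal tables, so the list structure of the R object is not lost, only flattened.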

johnmyleswhite commented 10 years ago

> Then again, decreasing the consistency across the RDatasets package for such a limited use case would also be very bad, which is why I am asking.

Yeah, I'm not sure this is the way to go. What you're suggesting would make RDatasets type-unstable, which is problematic in many different ways. The only solution that would be reasonable would be to always return a Dict, which would be overkill for every other dataset.

In general, it's really hard to look to R for inspiration for Julia, since R frequently uses functions that aren't type-stable. R functions like data also violate scope rules, which Julia will not allow.
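To make the type-stability concern concrete, here is a small sketch in plain Julia. Both loaders are hypothetical and are not the RDatasets API; plain vectors and a named tuple stand in for tables. `Base.return_types` is an internal reflection helper, used here only to inspect what the compiler infers.

```julia
# Hypothetical loader: the return type depends on the *value* of `name`,
# so the compiler can only infer a Union of the two possibilities.
function load_unstable(name::String)
    if name == "porpoise"
        return [[1.0, 2.0], [3.0]]   # a collection of tables
    else
        return [1.0, 2.0, 3.0]       # a single table
    end
end

# Hypothetical stable alternative: every dataset is a single table, and an
# `id` field marks which source each row came from.
load_stable(name::String) = (id = ["GUS", "GUS", "David"], x = [1.0, 2.0, 3.0])

# Ask the compiler which return type it infers for a String argument.
inferred_unstable = Base.return_types(load_unstable, (String,))[1]
inferred_stable = Base.return_types(load_stable, (String,))[1]

# A concrete inferred type means callers can be compiled without dynamic dispatch.
println(isconcretetype(inferred_unstable))
println(isconcretetype(inferred_stable))
```

Under these assumptions, merging with an id column keeps a single return type for every dataset, which is the property the loader needs.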

nignatiadis commented 10 years ago

Well, appending the :id column and merging into one data frame would not make it type-unstable, but it would certainly make it less consistent.

And yeah, I guess this is one of the really great things about Julia. Thanks for the reply!

johnmyleswhite commented 10 years ago

I'd be happy to share these datasets as a merged whole with an additional ID column if you think enough people will want that.

nignatiadis commented 10 years ago

Well, I am not really sure how many people would want that. But... there's at least 1 person :p who would. And the relevant publication (mentioned on CRAN) has apparently been cited 481 times.

I'll try to send the pull request sometime tonight.

Thanks again :).

garborg commented 10 years ago

Closed by #23