JuliaStats / RDatasets.jl

Julia package for loading many of the data sets available in R
GNU General Public License v3.0

Clean up and add datasets, improve descriptive functions. Fixes #6, fixes #13, fixes #15. #17

Closed by garborg 10 years ago

garborg commented 10 years ago

Enacts most suggestions that have been made regarding the package:

- Add datasets that had been added to Vincent's repo.
- Remove rownames unless they add information.
- Colnames are cleaned in R so that doesn't have to happen on read.
- Combat over-factorization of strings, underuse of integers in R.
- Save data with ('valid') factors as .rda when possible.
- Remove all attributes that break read_rda (functions, formulas, NULL).
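For readers who want a concrete picture of the cleanup rules listed above, here is a rough sketch in Python. It is not the R script from this PR, which is not shown in the thread. Every function name, the CamelCase column convention, and the 0.5 uniqueness threshold are assumptions made purely for illustration.

```python
import re


def rownames_add_information(rownames):
    """Row names that are just "1".."n" carry no information (assumed rule)."""
    default = [str(i) for i in range(1, len(rownames) + 1)]
    return [str(r) for r in rownames] != default


def clean_colname(name):
    """Drop separators such as '.' and capitalize each piece (assumed convention)."""
    pieces = [p for p in re.split(r"[^0-9A-Za-z]+", name) if p]
    return "".join(p[:1].upper() + p[1:] for p in pieces)


def as_integers_if_whole(values):
    """Convert a numeric column to ints only when every value is a whole number."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric and all(float(v).is_integer() for v in values):
        return [int(v) for v in values]
    return list(values)


def looks_like_factor(values, max_unique_ratio=0.5):
    """Treat a string column as a factor only if its values repeat often.

    The 0.5 threshold is a hypothetical cutoff, not one taken from the PR.
    """
    if not values:
        return False
    return len(set(values)) / len(values) <= max_unique_ratio
```

Under these assumptions, a column named `Sepal.Length` would be renamed to `SepalLength`, car-model row names would be kept while default `1..n` row names would be dropped, and a string column with all-distinct values would stay a plain string column rather than becoming a factor.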

garborg commented 10 years ago

Added an .R script so any future refinements can be applied with ease, but I ran comparisons and basically everything is the same or nicer than before. The possible exception I noticed is that row.names (that are kept) are now left as strings in the .rdas, so for .rdas that were formerly .csvs, readtable's typing was probably an improvement in some cases.

johnmyleswhite commented 10 years ago

I totally forgot that this PR existed. I actually just spent two hours making many of the same changes, but there's not perfect overlap. If you don't mind, it would be great if you could see if there's anything of value in the changes I made that isn't superseded by your work. If not, we can just revert my changes and use your changes instead.

The one possible case in which my work is useful relative to your more automated approach is that I renamed many columns by hand, which fixed not only formatting issues, but also some misspellings in the raw datasets.

garborg commented 10 years ago

Definitely, I'll take a look.

garborg commented 10 years ago

I combined our work in #18.

johnmyleswhite commented 10 years ago

Thank you so much for doing this. Hopefully that resolves most of the remaining annoyances with this package.

garborg commented 10 years ago

Of course. Combined with the cleaned colnames and new show methods for DataFrames, the datasets look a lot cleaner, too.