allisonhorst / palmerpenguins

A great intro dataset for data exploration & visualization (alternative to iris).
https://allisonhorst.github.io/palmerpenguins/
Creative Commons Zero v1.0 Universal

Include csv files in repository? #37

Closed: eddelbuettel closed this issue 4 years ago

eddelbuettel commented 4 years ago

Thanks for your efforts in providing this new data set as a standard. I just cloned the repo and noticed one thing missing that I wanted to use for an example: a stored csv.

One thing frequently shown when data wrangling is taught is a remote download from a URL, just as you do here in your data-raw/ directory. And while the package is nicely set up according to CRAN packaging standards and cleanly provides its data, it only provides it to R users of the package which is more limiting than it could be and excludes other users.

Would you consider also writing the data as a csv file so that it could be slurped with a remote csv read? This would offer two benefits not currently covered. The first is more minor: you can "standardize" on a file name by using one, so it will always be palmerpenguins.csv rather than some variant. The second, more importantly, is that you do not close the door to data science users not starting from an R package.

Disk space is reasonably cheap, and the vignettes/ directory alone is 3 MB. The csv export of the data set I just made (for a demo use) clocks in at 14 kB, or less than 1/2 of a percent. So we'd have the space, and I think we'd lose nothing by also offering a downloadable csv. I am more ambivalent about how best to ship it in a package. The data set is so small that I would probably include it as a csv, but given that the whole LazyLoad machinery is set up there is no reason to change this. But having a download target csv would be a nice net gain for some users not currently reached. Thanks for your consideration.

eddelbuettel commented 4 years ago

Come to think of it, there is at least one more reason. When we prepare Debian packages from CRAN packages, we have to explain for each binary file (that is, each .rda) where the source comes from. I did a quick check, and I appear to currently have around 138 Debian packages I maintain unpacked on my box containing a NAMESPACE file (as a quick proxy for a CRAN package), and 51 of them require such an explanation! Here is an example for viridislite. Now, downstream packaging is not normative for CRAN or other best practices, but ... it would still be nice to have a csv for that reason alone. At least for some downstream packagers :)

markvanderloo commented 4 years ago

A csv would also make it readily available for our friends in the Python and Julia world.

btw: I'm already using it in the video I uploaded for my useR2020muc talk :+1:

eddelbuettel commented 4 years ago

> A csv would also make it readily available for our friends in the Python and Julia world.

That is what I had in mind when I wrote "it only provides it to R users of the package which is more limiting than it could be and excludes other users" above. It would equally help R users who want to remotely slurp a csv file, something R's connections API has long supported.

allisonhorst commented 4 years ago

Thanks @eddelbuettel & @markvanderloo, we agree & will be adding the csv shortly.

eddelbuettel commented 4 years ago

Forgot to mention that should you need or want it I'd be happy to send a one-line PR to add the export to csv to the processing file....

catherinenelson1 commented 4 years ago

Hi, I was using the CSV file a few days ago to make a basic ML example in multiple programming languages. It would be great if you could put it back up! Thanks!

apreshill commented 4 years ago

Hi,

This file is back here now to stay: https://github.com/allisonhorst/palmerpenguins/blob/master/data-raw/penguins.csv

Thank you! Alison/Allison

catherinenelson1 commented 4 years ago

For anyone else who wondered where the csv file had moved to, it is now here: https://github.com/allisonhorst/palmerpenguins/tree/master/inst/extdata
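
For readers outside R, here is a minimal sketch of the "remote csv read" discussed above, in Python with only the standard library. The raw-file URL is an assumption derived from the inst/extdata location and the penguins.csv file name; it is not given in this thread, and the branch name may have changed since. The helper names `fetch_text` and `parse_csv` are hypothetical, and the sketch assumes missing values are written as NA, which is what R's `write.csv` does by default.

```python
import csv
import io
import urllib.request

# ASSUMPTION: raw-file URL inferred from the inst/extdata location mentioned
# above; it does not appear in the thread and the branch name may differ.
PENGUINS_URL = (
    "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/"
    "master/inst/extdata/penguins.csv"
)


def fetch_text(url, timeout=30):
    """Download a URL and return its body decoded as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_csv(text, na="NA"):
    """Parse CSV text into a list of dicts, mapping the NA marker to None.

    All other values are left as strings; convert numeric columns as needed.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value == na else value) for key, value in row.items()}
        for row in reader
    ]
```

Calling `parse_csv(fetch_text(PENGUINS_URL))` should then return one dict per row, keyed by the column names in the file's header line. The download and the parsing are kept separate so the parsing step can be tried on any CSV text without network access.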

eddelbuettel commented 4 years ago

Not yet on my box, but I trust that once updated, it will be:


```
R> system.file("extdata", "penguins.csv", package="palmerpenguins")
[1] ""
R>
```