IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

R for Data Science data and other examples for R-Instat help #3831

Open rdstern opened 7 years ago

rdstern commented 7 years ago

The Malawi workshop will also follow, to some extent, the R for Data Science book by Hadley Wickham. We have our own datasets, but I also want to use this opportunity to look into the data used in this book. It would be good to be able to (at least) repeat the analyses he does in the first 2 parts of the book, which cover descriptive statistics and data wrangling.

And we could do worse than use these sets also among our own regular testing data.

So his initial chapter just uses 2 sets from the ggplot2 package, namely diamonds and mpg.
He also uses data from a package (which is now included) called nycflights13. This is a package of just 5 datasets maintained by Hadley Wickham. These 5 datasets include one on weather (because problems with flights are often related to weather), which could be interesting to us in its own right. It is hourly data for one year for 3 locations. It is organised in exactly the format we need for our climatic analyses. We don't yet have anything special for within-day data, but we will. The other 4 datasets are all more obviously linked, so we might want to make them into a single (Instat Object) RDS file and store them additionally as that!

  1. flights has details (19 columns) of 336,776 flights in 2013. Information includes the origin and destination airports.
  2. airports has details about the 1458 airports. So it could link nicely? Except that I would like to link both the origin and dest columns in the flights data frame to my key column (faa) in the airports dataset. That doesn't seem possible yet.
  3. planes has details of the 3322 planes for the flights. These do link, because the column tailnum is in both data frames. My first checking of the keys/links dialogues!
  4. airlines has details of the 16 airlines in the dataset. Set up the key and link to the flights. This is also fine.
  5. Then I tried to do a summary of the delay of the flights for each airline. It should detect the links and add the results to the small airlines data frame. I did N, N not missing, min, mean, max. It worked well (the link worked), but most of the results were NA (for min, mean, max). So I used the option to ignore missing values and got the results I needed.
  6. I suggest this is already an interesting analysis that would not be trivial in R "command mode". In R-Instat it worked ok in terms of time for the calculations. I know this is no longer a large data set, but my machine is not new and 300,000 rows of data to summarise is also not trivial!
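For comparison, here is a minimal sketch of the step 5 summary in plain R. It assumes two tiny stand-in data frames that reuse the nycflights13 column names (carrier, dep_delay, name) with made-up values, and a hypothetical helper summarise_delay. It is not what R-Instat runs internally; it only illustrates why the first attempt gave NA and how ignoring missing values fixes it.

```r
# Tiny stand-ins reusing nycflights13 column names; the values are illustrative only.
flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(5, NA, -2, 10, NA),
  stringsAsFactors = FALSE
)
airlines <- data.frame(
  carrier = c("AA", "UA"),
  name    = c("American Airlines Inc.", "United Air Lines Inc."),
  stringsAsFactors = FALSE
)

# Without na.rm, one missing delay is enough to make a group's mean NA.
mean_with_na <- tapply(flights$dep_delay, flights$carrier, mean)

# Hypothetical helper: the five summaries from step 5, ignoring missing values.
summarise_delay <- function(x) {
  c(n             = length(x),
    n_not_missing = sum(!is.na(x)),
    min           = min(x, na.rm = TRUE),
    mean          = mean(x, na.rm = TRUE),
    max           = max(x, na.rm = TRUE))
}

# One row of summaries per carrier.
by_carrier  <- split(flights$dep_delay, flights$carrier)
stats       <- t(sapply(by_carrier, summarise_delay))
per_carrier <- data.frame(carrier = rownames(stats), stats, row.names = NULL,
                          stringsAsFactors = FALSE)

# Attach the summaries to the small airlines table through the shared key.
airline_summary <- merge(airlines, per_carrier, by = "carrier")
print(airline_summary)
```

With the real flights data the same pattern applies, though a group with no non-missing delays would need extra care because min and max then warn and return Inf or -Inf.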

My suggestion above that we save the dataset as an Instat object would make this demonstration even simpler!
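As a rough sketch of what such a bundled file might look like, the code below looks up airport names for both origin and dest against the faa key, then saves the tables together in one RDS file. The data frames are tiny stand-ins with illustrative values, the columns origin_name and dest_name are hypothetical, and a plain R list is used as a stand-in: the real Instat object format is not shown in this thread and will differ.

```r
# Stand-in tables reusing nycflights13 column names; the values are illustrative only.
flights <- data.frame(
  origin = c("JFK", "LGA", "EWR"),
  dest   = c("LAX", "ORD", "LAX"),
  stringsAsFactors = FALSE
)
airports <- data.frame(
  faa  = c("JFK", "LGA", "EWR", "LAX", "ORD"),
  name = c("John F Kennedy Intl", "La Guardia", "Newark Liberty Intl",
           "Los Angeles Intl", "Chicago Ohare Intl"),
  stringsAsFactors = FALSE
)

# Use the faa key twice: once for the origin column and once for dest.
flights$origin_name <- airports$name[match(flights$origin, airports$faa)]
flights$dest_name   <- airports$name[match(flights$dest, airports$faa)]

# Bundle the linked tables in a plain list (an assumption, not the Instat object format).
bundle   <- list(flights = flights, airports = airports)
rds_path <- tempfile(fileext = ".rds")
saveRDS(bundle, rds_path)

# Reading the file back returns all tables in one step.
restored <- readRDS(rds_path)
str(restored, max.level = 1)
```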

rdstern commented 7 years ago

The second package added is the Lahman data on baseball. Of course very American. I am not so interested in this except it is (apparently) also available as an Access database. I wonder how we can import database data and set up the keys and links at the same time. This is our challenge with CLIMSOFT already. One interesting overview data frame is called LahmanData. This provides details of all the other 24 data frames in this package.

rdstern commented 4 years ago

I have returned to this topic as I write the help for R-Instat. I need a set of example data sets. We have our own, and I have now been systematically through the R for Data Science book. The idea of mixing data sets used in the book with our own data sets has a good ring to it. They are easily available within R-Instat. So here is my list, with a few questions at the end. It is also partly a reference for myself.

First those from the R for Data Science book:

  1. ggplot2 package: He uses the diamonds data set. He also makes use of 2 subsets called smaller and diamonds2 that we could discuss in the help on filters.
  2. ggplot2 package: the mpg dataset is used. (So these are also obvious examples for us to use in the help on the graphics in R-Instat.)
  3. nycflights13 package: discussed above and useful when we describe merge. Also, as noted above, we might include a merged version as an Instat example.
  4. From the datasets package the book uses mtcars and faithful.
  5. From the modelr package the data called heights is used.
  6. From tidyr package the data called who is used. Chapter 12 also uses table1, etc, which I think we could use.
  7. From the forcats package the gss_cat data are used.

That seems to be it. Now "our" data sets.

  1. We have the dodoma data I have been using extensively as an example of climatic data. I will soon rename the existing data dodoma13 and add a dodoma18, with data up to 2018, as well.
  2. I also use the survey data from the rice survey.
  3. I will soon add the IFAD Kenya survey data. It has 1000 records and about 400 variables.
  4. @dannyparsons what set do you suggest for the procurement?
  5. I am still looking for one or two experimental datasets. There are some in agricolae that are possible. Or we can look in the ILRI or ICRAF sets?