For those keeping score at home, here's my summary of responses to your excellent review of things to consider for the CMU OLI datasets:
On Jun 29, 2012, at 9:51 PM, Randall Pruim wrote:
Pending the results of Nick's discussions with the folks from CMU, we should do some data cleaning in the OLI data sets. Perhaps we can use this issue to keep track of things we want to do.
Here are some starting suggestions based on looking at the head() of each dataset briefly (and sometimes glancing at the documentation):
- rename variables using the lower case convention from the rest of the package
done for the most egregious case (EPA)
- rename Actor to Actors
done
- rename either the Friends data frame or its Friends variable (in general, I think it is confusing when there is a variable with the same name as the data set)
done
- rename the Height data set (there is more than height in there and height is one of the variables -- a double whammy)
done
- rename Population (and posthumously give it an award for stupidest data name ever?)
not done (though I'll second the nomination)
- rename Ratings (too generic)
not done (though I see your point)
- add additional variables to data sets, when possible (e.g., the documentation for sleep says it is part of a larger study; if more data are available, it would be nice to get them. Within reason, more variables don't hurt, even if you want to do univariate stuff)
we can add this if we get it (but at present, they provide almost no context and background)
- reshape Sleep2 so that rows correspond to observational units. Variables might be type (of student) and sleep
done
- rename Time (perhaps ExerciseTime) -- it would be nice if there were additional variables in this data set.
not done
Feel free to add other things as you come across them.
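A minimal R sketch of two of the suggestions above (one row per observational unit for a Sleep2-style data set, and lower-case variable names). The real layout of Sleep2 is not shown in this thread, so the wide data frame below, its column names, and its values are all hypothetical and only illustrate the idea:

```r
# HYPOTHETICAL wide layout: one column per type of student.
# This is toy data, not the actual Sleep2 data set.
sleep_wide <- data.frame(
  Undergrad = c(7.0, 6.5, 8.0),
  Graduate  = c(6.0, 7.5, 5.5)
)

# stack() turns the columns into rows: `values` holds the measurements
# and `ind` is a factor recording which column each value came from.
sleep_long <- stack(sleep_wide)

# Rename to the variables suggested above: sleep and type (of student).
names(sleep_long) <- c("Sleep", "Type")

# Apply the lower-case naming convention used in the rest of the package.
names(sleep_long) <- tolower(names(sleep_long))

head(sleep_long)
```

Base R's stack() is used here only to keep the sketch free of extra dependencies; a reshaping package would work just as well.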
Reply to this email directly or view it on GitHub: https://github.com/rpruim/mosaic/issues/121
Nicholas Horton Department of Mathematics and Statistics, Smith College Clark Science Center, Northampton, MA 01063-0001 http://www.math.smith.edu/~nhorton
I agree with many of these ideas.
More generally, I wonder about encouraging them to focus on a smaller set of better datasets which lend themselves to more than one question. The redundancy of Sleep and Sleep2, as well as the single-variable datasets, particularly jumps out as suboptimal.
Just my $0.02,
Nick
On Jul 4, 2012, at 10:06 AM, Randall Pruim wrote:
Just recording some thoughts for potential future cleaning activities:
- Cellphones is perhaps a strange name for this data set. Perhaps CMUstudents? What do you think about satm and satv for new names for the first two variables?
> head(Cellphones)
  Math Verbal Credits Year Exer Sleep Cell Veg
1  640    470      15    1   60   7.0  yes  no
2  660    650      14    1   20   7.5  yes  no
3  550    580      15    2    0   9.0   no  no
- It is strange that Computers has no variable for satisfaction, since that was the context in which the data were collected.
- I would like to see us recode all categorical variables represented as integers. That's so 80s.
- There are some variables (sex/gender in particular) that appear in multiple data sets with different names. It would be good to smooth that out.
- I would change the name of Olympics to Olympics1500 since the data are about a single race format.
- I haven't looked at all of the examples for the data sets, but it would be nice to include some good example usage for these (once we have them all cleaned up and ready to go).
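A rough R sketch of the renaming and recoding ideas above, using only the three rows of Cellphones shown in the head() output. The name CMUstudents and the names satm and satv are the ones floated above; the labels for Year are an assumption made for illustration, since the thread does not say what the integer codes mean:

```r
# The three rows of Cellphones shown in the head() output in this thread.
Cellphones <- data.frame(
  Math = c(640, 660, 550), Verbal = c(470, 650, 580),
  Credits = c(15, 14, 15), Year = c(1, 1, 2),
  Exer = c(60, 20, 0), Sleep = c(7.0, 7.5, 9.0),
  Cell = c("yes", "yes", "no"), Veg = c("no", "no", "no")
)

# Proposed data set name and variable names from the suggestions above.
CMUstudents <- Cellphones
names(CMUstudents)[names(CMUstudents) == "Math"]   <- "satm"
names(CMUstudents)[names(CMUstudents) == "Verbal"] <- "satv"
names(CMUstudents) <- tolower(names(CMUstudents))

# ASSUMPTION: Year codes 1-4 mean class year; these labels are made up
# for illustration and are not taken from the OLI documentation.
year_labels <- c("first-year", "sophomore", "junior", "senior")
CMUstudents$year <- factor(CMUstudents$year, levels = 1:4, labels = year_labels)

# Store the yes/no variables as factors rather than character strings.
CMUstudents$cell <- factor(CMUstudents$cell, levels = c("no", "yes"))
CMUstudents$veg  <- factor(CMUstudents$veg,  levels = c("no", "yes"))

str(CMUstudents)
```

The same names()-based renaming could be used to give sex/gender a single name across data sets once a convention is chosen.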
I'm all for good quality data, and in general I agree with you that fewer high quality data sets are much better than lots of so-so data sets.
If they can't part with some of their weaker data sets, that bothers me a little less, since we can always just ignore them. So I'd first push to clean up the good data sets and to make things as systematic and clean as possible. If we can also get rid of data sets that aren't so great, all the better.
I wonder if some of their examples can be redone either using others of their data sets or some of the data sets already in mosaic. If you find topics that really need better data sets, put out a call and perhaps we can locate something that works.
Can we close this since we have pulled these data sets from the package?
While I'm somewhat hesitant to pull the datasets completely (I've rekindled interest in fixing this for real with Candace, and the removal will be abrupt: we haven't released a version of the package with a note that these would be deprecated), I concur that these weren't ideal for several reasons. So closing the issue is okay on my end.
Nick
On Jun 22, 2013, at 12:29 AM, Randall Pruim notifications@github.com wrote:
Can we close this since we have pulled these data sets from the package?
The OLI datasets are gone: may they RIP.