Clean OLI data sets - Githubissues

rpruim commented 12 years ago

Pending the results of Nick's discussions with the folks from CMU, we should do some data cleaning in the OLI data sets. Perhaps we can use this issue to keep track of things we want to do.

Here are some starting suggestions based on looking at the head() of each dataset briefly (and sometimes glancing at the documentation):

rename variables using the lower case convention from the rest of the package
rename Actor to Actors
rename either the Friends data frame or its Friends variable (in general, I think it is confusing when there is a variable with the same name as the data set)
rename the Height data set (there is more than height in there and height is one of the variables -- a double whammy)
rename Population (and posthumously give it an award for stupidest data name ever?)
rename Ratings (too generic)
add additional variables to data sets, when possible (e.g., documentation for sleep says it is part of a larger study; if more data are available it would be nice to get them. Within reason, more variables don't hurt even if you want to do univariate stuff)
reshape Sleep2 so that rows correspond to observational units. variables might be type (of student) and sleep
rename Time (perhaps ExerciseTime) -- it would be nice if there were additional variables in this data set.

Feel free to add other things as you come across them.
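
For the lower-case renaming item in the list above, a minimal sketch in base R might look like the following. The data frame `Example` and its column names are hypothetical stand-ins, not one of the actual OLI data sets.

```r
# Hypothetical stand-in for an OLI data frame; names and values are made up
# for illustration only.
Example <- data.frame(
  Height = c(64, 70),
  Gender = c("F", "M"),
  stringsAsFactors = FALSE
)

# Convert every variable name to lower case, matching the convention used
# in the rest of the package.
names(Example) <- tolower(names(Example))

names(Example)
```

The same one-line `tolower()` call could be applied to each data set in turn; renaming a data set itself would be a matter of assigning it to a new name and updating its documentation.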

nicholasjhorton commented 12 years ago

For those keeping score at home, here's my summary of responses to your excellent review of things to consider for the CMU OLI datasets:

On Jun 29, 2012, at 9:51 PM, Randall Pruim wrote:

Pending the results of Nick's discussions with the folks from CMU, we should do some data cleaning in the OLI data sets. Perhaps we can use this issue to keep track of things we want to do.

Here are some starting suggestions based on looking at the head() of each dataset briefly (and sometimes glancing at the documentation):

rename variables using the lower case convention from the rest of the package

done for the most egregious case (EPA)

rename Actor to Actors

done

rename either the Friends data frame or its Friends variable (in general, I think it is confusing when there is a variable with the same name as the data set)

done

rename the Height data set (there is more than height in there and height is one of the variables -- a double whammy)

done

rename Population (and posthumously give it an award for stupidest data name ever?)

not done (though I'll second the nomination)

rename Ratings (too generic)

not done (though I see your point)

add additional variables to data sets, when possible (e.g., documentation for sleep says it is part of a larger study; if more data are available it would be nice to get them. Within reason, more variables don't hurt even if you want to do univariate stuff)

we can add this if we get it (but at present, they provide almost no context and background)

reshape Sleep2 so that rows correspond to observational units. variables might be type (of student) and sleep

done

rename Time (perhaps ExerciseTime) -- it would be nice if there were additional variables in this data set.

not done

Feel free to add other things as you come across them.
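
To illustrate the Sleep2 reshaping item above, here is a rough sketch in base R. It assumes the original layout had one column of sleep values per type of student; that layout, the column names, and the values are all assumptions, since the thread does not show the data.

```r
# ASSUMPTION: a wide layout with one column per student type.
# Column names and values are invented for illustration.
Sleep2wide <- data.frame(
  undergrad = c(7.0, 6.5, 8.0),
  grad      = c(6.0, 7.5, 7.0)
)

# stack() produces one row per observation, with the values in one column
# and the originating column name in another.
Sleep2long <- stack(Sleep2wide)

# Use the variable names suggested in the issue: sleep and type.
names(Sleep2long) <- c("sleep", "type")

Sleep2long
```

Each row of `Sleep2long` then corresponds to one student, which is the "rows correspond to observational units" layout requested.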

Reply to this email directly or view it on GitHub: https://github.com/rpruim/mosaic/issues/121

Nicholas Horton Department of Mathematics and Statistics, Smith College Clark Science Center, Northampton, MA 01063-0001 http://www.math.smith.edu/~nhorton

rpruim commented 12 years ago

Just recording some thoughts for potential future cleaning activities:

Cellphones is perhaps a strange name for this data set. Perhaps CMUstudents? What do you think about satm and satv for new names for the first two variables?

> head(Cellphones)
  Math Verbal Credits Year Exer Sleep Cell  Veg
1  640    470      15    1   60   7.0  yes   no
2  660    650      14    1   20   7.5  yes   no
3  550    580      15    2    0   9.0   no   no

It is strange that Computers has no variable for satisfaction, since that was the context in which the data were collected.
I would like to see us recode all categorical variables represented as integers. That's so 80s.
There are some variables (sex/gender in particular) that appear in multiple data sets with different names. It would be good to smooth that out.
I would change the name of Olympics to Olympics1500 since the data are about a single race format.
I haven't looked at all of the examples for the data sets, but it would be nice to include some good example usage for these (once we have them all cleaned up and ready to go).
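
A few of the suggestions above (the new data set name, satm and satv, and recoding integer-coded categorical variables) can be sketched in base R as follows. The data frame is rebuilt from the three rows of head(Cellphones) output shown above. Treating Year as a categorical class year with codes 1 to 4, and the labels used for those codes, are assumptions; the thread gives no codebook.

```r
# The three rows shown in the head(Cellphones) output above.
Cellphones <- data.frame(
  Math    = c(640, 660, 550),
  Verbal  = c(470, 650, 580),
  Credits = c(15, 14, 15),
  Year    = c(1, 1, 2),
  Exer    = c(60, 20, 0),
  Sleep   = c(7.0, 7.5, 9.0),
  Cell    = c("yes", "yes", "no"),
  Veg     = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

# Proposed data set name, lower-case variable names, and satm/satv for the
# first two variables.
CMUstudents <- Cellphones
names(CMUstudents) <- tolower(names(CMUstudents))
names(CMUstudents)[names(CMUstudents) == "math"]   <- "satm"
names(CMUstudents)[names(CMUstudents) == "verbal"] <- "satv"

# Recode an integer-coded categorical variable as a labeled factor.
# ASSUMPTION: codes 1-4 mean first-year through senior (no codebook given).
CMUstudents$year <- factor(
  CMUstudents$year,
  levels = 1:4,
  labels = c("first-year", "sophomore", "junior", "senior")
)

str(CMUstudents)
```

Using the same variable name (for example `sex`) and the same factor labels in every data set would address the inconsistency noted above.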

nicholasjhorton commented 12 years ago

I agree with many of these ideas.

More generally, I wonder about encouraging them to focus on a smaller set of better datasets which lend themselves to more than one question. The redundancy of Sleep and Sleep2, as well as the single-variable datasets, particularly jump out as suboptimal.

Just my $0.02,

Nick

rpruim commented 12 years ago

I'm all for good quality data, and in general I agree with you that fewer high quality data sets are much better than lots of so-so data sets.

If they can't part with some of their weaker data sets, that bothers me a little less, since we can always just ignore them. So I'd first push to clean up the good data sets and to make things as systematic and clean as possible. If we can also get rid of data sets that aren't so great, all the better.

I wonder if some of their examples can be redone either using others of their data sets or some of the data sets already in mosaic. If you find topics that really need better data sets, put out a call and perhaps we can locate something that works.

rpruim commented 11 years ago

Can we close this since we have pulled these data sets from the package?

nicholasjhorton commented 11 years ago

While I'm somewhat hesitant to completely pull the datasets (I've rekindled interest in fixing this for real with Candace, and it will be abrupt: we haven't released the package with a note that these would be deprecated), I concur that these weren't ideal for several reasons. So closing the issue is okay on my end.

Nick

nicholasjhorton commented 10 years ago

The OLI datasets are gone: may they RIP.

ProjectMOSAIC / mosaic

Clean OLI data sets #121