ProjectMOSAIC / mosaicData

R package with Project MOSAIC datasets

Gestation data: only one value present for sex #11

homerhanumat closed 8 years ago

homerhanumat commented 8 years ago

Hi,

In the man page for the Gestation data frame, the variable sex is listed as having two values, but only one (1 for males) appears in the data itself. Is this intended?

rpruim commented 8 years ago

The source file for this data set appears to be http://www.stat.berkeley.edu/~statlabs/data/babies23.data and has the same issue.

> read.file("http://www.stat.berkeley.edu/~statlabs/data/babies23.data") -> G
Reading data with read.table()
> dim(G)
[1] 1236   23
> dim(Gestation)
[1] 1236   23
> tally( ~sex, data = G)

   1 
1236 
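
The same check can be run over every column at once. This is a minimal base R sketch on a toy data frame; `toy` and its values are invented for illustration and are not taken from Gestation.

```r
# Toy stand-in for the real data; all values are invented for illustration.
toy <- data.frame(
  sex = c(1, 1, 1, 1),
  wt  = c(120, 113, 128, 108),
  id  = c(15, 20, 58, 61)
)

# Count the distinct values in each column.
n_distinct <- vapply(toy, function(x) length(unique(x)), integer(1))

# Names of the columns that take only a single value.
constant_cols <- names(n_distinct)[n_distinct == 1]
print(constant_cols)
```

Swapping `toy` for `G` or `Gestation` should list any other variables that never vary.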

@nicholasjhorton, do you know anything more about this data set?

rpruim commented 8 years ago

While we are here, we should probably recode the data. I see lots of integer codes. As is, we really aren't doing anything that couldn't be done by loading the data from the URL. Only one variable has been converted to a factor, and the conversion is less than ideal:

tally( ~drace, data = Gestation)

##    0    1   10    2    3    4    5    6    7    8 NANA <NA> 
##  489   42   10   45   77   50  157   41  250   39    5   31 
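
One possible way to recode an integer-coded variable so that the levels stay in numeric order and a sentinel code becomes a true missing value is sketched below. The vector `codes` and the choice of 99 as an "unknown" code are assumptions for illustration only; the real codes and labels would have to come from the data set's codebook.

```r
# Hypothetical integer-coded variable; 99 is ASSUMED to mean "unknown".
codes <- c(0, 7, 10, 2, 99, 7, 0)

# Turn the assumed sentinel into a real missing value before converting.
codes[codes == 99] <- NA

# Explicit levels keep numeric order (0, 2, 7, 10) instead of the
# alphabetical order ("0", "10", "2", "7") that results when the codes
# are treated as text. sort() drops the NA, so it does not become a level.
f <- factor(codes, levels = sort(unique(codes)))

print(table(f, useNA = "ifany"))
```

Passing a `labels` argument to `factor()` would then attach descriptive names in place of the bare codes.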

nicholasjhorton commented 8 years ago

StatLabs page 198 notes that this dataset covers 1,236 male single births where the baby lived at least 28 days. So we might want to note that these are male births (and drop the sex variable). I've done this: see https://github.com/ProjectMOSAIC/mosaicData/commit/cc0c40c1f729052c3bebe6b6a7474c45bdcda861

rpruim commented 8 years ago

Perhaps keeping the sex variable is a good thing just to make it abundantly clear that this is a data set including only males. (Updating documentation is also good.)