ProjectMOSAIC / mosaic

Project MOSAIC R package
http://mosaic-web.org/

cor(), cov() and missing data #675

Closed dtkaplan closed 7 years ago

dtkaplan commented 7 years ago

I had occasion to calculate a correlation coefficient (yuck!). The documentation for mosaic::cor() doesn't give any hint about what to do when there is missing data. The obvious thing, using na.rm = TRUE, doesn't work. Try this:

Min_data <- data.frame(
  x = c(1,2,3,4),
  y = c(1,3,2,NA)
)

mosaic::cor(y ~ x, data = Min_data)
mosaic::cor(y ~ x, data = Min_data, na.rm = TRUE)
mosaic::cor(y ~ x, data = Min_data, use = "pairwise")
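For comparison, here is a small sketch of how the underlying stats::cor() treats the same data. It only illustrates base R behavior (the variable names r_default, r_pairwise and r_na_rm are made up for the example) and says nothing about what mosaic::cor() itself returns.

```r
Min_data <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(1, 3, 2, NA)
)

# Default use = "everything": any NA in the inputs propagates to the result.
r_default <- stats::cor(Min_data$x, Min_data$y)

# Drop incomplete pairs; "pairwise" partially matches "pairwise.complete.obs".
r_pairwise <- stats::cor(Min_data$x, Min_data$y, use = "pairwise")

# stats::cor() has no na.rm argument and no `...`, so passing one is an error.
# The error message is captured as a string instead of stopping the script.
r_na_rm <- tryCatch(
  stats::cor(Min_data$x, Min_data$y, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
```

With the incomplete fourth pair dropped, the correlation is computed from the first three rows only.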

Two solutions I can see:

  1. Add a default use argument to mosaic::cor (and mosaic::cov). I think this would violate the unity of the aggregating functions, and who would remember to use use unless we set its default value to use = ifelse(na.rm, "pairwise", "all")?
  2. Just pass use = ifelse(na.rm, "pairwise", "all") in the calls to stats::cor (and stats::cov) that mosaic::cor and mosaic::cov make.
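
A rough sketch of what option 2 might look like, written as a standalone wrapper around stats::cor() rather than as the actual mosaic code. The name cor_na_rm is hypothetical, and the formula interface is left out to keep it short.

```r
# Hypothetical helper, not part of mosaic: translate na.rm into the `use`
# argument of stats::cor(), using the mapping proposed above.
cor_na_rm <- function(x, y, na.rm = FALSE, ...) {
  # "pairwise" partially matches "pairwise.complete.obs";
  # "all" partially matches "all.obs".
  use <- ifelse(na.rm, "pairwise", "all")
  stats::cor(x, y, use = use, ...)
}

x <- c(1, 2, 3, 4)
y <- c(1, 3, 2, NA)

# Incomplete pairs are dropped before computing the correlation.
r_removed <- cor_na_rm(x, y, na.rm = TRUE)

# With use = "all.obs", stats::cor() signals an error when NAs are present,
# so the condition is captured here instead of stopping the script.
r_kept <- tryCatch(cor_na_rm(x, y), error = function(e) e)
```

One thing to note about this mapping: "all.obs" raises an error on missing data instead of returning NA, whereas the stats::cor() default, use = "everything", returns NA.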

nicholasjhorton commented 7 years ago

Can an example be added into the documentation to help guide the user?

rpruim commented 7 years ago

Here is your example:

tally(is.na(mcs) ~ is.na(pcs), data = HELPmiss)
##           is.na(pcs)
## is.na(mcs) TRUE FALSE
##      TRUE     2     0
##      FALSE    0   468
cov(mcs ~ pcs, data = HELPmiss)             # NA because of missing data
## [1] NA
cov(mcs ~ pcs, data = HELPmiss, use = "complete")  # ignore missing data
## [1] 13.46433
# alternative approach using filter explicitly
cov(mcs ~ pcs, data = HELPmiss %>% filter(!is.na(mcs) & !is.na(pcs)))    
## [1] 13.46433