Survey data into ggplot2

tslumley commented 7 years ago

Either for its own sake or as an example of how to do it, develop add-on to ggplot2 to allow weighted survey data.

Data from complex multistage surveys comes with weights and clusters that are often relevant for display, whether directly (opacity, hexbinning) or in stat() and stat_smooth() aesthetics. I don't know from ggplot2, so it would be good to work with people who do.

MilesMcBain commented 7 years ago

Hey Thomas, I'm not familiar with the domain, but the problem sounds interesting. What is the difficulty in mapping this data to ggplot2 aesthetics currently?

Do you have an example visual you would like to shoot for?

tslumley commented 7 years ago

The problem is primarily how to specify the design object as a data argument, but there's also the secondary issue that weights and clusters should ideally be provided to geoms.

Consider these two from ?stat_smooth

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()

and

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3), se = FALSE)

A survey design object is not a data frame and doesn't inherit from data.frame: it contains a data frame and also contains a bunch of metadata including weights and cluster identifiers.

In a perfect world, I'd be able to do

ggplot(nhanes_des, aes(AGE, DBP)) +
  geom_point() +
  geom_smooth()

to get a weighted scatterplot of DBP against age (weighted using alpha channel or using hexbins or using thinning, all as in survey::svyplot), with a weighted smoother.

I could get the data into ggplot() by supplying the original data frame, but that will only work if no transformations have been done on the survey object (including to construct the metadata). Also, the survey package functions for smoothing, fitting lines, and so on expect their data as survey design object, not a data frame.

The simplest solution may be a set of adapters that convert to data frames going into ggplot and to design objects coming out?

jonocarroll commented 7 years ago

This is an interesting and exciting idea. I think there are at least a couple of potential approaches:

You could create your own ggweighted functions that take in a non-data.frame object and perform the required plotting steps, wrapping ggplot() + geom_* calls with some logic depending on what is required.
You could build your own geom_* functions but I suspect you'll need someone very familiar and comfortable with ggplot2 internals/extensions to get it working (I may be wrong, maybe it's easy). Have a look at the ggforce package where ggproto is used extensively. The extension framework is there, but you'll need someone who knows how it works. For example, sf objects are now supported in ggplot2 with further extension.
You could extend geom_smooth (or whichever geom_) to automatically "just work" when the main data is a weighted object. This may be best suited as a PR to ggplot2 itself or renamed into its own package.
As you suggest, do the required processing in a new function to prepare the data into a tidy format ready for ggplot().

I'm keen to hear other takes on this too.

MilesMcBain commented 7 years ago

ggproto etc is not too difficult to get to grips with. I managed to hack together a geom for Nick's naniar.

There's a further option which is inheriting from geom_smooth and just overloading the method that does data preprocessing. I'd need to clear this with the expert, but a hacky way to achieve the weighted smoother might just to replicate the data points proportionally to the weights and use the regular smoothing algorithm.

fBedecarrats commented 7 years ago

Hello, I often work with survey data, with the survey package (huge thanks to Thomas Lumley for this great tool!) and use ggplot2 to plot results. I developped my own functions to streamline this process. Sorry if it's a bit lengthy, but I had no need to optimize. Here is an example of generic function to produce a barplot with the result of the svyby function:

svyby_barplot <- function(x, by, design){
    # Loads required packages if not already loaded  
    library(labelled)
    library(reshape2)
    library(ggplot2)
    library(survey)
    # Extracts variable names to insert them as legend in the plot
    var_x <- paste(deparse(substitute(design)), "$variables$", x[2], sep = "")
    lab_x <- var_label(eval(parse(text = var_x)))
    lab_leg <- paste(deparse(substitute(design)), "$variables$", by[2], sep = "")
    lab_leg <- var_label(eval(parse(text = lab_leg)))
    # Uses the survey package to compute estimates and confidence intervals 
    byout <- svyby(x, by, design, FUN=svymean, na.rm=TRUE, vartype="ci")
    ## Fixing an apparent bug of survey package that pastes variable name and modality name in the svyby output
    # Counting the number of character to delete
    y <- as.numeric(nchar(x)[2])
    #prepares a table with the right format for ggplot2
    byout[,-1] <- byout[,-1]*100
    col_est <- (ncol(byout)-1)/3
    estimates <- melt(byout[,1:col_est+1])
    low_ci <- melt(byout[,(col_est+1):(col_est*2)+1])
    high_ci <- melt(byout[,(col_est*2+1):(col_est*3)+1])
    out <- cbind(estimates, low_ci[,2], high_ci[,2])
    out$by <- byout[,1]
    colnames(out) <- c("cat", "est", "lci", "hci", "by")
   # if y, take out the variable name
    if(y>0){
     out[,1] <- substring(out[,1], y+1)
   }  
  # si, factorisée, on réintègre les facteurs de la variable d'origine 
  # pour afficher les résultats dans le même ordre
  if (!is.null(levels(eval(parse(text = var_x))))) {
    out$cat <- factor(out$cat, levels(eval(parse(text = var_x)))) 
  }

  # Produces the graph
  graph <- ggplot(data=out, aes(x=cat, y=est, fill=by))
  graph + theme(panel.border = element_blank(),
                panel.grid.major = element_blank()) + 
    geom_bar(stat="identity", position=position_dodge()) +
    geom_errorbar(aes(ymin=lci, ymax=hci), width=.1, position=position_dodge(0.9) ) +
    geom_point(position=position_dodge(0.9)) +
    scale_y_continuous(limits = c(-0.5,100)) +
    theme (axis.text.x = element_text(angle=90, vjust=0.5)) +
    scale_fill_discrete(name=lab_leg) +
#    labs(title = svy_lab) +
    xlab(lab_x) +
    ylab("%")

}

The DHS program (one of the main technical assistance programs to statistical offices in developing countries) provides example datasets of health survey data at this page: http://dhsprogram.com/data/Download-Model-Datasets.cfm Here is a demo of the previous function produced from the household dataset (zzhr62dt.zip).

library(haven)
library(survey)
dhs_demo <- as_factor(read_dta("ZZHR62FL.DTA"))
dhs_design <- svydesign(ids=~hv021+hv002, strata=~hv023, weights=~hv005, data=dhs_demo)
svyby_barplot(~hv201, by=~hv025, design=dhs_design)

It produces the following graph: rplot

I use similar functions to produce maps or more readable tables to print with knitr/RMarkdown.

I'm not sure that it helps, but in any case... Best regards

GuoQi0901 commented 5 years ago

seems useful. Thank you!

AU-BURGr / UnConf2017

Survey data into ggplot2 #6