Open tslumley opened 7 years ago
Hey Thomas,
I'm not familiar with the domain, but the problem sounds interesting. What is the difficulty in mapping this data to ggplot2
aesthetics currently?
Do you have an example visual you would like to shoot for?
The problem is primarily how to specify the design object as a data argument, but there's also the secondary issue that weights and clusters should ideally be provided to geoms.
Consider these two from ?stat_smooth
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
and
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3), se = FALSE)
A survey design object is not a data frame and doesn't inherit from data.frame: it contains a data frame and also contains a bunch of metadata including weights and cluster identifiers.
In a perfect world, I'd be able to do
ggplot(nhanes_des, aes(AGE, DBP)) +
geom_point() +
geom_smooth()
to get a weighted scatterplot of DBP against age (weighted using alpha channel or using hexbins or using thinning, all as in survey::svyplot), with a weighted smoother.
I could get the data into ggplot() by supplying the original data frame, but that will only work if no transformations have been done on the survey object (including to construct the metadata). Also, the survey package functions for smoothing, fitting lines, and so on expect their data as survey design object, not a data frame.
The simplest solution may be a set of adapters that convert to data frames going into ggplot and to design objects coming out?
This is an interesting and exciting idea. I think there are at least a couple of potential approaches:
You could create your own ggweighted
functions that take in a non-data.frame
object and perform the required plotting steps, wrapping ggplot() + geom_*
calls with some logic depending on what is required.
You could build your own geom_*
functions but I suspect you'll need someone very familiar and comfortable with ggplot2
internals/extensions to get it working (I may be wrong, maybe it's easy). Have a look at the ggforce
package where ggproto
is used extensively. The extension framework is there, but you'll need someone who knows how it works. For example, sf
objects are now supported in ggplot2
with further extension.
You could extend geom_smooth
(or whichever geom_
) to automatically "just work" when the main data is a weighted object. This may be best suited as a PR to ggplot2
itself or renamed into its own package.
As you suggest, do the required processing in a new function to prepare the data into a tidy format ready for ggplot()
.
I'm keen to hear other takes on this too.
ggproto
etc is not too difficult to get to grips with. I managed to hack together a geom for Nick's naniar
.
There's a further option which is inheriting from geom_smooth
and just overloading the method that does data preprocessing. I'd need to clear this with the expert, but a hacky way to achieve the weighted smoother might just to replicate the data points proportionally to the weights and use the regular smoothing algorithm.
Hello, I often work with survey data, with the survey
package (huge thanks to Thomas Lumley for this great tool!) and use ggplot2 to plot results. I developped my own functions to streamline this process. Sorry if it's a bit lengthy, but I had no need to optimize. Here is an example of generic function to produce a barplot with the result of the svyby function:
svyby_barplot <- function(x, by, design){
# Loads required packages if not already loaded
library(labelled)
library(reshape2)
library(ggplot2)
library(survey)
# Extracts variable names to insert them as legend in the plot
var_x <- paste(deparse(substitute(design)), "$variables$", x[2], sep = "")
lab_x <- var_label(eval(parse(text = var_x)))
lab_leg <- paste(deparse(substitute(design)), "$variables$", by[2], sep = "")
lab_leg <- var_label(eval(parse(text = lab_leg)))
# Uses the survey package to compute estimates and confidence intervals
byout <- svyby(x, by, design, FUN=svymean, na.rm=TRUE, vartype="ci")
## Fixing an apparent bug of survey package that pastes variable name and modality name in the svyby output
# Counting the number of character to delete
y <- as.numeric(nchar(x)[2])
#prepares a table with the right format for ggplot2
byout[,-1] <- byout[,-1]*100
col_est <- (ncol(byout)-1)/3
estimates <- melt(byout[,1:col_est+1])
low_ci <- melt(byout[,(col_est+1):(col_est*2)+1])
high_ci <- melt(byout[,(col_est*2+1):(col_est*3)+1])
out <- cbind(estimates, low_ci[,2], high_ci[,2])
out$by <- byout[,1]
colnames(out) <- c("cat", "est", "lci", "hci", "by")
# if y, take out the variable name
if(y>0){
out[,1] <- substring(out[,1], y+1)
}
# si, factorisée, on réintègre les facteurs de la variable d'origine
# pour afficher les résultats dans le même ordre
if (!is.null(levels(eval(parse(text = var_x))))) {
out$cat <- factor(out$cat, levels(eval(parse(text = var_x))))
}
# Produces the graph
graph <- ggplot(data=out, aes(x=cat, y=est, fill=by))
graph + theme(panel.border = element_blank(),
panel.grid.major = element_blank()) +
geom_bar(stat="identity", position=position_dodge()) +
geom_errorbar(aes(ymin=lci, ymax=hci), width=.1, position=position_dodge(0.9) ) +
geom_point(position=position_dodge(0.9)) +
scale_y_continuous(limits = c(-0.5,100)) +
theme (axis.text.x = element_text(angle=90, vjust=0.5)) +
scale_fill_discrete(name=lab_leg) +
# labs(title = svy_lab) +
xlab(lab_x) +
ylab("%")
}
The DHS program (one of the main technical assistance programs to statistical offices in developing countries) provides example datasets of health survey data at this page: http://dhsprogram.com/data/Download-Model-Datasets.cfm Here is a demo of the previous function produced from the household dataset (zzhr62dt.zip).
library(haven)
library(survey)
dhs_demo <- as_factor(read_dta("ZZHR62FL.DTA"))
dhs_design <- svydesign(ids=~hv021+hv002, strata=~hv023, weights=~hv005, data=dhs_demo)
svyby_barplot(~hv201, by=~hv025, design=dhs_design)
It produces the following graph:
I use similar functions to produce maps or more readable tables to print with knitr/RMarkdown.
I'm not sure that it helps, but in any case... Best regards
seems useful. Thank you!
Either for its own sake or as an example of how to do it, develop add-on to ggplot2 to allow weighted survey data.
Data from complex multistage surveys comes with weights and clusters that are often relevant for display, whether directly (opacity, hexbinning) or in stat() and stat_smooth() aesthetics. I don't know from ggplot2, so it would be good to work with people who do.