IQSS / MatchingFrontier

MatchingFrontier: Computation of the Balance-Sample Size Frontier in Matching Methods for Causal Inference
https://iqss.github.io/MatchingFrontier/

does generateDataset() produce the data sorted by drop.order in the frontier object? #9

Closed: mikedenly closed this issue 8 months ago

mikedenly commented 8 months ago

Dear Noah et al,

Does generateDataset() produce the data sorted in descending order by drop.order in the frontier object? I ask because I am trying to run an ordered multilevel logit model after matching with makeFrontier(), and my code works fine, assuming that the dataset generateDataset() returns for a given N is in the right order. If it is not, I don't know what to do and would very much appreciate your assistance. I looked into the source code for generateDataset() to see if I could figure it out, but it calls internal helpers, such as customStop() and get_calling_function(), that I couldn't locate, so I was unable to modify the code myself.

By the way, I was able to run the different models after makeFrontier by creating tidiers. I will add those tidiers as pull requests now.

Many thanks in advance for your consideration, Mike

mikedenly commented 8 months ago

@ngreifer or @ChristopherLucas, might either of you know the answer?

Many thanks, Mike

ngreifer commented 8 months ago

Hi Mike, I'm so sorry I missed this! generateDataset() uses the same rules as MatchIt::match.data() or MatchIt::get_matches(), depending on the matching type. If you use a pair distance-based frontier, the data will be sorted in order of pair membership (and units will appear multiple times if they are reused), like MatchIt::get_matches(). Otherwise, the data will be in the order it was supplied, with dropped units simply removed, like MatchIt::match.data().

That said, I don't think the order should matter for most modeling functions, and there is nothing wrong with simply re-ordering the output of generateDataset() as you please. The order the observations happen to be in is incidental. Perhaps I am misunderstanding something, because I can't think of an instance in which the order of the data should matter for a downstream analysis. Even if you are merging the matched dataset with some other data containing the outcomes, you can still include a unit ID in the dataset supplied to makeFrontier(), which will remain in the returned dataset.
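For example (a rough sketch; unit_id and outcomes_df are placeholder names, not anything the package requires):

# include a unit ID before building the frontier; extra columns in the
# supplied data are carried through to the returned dataset
df$unit_id <- seq_len(nrow(df))
fr <- makeFrontier(binary_outcome ~ binary_treatment + other_variable,
                   data = df, QOI = 'SATT', metric = 'L1')

d <- generateDataset(fr, N = 300)           # 300 is an arbitrary point on the frontier
d <- merge(d, outcomes_df, by = "unit_id")  # merge on the ID; row order is irrelevant
d <- d[order(d$unit_id), ]                  # or re-order however you like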

Does this help? If not, please be more specific about what you plan to do after matching (i.e., what modeling function you want to run) and why the order of the dataset is so important.

mikedenly commented 8 months ago

Hi Noah, no problem at all about the delay, and thanks so much for your helpful reply!

My sense--and please correct me and accept my apologies if I am wrong!--is that having generateDataset() sort by the drop order/balance statistic value matters a lot if we want to use the frontier for something other than the stock linear regression included in the package. First, we want to know the treatment effect at different levels of treatment/control positivity, which is really necessary for causal inference (see the Hernán and Robins book; Petersen et al. in the van der Laan and Rose book on Targeted Learning). From both internal validity and external validity perspectives, we also want to be able to compare the treatment effect where balance is maximized vs. regions where balance is less optimal. Although pruning observations helps with treatment/control positivity and internal validity, there is an external validity tradeoff if we prune too many observations and the sample ends up looking nothing like the target population. Your paper with Liz Stuart touches on this sample/target population positivity challenge in matching. For all of these reasons, and given that I work on external validity, using MatchingFrontier is my favorite way to do matching.

Now back to the challenge at hand: ensuring that generateDataset() does not drop control observations at random, or in any order other than by the (im)balance statistic value from makeFrontier(), which should correspond to drop.order. If anything else is happening inside generateDataset(), I can't be sure that I have included the right observations in my model when I drop them one at a time to mimic the frontier with a model other than the lm() already included in the package.
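A quick way to check what I mean (a rough sketch, using the frontier_final object from the code below; it assumes generateDataset() preserves row names the way MatchIt::match.data() does):

# adjacent points on the frontier should differ by exactly one control unit,
# and that unit should be the next entry in drop.order
d_N  <- generateDataset(frontier_final, N = 500)  # 500 is arbitrary
d_N1 <- generateDataset(frontier_final, N = 499)
setdiff(rownames(d_N), rownames(d_N1))  # should be a single control unit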

To make this issue more concrete, kindly refer to the code below. It mimics--or at least attempts to mimic--estimateEffects() using a doubly robust multilevel logit model with a random effect on the country variable, and then generates the relevant average marginal effects. For the moment, let's leave aside that: (a) the model below does not give a true treatment effect; (b) the package default would be fine for a singly robust nonparametric treatment effect that is comparatively more useful than what I generate below; and (c) manually mimicking estimateEffects() will take hours to run if your dataset is large.

You will also note four steps in my process: (1) generate the frontier with makeFrontier(); (2) generate the extraction sequence based on the summary information in the frontier; (3) perform the estimation for each sample size in the frontier; and (4) plot the results, using the tidiers that I left in the other open issue for this package (given that the package does not allow for pull requests).

# packages used below
library(MatchingFrontier)
library(lme4)       # glmer()
library(ggeffects)  # ggpredict()
library(dplyr)      # mutate(), bind_rows()
library(purrr)      # map()
library(ggplot2)

# 1) generate the frontier
frontier_final <- makeFrontier(binary_outcome ~ binary_treatment + other_variable,
                               data = df,
                               QOI = 'SATT',
                               metric = 'L1',
                               verbose = TRUE)

# 2) generate the extraction sequence based on a summary of the frontier
frontier_info <- summary(frontier_final)
# sample sizes to walk along the frontier, from just below the largest N
# down to the smallest N reported by summary()
extract_sequence <- (frontier_info[[2]][[1]] - 1):frontier_info[[2]][[2]]

# 3) calculate the marginal effects for each sample size
fit_and_calculate_me <- function(N) {
  cat("Processing N =", N, "\n")
  dataGen <- generateDataset(frontier_final, N = N)
  tryCatch({
    model <- glmer(binary_outcome ~ binary_treatment + other_variable + (1 | Country),
                   data = dataGen, family = binomial(link = "logit"),
                   control = glmerControl(optimizer = "bobyqa"))
    ggeffects_result <- ggeffects::ggpredict(model, terms = "binary_treatment")
    as.data.frame(ggeffects_result) %>%
      mutate(obs = N)
  }, error = function(e) {
    cat("Error with N =", N, ":", e$message, "\n")
    data.frame(obs = N, error = "Model fitting failed",
               ErrorMessage = as.character(e$message))
  })
}
datalist_me <- map(extract_sequence, fit_and_calculate_me)
tidy_datalist_me <- bind_rows(datalist_me)

# 4) plot the results, with help from my tidiers in the other open issue for
#    this package (tidy_estimates below comes from those tidiers)
# note: ggpredict() stores the focal term's levels in a column named x
ggplot(tidy_datalist_me, aes(x = obs, y = predicted, color = as.factor(x))) +
  geom_line(size = 1) +
  labs(title = "Average Marginal Effects",
       x = "Number of Control Units Dropped",
       y = "Predicted",
       color = "Treatment Level") +
  # dashed line where the balance statistic is minimized
  geom_vline(xintercept = tidy_estimates$obs[tidy_estimates$stat == min(tidy_estimates$stat)],
             linetype = "dashed") +
  theme_bw() +
  theme(axis.text.x = element_text(hjust = 1),
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(5.5, 5.5, 5.5, 5.5, "pt")) +
  scale_color_manual(values = c("0" = "grey", "1" = "black"),
                     labels = c("treatment=0", "treatment=1")) +
  scale_x_continuous(expand = expansion(mult = c(0, 0)),
                     limits = c(min(tidy_datalist_me$obs) - 0.5,
                                max(tidy_datalist_me$obs) + 0.5))

As you can see, my code runs and I am able to get a useful plot. However, I am just not sure that my x-axis is in the right order because: (i) I don't know whether generateDataset() is doing what I want it to do; and (ii) I don't know whether my invented extract_sequence is sufficient for overcoming any potential issues with generateDataset(). Also, I created tidiers to augment the plot with a vertical line capturing where summary(frontier_final) indicates that balance is maximized for causal inference purposes. But again, I don't know if my vertical line is correct, because I don't know if generateDataset() is giving me observations in the right order.

Does the above make sense? If not, I'd be happy to send you my data and code -- though I'd ask you not to share it with others, given that the project is far from finalized. In any case, many thanks for reading this long thread and for your truly excellent work here, on Stack Exchange, and in all of your incredibly useful packages!

ngreifer commented 8 months ago

Hi Mike,

I'm sorry, I totally misunderstood your original question. generateDataset() does indeed subset to just the units remaining at the given point on the frontier; it doesn't drop observations at random. Running generateDataset() at each point on the frontier and fitting a linear regression model for the treatment effect is exactly the same as using estimateEffects(), which you should be able to verify yourself using your function. Anything else would defeat the whole purpose of the function, which you clearly understand (and I would never let that happen).
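To verify this yourself, something like the following should trace out the same estimates as estimateEffects() with the default lm() fit (a sketch reusing your frontier_final and extract_sequence from above; the exact formula should match whatever you pass to estimateEffects()):

# fit the linear model at each point on the frontier and collect the
# treatment coefficients
manual_ests <- sapply(extract_sequence, function(N) {
  d <- generateDataset(frontier_final, N = N)
  coef(lm(binary_outcome ~ binary_treatment, data = d))["binary_treatment"]
})
# plot manual_ests against extract_sequence and compare with estimateEffects()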

BTW, my (development) version of MatchingFrontier supports logistic regression for binary outcomes, too, and uses g-computation instead of just reporting the treatment coefficient. That's still more restrictive than allowing an arbitrary model, but we are aware that lm() isn't the only game in town. I also do accept pull requests on my version, but I don't think I would accept your proposals, for a few reasons (though I think it is great that you wrote those functions, and I think others will find them useful!). I agree that some of the syntax, code, argument names, and outputs are a bit clunky; this is because I took over the package from others and wanted to respect the original interface, even if there may have been better ways to store the output, etc. Were I to write this package from scratch, it might look totally different (though, as written, it is at least possible to recreate all outputs exactly by hand, even if that takes some work). One of the great things about R is that users can write their own helper functions, as you did, and share them with others. (For those reading, see #10 for those functions.)

FYI, it is possible to request black-and-white plots of estimateEffects() output by using the band.col and band.border.col arguments to plot().
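For example (assuming ests holds the output of estimateEffects(); the colors are just illustrative):

# grey band with a black border instead of the default colors
plot(ests, band.col = "grey80", band.border.col = "black")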

Thank you for the kind words! I'm glad someone is using this package; I think it has great potential, and I would devote more time to making it even better if I could. I would love to see your final results when you have a draft of the paper ready to view. And if you're ever interested in testing out some fancy new matching methods, let me know :) And of course let me know if there is anything else I can help you with.

mikedenly commented 8 months ago

Thank you, Noah, for your super helpful response and engagement! That's very cool that you are adding support for the other models. To be honest, it took me an embarrassing amount of time to figure out how to create functions for other types of models, especially the multilevel ordered models due to the ordinal package's limited ability to create marginal effects for multilevel structures. However, I am sure an expert like you could have done it in 1/10 the time.

That's totally fine not to accept my tidiers, too -- I completely understand! As you mention, others may copy and paste them from the other issue if they wish. I do think that my vertical line denoting where balance is maximized in estimateEffects() might be something to adopt for your next release, though, as I think it captures one of the main benefits of this incredible package. Speaking of releases, is there a particular reason why this package never made it back to CRAN?

In any case, I just wanted to reiterate what amazing work you are doing for this package and your other packages.

Thank you again and have a great day, Mike

ngreifer commented 8 months ago

Thank you again for the kind words :)

One reason I am hesitant to include the vertical line is that the point where balance is optimal is not necessarily the only point along the frontier at which the effect estimate should be interpreted. For pair distance-based frontiers, for example, the optimal balance is at the very smallest sample size! Even for other frontier types, the spot where an author might want to focus a reader's attention may not be the point with the best balance as measured by the optimizing statistic alone, but rather a point chosen using the other balance statistics available when plotting the balance-sample size frontier or at a given sample size. For example, in an analysis I ran using this method, I chose the point on the energy distance frontier with the largest sample size such that all SMDs were below .05, which was not necessarily the spot where the energy distance was smallest, since that point discarded too many units for little improvement in balance.

The user can, of course, narrow the width of the effect estimates plot or add their own vertical line anywhere using geom_vline() combined with the output from summary() to select the point on the frontier where the reader's eyes should be focused. Because there are so many ways to use the method and package, I don't want to add too much functionality that nudges users in a specific direction; a user needs a strong understanding of the method to be able to use it well. But I strongly encourage sharing user-written tools to facilitate specific use cases! That's what R is all about.
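For instance (a sketch; p is assumed to be an effects plot like the one you built above, and chosen_N is whatever point you decide to highlight after inspecting summary()):

chosen_N <- 450  # hypothetical choice based on the summary() output
p + geom_vline(xintercept = chosen_N, linetype = "dashed")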

I really should put it on CRAN, and I don't know why I haven't. It has been a low priority because I have so many other packages to work on at the moment. Also, though I consider myself the primary developer of this version of the package, I feel I would need consent from all coauthors to publish anything (including on CRAN), which I also haven't gotten around to. Some new features of the package are experimental, and the other coauthors and original developers might not want their names on a method that has not been thoroughly tested. But I do have plans both to get this on CRAN and to write a paper about the package (perhaps for JSS or the R Journal). Those plans may be a bit far off given my other obligations, but your interest in the package has reminded me that there may still be use in continuing to work on it.

mikedenly commented 8 months ago

Thanks for the super helpful information on why you chose not to include the vertical line for maximum balance! To be honest, the first time I had seen the energy distance was in this package, and it is hard for me to know which balance statistic to choose. Initially, I chose the L1 over the Mahalanobis distance mainly for reasons pertaining to the estimand: the L1 allowed for an SATT, whereas the Mahalanobis distance only allowed for an FSATT. Your summary article explains the different options for balance, but it doesn't tell us how to pick between them. Of course, that may be a deliberate choice, as I can see why you might want to stay away from such a discussion. However, perhaps there is a summary article that you could refer readers to? Along those lines, what you did with SMDs below .05 on the energy distance frontier is really clever, but I don't think many people in the world would have thought of that independently. Maybe that is something for your JSS or R Journal article? In any case, thanks again for being so generous with your time, feel free to reach out if I can be helpful with anything, and I will definitely take you up on your really generous offer to take a look at my paper when it is ready.

ngreifer commented 8 months ago

Yeah, the energy distance is a newer balance statistic, but I am really fascinated by it and have been trying to implement it in my packages where possible. I would definitely recommend trying out an energy distance frontier; in practice, it tends to work really well at balancing covariates. I don't describe how to choose among the metrics because, to me, only the energy distance is worth using as an actual matching method (I see the others as more pedagogical than practical), but I don't want to state that as if it were truth, because the others do have good properties in some cases and were the focus of the original article on the matching frontier. For example, it is true without a doubt that when the Mahalanobis distance is zero, there is no bias due to the covariates, but getting to such a value requires dropping so much of your sample that it is prohibitive. Similarly, it is true that when the L1 statistic is zero with little or no coarsening of the covariates, there is no bias due to the covariates. But getting to zero on the L1 statistic also involves dropping a lot of units, and getting to zero with heavy coarsening doesn't produce great balance. Theory doesn't really have much to say about values between zero and high values for each metric, but unfortunately, the whole point of the package is making a choice along the frontier between those extremes.

I describe a bit about how to choose among the L1 statistic, energy distance, and other measures here in a different package (note that the Mahalanobis distance discussed there differs from the one used here). In the end, I don't have much guidance. Using the more familiar balance statistics as supplementary heuristics can be helpful, I think, which is why I combined the energy distance with SMDs in my application. Note that this is a problem in all matching and weighting methods that use optimization (e.g., twang-style GBM, genetic matching), not just the matching frontier.
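As a sketch of that heuristic (SMDs computed by hand here so it doesn't lean on any particular package's API; your frontier_final and extract_sequence from above stand in for the energy distance frontier I actually used, and SMD conventions vary, e.g., which variance goes in the denominator):

# standardized mean difference for one covariate, pooled-SD version
smd <- function(x, t) {
  (mean(x[t == 1]) - mean(x[t == 0])) /
    sqrt((var(x[t == 1]) + var(x[t == 0])) / 2)
}

# keep the largest N at which the covariate's |SMD| is below .05
# (with several covariates, check all of them at each N)
ok <- sapply(extract_sequence, function(N) {
  d <- generateDataset(frontier_final, N = N)
  abs(smd(d$other_variable, d$binary_treatment)) < .05
})
max(extract_sequence[ok])  # largest sample size meeting the SMD criterion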

The paper where I used the energy distance frontier is under review, and I describe what I did there, so hopefully it gets published and you can see it. In the end, it was the only matching method that allowed us to achieve balance remotely well. Unfortunately, it's another thing I would write more about if I had the time, but I don't get paid to do that.