JunsolKim opened 2 years ago
Would appreciate more explanation of how splitting the testing and training data helps address the "fundamental problem of causal inference with latent variables," or the dependence between units when developing the mapping function with reference to the outcome of treatment. Is it the case that the splitting does not eliminate the dependence, but assures us that the dependence does not matter?
This was an interesting set of chapters. In the chapter on text as a confounding variable, the authors describe different ways of accounting for text as a confounder, given that it is much harder to do so than to account for numeric confounding variables. My question is, for the less computationally inclined social scientists among us, how important is it to truly understand the differences between regression estimators like the Ridge and Lasso methods?
> Would appreciate more explanation of how splitting the testing and training data helps address the "fundamental problem of causal inference with latent variables," or the dependence between units when developing the mapping function with reference to the outcome of treatment. Is it the case that the splitting does not eliminate the dependence, but assures us that the dependence does not matter?
+1! Adding onto that, the book talks about the drawbacks of discovering g by just looking at the documents versus discovering it after splitting the data. When we deal with longitudinal studies, do we slice the data by a given metric and then repeat this process for every subset? Or do we randomly sample across time to maintain a consistent mapping for every window of time? I guess both might make sense for different reasons, but I would like to know what everyone thinks.
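Purely to illustrate the two options being contrasted here, a hypothetical sketch (the DataFrame `df`, its columns, and the function names are assumptions for illustration, not anything from the book): refitting per time window versus one split across all periods.

```python
# Hypothetical illustration of the two splitting strategies discussed above.
# Assumes a pandas DataFrame `df` with columns "text", "year", and "treated".
from sklearn.model_selection import train_test_split

def split_per_window(df):
    # Option A: split (and later rediscover the mapping g) separately
    # within every time window.
    return {year: train_test_split(grp, test_size=0.5, random_state=0)
            for year, grp in df.groupby("year")}

def split_across_time(df):
    # Option B: one random split across all periods, keeping a single mapping g,
    # stratified by year so both halves cover every window.
    return train_test_split(df, test_size=0.5, random_state=0, stratify=df["year"])
```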
> This was an interesting set of chapters. In the chapter on text as a confounding variable, the authors describe different ways of accounting for text as a confounder, given that it is much harder to do so than to account for numeric confounding variables. My question is, for the less computationally inclined social scientists among us, how important is it to truly understand the differences between regression estimators like the Ridge and Lasso methods?
You pretty much don't need to know the math to work with these methods in an applied setting. For Ridge and Lasso, the main thing to know is that they accept a small increase in bias in exchange for a decrease in the model's variance. So in settings where the number of parameters is large, or larger than your sample size, you definitely want to use these methods. Further, choosing Lasso usually means you believe many of the predictors have no real effect (or only a small effect) on the dependent variable, whereas Ridge regression decreases standard errors and works well when many predictors have substantial effects. With this information alone, you can make educated decisions about which type of model to employ even if you never look at the equations themselves.
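To see that behavioral difference without any math, here is a minimal, hypothetical scikit-learn sketch on simulated data (nothing here comes from the book): Lasso sets most irrelevant coefficients exactly to zero, while Ridge keeps all of them but shrinks them.

```python
# Simulated data where only a few of many predictors matter; all settings are
# made up for illustration.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 50                      # more predictors than is comfortable for plain OLS
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]         # only the first 3 predictors truly matter
y = X @ beta + rng.normal(scale=1.0, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks every coefficient toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # pushes many coefficients exactly to zero

print("nonzero coefficients, Ridge:", np.sum(ridge.coef_ != 0))  # typically all 50
print("nonzero coefficients, Lasso:", np.sum(lasso.coef_ != 0))  # typically close to 3
```

The penalty strengths (`alpha`) are arbitrary here; in practice you would tune them by cross-validation.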
@isaduan Just a guess: I think this is to solve the overfitting problem. If you overfit the text, you may extract non-systematic features that only tell you something about these particular texts. Splitting the sample helps validate that we are identifying some "common features."
> Would appreciate more explanation of how splitting the testing and training data helps address the "fundamental problem of causal inference with latent variables," or the dependence between units when developing the mapping function with reference to the outcome of treatment. Is it the case that the splitting does not eliminate the dependence, but assures us that the dependence does not matter?
I think you're right: according to the authors, we get g from the training set, but the causal effect is estimated from the test set, where g "does not depend on the treatment status of any unit." In other words, the dependence would not affect the results, since they are estimated from a separate set of data that has never been used to discover and refine g. My question is, in the revalidation step, what "important information" does a failed test set give us? How can we improve based on this information, given that we cannot go back to the training set and change g?
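For concreteness, a rough, hypothetical sketch of that split-then-estimate workflow in the text-as-outcome spirit, using a bag-of-words plus topic-model stand-in for g and a simple difference in means. The function, the choice of LDA, and every setting below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def split_sample_effect(docs, treatment, topic_of_interest=0):
    docs = np.asarray(docs, dtype=object)
    treatment = np.asarray(treatment)

    # 1) Split first, so g never sees the test units.
    docs_tr, docs_te, t_tr, t_te = train_test_split(
        docs, treatment, test_size=0.5, random_state=0)

    # 2) Discover g on the training split only (here: bag of words + LDA).
    vec = CountVectorizer(min_df=2)
    g = LatentDirichletAllocation(n_components=5, random_state=0)
    g.fit(vec.fit_transform(docs_tr))

    # 3) Freeze g, apply it to the test split, and estimate the effect there,
    #    e.g. a difference in mean topic prevalence between treated and control.
    y_te = g.transform(vec.transform(docs_te))[:, topic_of_interest]
    return y_te[t_te == 1].mean() - y_te[t_te == 0].mean()
```

Because g is frozen before it touches the test split, the estimate in step 3 cannot depend on the treatment status of the units used to fit g, which is the point of the split.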
A couple of questions on the calculation of the average causal effect:
1) How are average outcomes calculated for cardinal variables?
2) The author explains that to find the mean causal effect using only the factual observations, the number of factual observations has to equal the number of counterfactuals for both the Treatment and Control classes, and that this only happens when the outcomes are not related to the treatment. But that is exactly what we were trying to study from the start: the effect of the treatment on the outcome. Doesn't this become a chicken-and-egg problem?
3) I still don't understand how one can be sure that the outcomes are a result of the treatment and not of another covariate that correlates with the factuals of both the Treatment and Control groups.
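For what it's worth on point 2, this is the standard potential-outcomes algebra the question seems to be pointing at, written in generic notation that may differ from the book's:

```latex
% The average causal effect of a binary treatment T_i on outcome Y_i:
\tau = E\left[\, Y_i(1) - Y_i(0) \,\right]

% With factual data we only observe E[Y_i(1) | T_i = 1] and E[Y_i(0) | T_i = 0].
% Their difference identifies \tau only when assignment is unrelated to the
% potential outcomes, e.g. under randomization:
E\left[\, Y_i(1) \mid T_i = 1 \,\right] - E\left[\, Y_i(0) \mid T_i = 0 \,\right] = \tau
\quad \text{if } \bigl(Y_i(1), Y_i(0)\bigr) \perp\!\!\!\perp T_i
```

Randomization breaks the link between assignment and potential outcomes by design, so there is no chicken-and-egg problem in an experiment; the difficulty the chapters grapple with is what to do when we only have observational text data.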
I don't understand the third assumption of causal inference: "Positivity: that all units have at least some probability of receiving treatment" (p. 267). Is it necessary to make such an assumption?
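For reference, positivity is usually stated in terms of the treatment-assignment probability given covariates, in standard notation that may differ from the book's:

```latex
% Positivity / overlap: every unit has a non-degenerate chance of each treatment status.
0 < \Pr(T_i = 1 \mid X_i = x) < 1 \qquad \text{for all } x \text{ in the support of } X_i
```

Without it, some covariate profiles would appear only among treated (or only among control) units, so their counterfactual outcomes could not be estimated from the data at all, which is why the assumption is needed.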
Would appreciate some clarification of causal inference in general. I am not very sure how causal relationships can be drawn from text without an experimental setting.
The text-as-treatment chapter discusses discovering multiple latent treatments and then measuring each treatment's effect on the outcome. This gives me the impression that isolating only one latent treatment of interest would be too difficult, since one would have to account for everything else in the text that could act as a confounder. Is there a way to do this?
1) I'm interested in how the method of using text data for causal inference compares with randomized experiments. They seem to share some procedures; it seems that with text analysis we want to synthetically recreate a setting very similar to a randomized experiment!
2) Machine learning methods are typically assessed on their out-of-sample predictive power, but in causal inference we are usually asked to give confidence intervals for the causal estimands of interest. How can machine learning make decisions about which treatment is best for a given unit, and about whether a treatment is worth implementing?
Unfortunately I didn't have time this week to get through all the chapters. From a brief skim, I'm wondering whether we can use text in more than one of these roles simultaneously, or if that introduces too much complexity. For example, we might want to know whether a characteristic of a certain document (treatment) is causally linked to the presence of another characteristic later in the document (outcome).
Chapter 24 mentions two kinds of starting points in the causal-inference workflow. Predefining the causal relationships that researchers expect to see and intend to validate makes more sense to me than playing with the training dataset. I learned a lot from the "text as treatment" and "text as outcome" parts. I would really like to know the strengths of textual data in causal inference, and in which respects it outperforms simply quantifying textual features such as word frequency.
I'd like to know more about study designs involving this kind of text-based causal inference. There are multiple ways you can give input to the model: you can train it on different corpora, you can change the contexts when making predictions, and you can change the way it is trained. I think all of these could be some kind of treatment.
The book mentions a "shared task" on p. 237. I am not very clear on what exactly this refers to, and why it is less useful for causal inference than for prediction.
I have a question about doing causal inference using textual data. Word embedding models can help us convert text into high-dimensional numerical data, but I was wondering how we can draw inferences on these dimensions, or on the aggregate X dimension, given that some of the entries come from word vectors and have no exact social-science meaning.
Making causal inferences has long been a very difficult problem, and I believe the authors are proposing a great framework. I do believe that understanding the working mechanisms behind models (like the Lasso and Ridge models) is important. However, I wonder how much more we can get out of different models. (I'm thinking of last week's reading addressing the difference between social science and computer science.)
I'd like to know some fundamental causal-inference frameworks outside the content-analysis context before applying them to content analysis.
There is a lot to unpack in this week's reading. The question that most intrigues me: chapter 23 discusses prediction and chapter 24 covers causal inference; what are the differences between the two, and in which situations is each best applied?
I'd like to read more on how "statistical" causal inference methods are generally different from "contextual" causal inference methods.
I am not quite sure about the difference between causal inference models and regression models. Is it that regression methods lose much of the information embedded in text and non-text data?
The discussion of nowcasting was of interest to me, and I wonder how causal inference models might be used to measure or understand a particular online social media community. What are the limits of attempting to do this?
Post questions here for this week's fundamental readings: Grimmer, Justin, Molly Roberts, Brandon Stewart. 2022. Text as Data. Princeton University Press: Chapters 23, 24, 25, 26, 27 —“Prediction”, “Causal Inference”, “Text as Outcome”, “Text as Treatment”, “Text as Confounder”.