Open JunsolKim opened 2 years ago
I don't think I understand how causal text embeddings are different from word embeddings. I understand that the authors build causal BERT by adding a linear mapping and two neural-net layers; however, what exactly is happening under the hood when we say causal text embeddings differ from word embedding models?
+1, would appreciate a detailed explanation of what's under the hood! My not-very-informed view is that they adjust the objective function of the embedding model - so instead of seeking a good representation of all of the raw text, we seek a good representation of the causally relevant parts of the raw text. Also, the authors mention that "the black box nature of the embedding methods makes it difficult for practitioners to assess whether the causal assumptions hold." What are some ways we might assess those assumptions?
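To make the "under the hood" part concrete, here is a minimal PyTorch sketch (not the authors' code; dimensions and names are illustrative) of the kind of architecture being described: a document embedding is passed through a linear mapping, and small two-layer heads predict the treatment (a propensity score) and the outcome under each treatment arm. Training the whole thing jointly with the masked-language-model loss is what pushes the embedding toward the causally relevant content.

```python
# Minimal sketch, assuming PyTorch; not the authors' implementation.
import torch
import torch.nn as nn

class CausalHeads(nn.Module):
    def __init__(self, bert_dim=768, hidden_dim=200):
        super().__init__()
        self.project = nn.Linear(bert_dim, hidden_dim)      # the linear mapping on top of BERT
        self.propensity = nn.Sequential(                    # predicts P(treatment = 1 | embedding)
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.outcome_t0 = nn.Sequential(                    # predicts E[outcome | T = 0, embedding]
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.outcome_t1 = nn.Sequential(                    # predicts E[outcome | T = 1, embedding]
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, doc_embedding):
        z = self.project(doc_embedding)
        return (torch.sigmoid(self.propensity(z)),
                self.outcome_t0(z),
                self.outcome_t1(z))

# Stand-in for BERT's pooled document vectors (batch of 4 documents).
heads = CausalHeads()
fake_doc_embeddings = torch.randn(4, 768)
g_hat, q0_hat, q1_hat = heads(fake_doc_embeddings)
```

The difference from a plain word/document embedding, as I read it, is entirely in the training signal: the representation is penalized both for failing the language-modeling objective and for failing to predict treatment and outcome, so whatever confounding information the text carries gets retained in the embedding.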
This reading, as well as one of the fundamental readings for this class, mines reddit threads for text to analyze. I'm quite surprised to see reddit used this frequently in textual analysis, and I wonder how authors of papers using reddit as a source account for each subreddit's unique culture, demographic makeup, and content. The authors don't seem to mention limitations in this regard at all. Especially given reddit's reputation for serving as an echo chamber for certain political views (consider r/theDonald), I'm surprised the authors were able to conduct such an analysis -- especially on a topic like sexism -- without at least addressing the potential flaws in the methodological setup.
While reading other articles in this class I've had similar concerns to the ones you raise - many datasets, like these subreddit comments, have very unique and specific contexts that largely limit their external validity for answering more general social science questions. However, for this study, my understanding is that the authors are testing the method for empirical evaluation that they propose, using real data from the two examples. In that case, representativeness is less of a concern. Even if they can only draw conclusions about these very specific subreddits, that wouldn't hurt, since what they care about is whether the tools they use can successfully produce representations of text for causal adjustment.
I second this question on how to assess the causal assumptions. I guess I'm still not clear about how causality is inferred using causal text embeddings.
Pretty hard to follow their math without sitting down and working it out for hours. But that aside, wouldn't papers that include theorems just be better papers in general, so that they are accepted more often? Say a researcher added a fairly useless "theorem" to the paper - does this violate their causal inference assumptions, or does the treatment only count if the "theorem" is relevant and important? This could happen if there is a qualitative difference between papers that use theorems and those that don't.
++1 to the embeddings question above! My understanding is that structurally they're not much different, but that the difference is in the content and thus the utility, since the embeddings are crafted specifically for the downstream prediction tasks - the closest analogy I can think of is feature selection in supervised learning (though I could be understanding this completely wrong). My other question is something the paper has already brought up - I'm interested in applying this to estimate how specific, word-level linguistic usage can be isolated as a treatment, both semantically and syntactically.
Here are some concerns I had when reading this article:
1) I'm wondering how much human labor and how many interventions would be involved here to achieve high precision. There are many types of high-dimensional data, and text data is unique since it can be read and interpreted by humans. This means that during this process we can actually stop somewhere, do a hand check or evaluation, and make adjustments for further steps.
2) The conditions in this article sound very idealized. It seems extremely hard to design meaningful representations of text. The goal, in the end, is to eliminate confounding. But if we cannot define and capture the right features in the text, would this introduce more confounding into the causal relationships?
3) Regarding the assumption that "features that are useful for language understanding are also useful for eliminating confounding" - is it actually true?
This is a very interesting paper, and I've never thought about using machine learning to perform causal inference before. An impressive takeaway here is that the paper borrows the idea of weighting the sample by the propensity score from matching methods. Below are my questions: (1) I didn't understand why it's necessary and efficient to reduce the dimension. Is it for regularization? Is it for interpretation purposes? Or does reducing the dimension help researchers make sense of results and decide whether the model is doing the right thing, like "human supervision"? (2) I have applied LDA to perform topic modeling before, and the results were not very satisfying: LDA simply picks some high-frequency words, and the topics it gives rarely make sense. But in this paper, the authors rely on topic modeling to "generate representations that predict the treatment and outcomes well". Is that because they were using a more advanced algorithm?
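On the propensity-weighting point, here is a generic toy example (my own illustration, not the paper's exact estimator) of why weighting by an estimated propensity score removes confounding once a representation that predicts treatment well is available. The "representation" here is just a single simulated variable standing in for whatever the learned embedding captures.

```python
# Toy sketch of propensity weighting; all names and numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                              # stand-in for a learned low-dim representation
g = 1 / (1 + np.exp(-z))                            # true propensity P(T = 1 | z)
t = rng.binomial(1, g)                              # treatment assignment depends on z
y = 2.0 * t + z + rng.normal(scale=0.5, size=n)     # outcome; the true effect is 2.0

naive = y[t == 1].mean() - y[t == 0].mean()         # biased, because z confounds t and y
w = g[t == 0] / (1 - g[t == 0])                     # odds weights reweight untreated units
att = y[t == 1].mean() - np.average(y[t == 0], weights=w)
print(f"naive difference: {naive:.2f}, weighted estimate: {att:.2f}")  # ~2.5 vs ~2.0
```

The dimension reduction question makes more sense to me in this light: the propensity only needs whatever part of the text actually predicts treatment and outcome, so a low-dimensional summary is both easier to fit and closer to the "human supervision" check you describe.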
+++1 to the embeddings thread above!! As far as I know, BERT has multiple layers of transformer encoders, and the output from each layer can be used as a word embedding. The vector representation of the same word is context-dependent in BERT, whereas a static word embedding assigns each word a single representation. My question also relates to "the black box nature of the embedding methods makes it difficult for practitioners to assess whether the causal assumptions hold"; I'm left wondering how to validate the causal relationships for text.
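A quick way to see the context-dependence point is to pull the same word's vector out of two different sentences. This sketch assumes the Hugging Face `transformers` package and the standard public `bert-base-uncased` checkpoint (not necessarily the exact setup in the paper); the sentences and the `embed_word` helper are just for illustration.

```python
# Illustration only: same word, different contexts, different BERT vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    """Return BERT's last-layer vector for the given (single-token) word."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0].item()
    return hidden[position]

v1 = embed_word("She sat by the river bank.", "bank")
v2 = embed_word("He deposited cash at the bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1: the context changes the vector
```

A static word2vec-style table would return the identical vector for "bank" in both sentences, which is exactly the limitation the contextual model avoids.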
The authors mention that they modified BERT to minimize an unsupervised objective - masking words and predicting their identities. What is the intuition behind that?
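For anyone who hasn't seen the masked-language-model objective in action, here is a tiny demo of what "predicting their identities" means, again using the public `bert-base-uncased` checkpoint (my example sentence, not from the paper). The intuition, as I understand it, is that a representation good enough to fill in missing words must already encode a lot of the document's content, including potential confounders.

```python
# Sketch of the masked language modeling objective with a pretrained model.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The reviewers decided to [MASK] the paper."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))   # plausible fillers for the masked word
```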
What is the cost of computing causally sufficient embeddings? I remember that training Doc2Vec is really time-consuming. The authors say that these are low-dimensional document representations, so I'm quite interested in the cost of training them (or, say, fine-tuning them from BERT).
I've been slowly making my through The Book of Why by Judea Pearl, who is cited in this paper. In the book he explains how he developed the do() operator alongside causal diagrams to bring causality back into statistics. He also introduces the concepts of "mediators" and "colliders," which are also referenced in this paper. My question is how fundamental are these concepts to causal inference in the social sciences? Am I likely to start noticing Pearl's notation and terminology everywhere now that I know what to look for?
I wanted to learn more about the direction in which causal inference research is currently headed. Were there other novel approaches to causal inference after Bayesian networks (Judea Pearl) and before word embeddings were used? It is fascinating to see the opposite directions the research is headed in: the methods in this paper (which was published fairly recently) rely on reduced dimensionality, contextual embeddings, and language models. On the other hand, transformer-based language models like GPT-2 and GPT-3 (the latter built on more than 100 billion parameters) embrace the complexity and do not need to be explicitly provided with additional language data. Both seem to perform fairly well. Have there been studies comparing these approaches? How does one decide which model to go with? Is one model preferable over the other for any particular task?
I think it would be good to give a short introduction to neural networks and contextual embeddings, along with the training process of BERT.
Echoing @pranathiiyer, I don't quite get the logic of using causal text embeddings. I'd hope for more elaboration on it, and on why the embedding vectors can presumably capture the confounding part of the text.
If I'm understanding the paper correctly, the authors depend extensively on language modeling (word embeddings and topic modeling). They show that adding those two factors into the model increases performance. However, having personally tried out word embeddings and topic modeling, I believe the results require lots of human interpretation and sense-making, and the model needs to be tuned for each dataset. I understand their points theoretically, but probably need to learn more about the mathematical theory behind the models.
I would also like to see more explanation of how a multi-layer, black-box model can help us discern the treatment effect, and of how we can know which specific linguistic features contribute to the gender bias or paper acceptance.
Very interesting read; I agree that the math is a bit hard to follow. Going off of what Emily says above, I am also curious about how to interpret the before and after of the adjustments they have made to their language modeling methods.
I'm also curious about how causal embeddings differ from normal word embeddings. Why can this embedding method show causal relationships?
Would appreciate further clarification on the intuition behind causal BERT and the interpretability of such causal embedding methods.
This is a really inspiring paper that addresses causal effects in social science questions with textual data. I have questions about potential confounders: how do the authors ensure that they include all substantive confounders so as to address the endogeneity problem? And how do they test that the effect is causal after including the confounders in the models?
This is a very interesting read! (and mathy). I would love to know more about causal text embeddings.
Very interesting ideas! I really like the part where the authors attempt to reach more convincing outcomes by utilizing word embedding methods. I have a question about how we could differentiate the method in the paper from other types of methods such as regression models or RNNs.
Post questions here for this week's orienting readings: Veitch, Victor, Dhanya Sridhar & David M. Blei. 2020. "Adapting Text Embeddings for Causal Inference." Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124.