I have been working on the regression and found that we have lots of zeros for delay. That is, over our observation horizon, the quarterly delay of most projects in most quarters is zero. I am not sure we can do a linear regression with this data, as the Gaussian assumption is violated...
Hi Jie, I am not sure why this would be an issue? Do we need the dependent variable to be normally distributed? My understanding is that the normality assumption is usually applied to the error term.
Moreover, we are not really interested in predicting individual Y_i's (delays in our setting), but rather in understanding how their distribution changes because of the treatment (quickpay). Section 3.1.2 of Mostly Harmless Econometrics discusses this in detail. See, for example, the following excerpt from page 29:
Regression-CEF theorem tells us that even if the CEF is nonlinear, regression provides the best linear approximation to it. The regression-CEF theorem is our favorite way to motivate regression. The statement that regression approximates the CEF lines up with our view of empirical work as an effort to describe the essential features of statistical relationships, without necessarily trying to pin them down exactly.
Given that the majority of the dependent variable observations (>70%) are identically zero, it is quite unlikely that the error terms are normally distributed. In the linear regressions that you've run, are the error terms normally distributed?
We rely on regression results to test our hypotheses. If our data violates the most basic assumption of linear regression, then I don't know how much I can trust the results. The regression results may be highly affected by the few nonzero observations.
This is not the end of the world. We just need to find the right statistical model for our data. One possibility is to run a multinomial logistic regression on the probabilities that a project does not change its delay, is delayed, or is expedited in a given quarter.
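A minimal sketch of what that could look like with statsmodels (the column names and the bare-bones DID-style specification are placeholders, and fixed effects and clustering are omitted):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# hypothetical data: one row per project-quarter with the quarterly delay in days
df = pd.read_csv("project_quarters.csv")

# three outcome categories: 0 = no change, 1 = delay (> 0 days), 2 = expedite (< 0 days)
df["delay_cat"] = np.select(
    [df["delay_days"] > 0, df["delay_days"] < 0], [1, 2], default=0
)

# bare-bones DID-style regressors; real controls and fixed effects omitted
X = sm.add_constant(df[["treated", "post", "treated_x_post"]])
mnl_res = sm.MNLogit(df["delay_cat"], X).fit()
print(mnl_res.summary())
```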
I'll think more about it and let's discuss during our meeting on Tuesday.
I referred to some of my old notes and textbooks and am sharing a few thoughts:
I take it from your reply that the residuals in the linear regressions are not normally distributed?
One can use OLS to find the mean and variance of coefficient estimates. But to compute p values, a distribution assumption is needed.
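To spell out what I mean, the exact finite-sample inference behind the reported p values uses the fact that, under normal errors,

$$ t_j = \frac{\hat{\beta}_j - \beta_j}{\widehat{\mathrm{se}}(\hat{\beta}_j)} \sim t_{n-k-1}, $$

and this t distribution (hence the p values) is exact only when the errors are normal; otherwise it holds only approximately in large samples, via the central limit theorem.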
Excellent work, Jie! In retrospect, we should have expected this, because changes in delays should not be happening frequently. But seeing it in the data really motivates testing the assumptions. Shall we plot a histogram of the residuals to see how non-normal they look?
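Something along these lines would do it (a sketch; 'ols_res' stands in for whichever fitted statsmodels OLS result we are looking at):

```python
import matplotlib.pyplot as plt
from scipy import stats

resid = ols_res.resid  # residuals from the fitted OLS model (placeholder name)

fig, ax = plt.subplots()
ax.hist(resid, bins=100, density=True)
ax.set_xlabel("OLS residual")
ax.set_ylabel("density")
plt.show()

# quick numerical summary of how far from normal the residuals are
print("skewness:", stats.skew(resid))
print("excess kurtosis:", stats.kurtosis(resid))
print("Jarque-Bera test:", stats.jarque_bera(resid))
```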
What should we do about this? I am not an econometrics expert. All I know is from textbooks, and this issue did not come up in my previous projects. From the textbooks, what Vibhuti is saying resonates with me. Normality confers some desirable properties, and using robust standard errors (and clustered errors in the case of serial correlation) is typically all that is needed when the residuals are not exactly normal.
But I do not know how non-normal the residuals can get before this stops working. For example, when the outcome variable is binary, OLS is not the best approach (even though it still gives the best linear approximation).
I googled OLS with many zero observations and this paper came up: https://www.journals.uchicago.edu/doi/10.1086/701235
Might be helpful?
I take it from your reply that the residuals in the linear regressions are not normally distributed?
My understanding is that the normality assumption applies to the unobserved error term and not to the observed residuals.
I also found this well-cited paper that discusses the issue in a relatively non-technical way. I haven't read it carefully yet but might be useful: https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546
I read the above paper and other references in more detail. The main message, as I understand it, is that violation of the normality assumption is not an issue for inference as long as the sample size is large enough (because the CLT kicks in for any distribution). Here's an excerpt from Wooldridge's introductory text (taken from page 182 here):
Thank you for the papers, Vlad!
As for the residuals and errors, you are right, there is a difference. In a nutshell, residuals are estimates of the errors. When are they good estimates? In the extreme case where we know the data-generating mechanism, e.g., in a simulation, the residuals are exactly the errors because we know the true model. In general, if our model is correct, e.g., it controls for all important factors and does not leave out important confounders, then the residuals are good estimates. One way to check is to examine the association between each covariate and the residuals: the residuals should not show any dependence on a covariate.
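For instance, something like this (a sketch; 'ols_res' and 'X' are placeholders for the fitted OLS result and the design matrix):

```python
import matplotlib.pyplot as plt

resid = ols_res.resid  # residuals from the fitted OLS model (placeholder name)

# note: OLS residuals are uncorrelated with included regressors by construction,
# so what we are really looking for here is nonlinear patterns or omitted terms
for col in X.columns:
    fig, ax = plt.subplots()
    ax.scatter(X[col], resid, s=2, alpha=0.3)
    ax.axhline(0, color="red", linewidth=1)
    ax.set_xlabel(col)
    ax.set_ylabel("residual")
plt.show()
```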
As for the public health paper and the Wooldridge introductory text, the key point is "central limit theorem" and "approximately valid". How good is the approximation? What sample size is large enough for the central limit theorem to kick in? The answer varies from data set to data set. The public health paper provides one example where linear regression gives a good approximation. It is not a proof. Our data is vastly different from the data set in that paper. If we want to establish that linear regression is a good approximation, I think we will need to do the same thing as the public health paper: fit several models and show that linear regression is indeed a good approximation.
In fact, the public health paper notes on p. 157 that "[Normality] is not necessary for least-squares fitting of the regression model but it is required in general for inference making..." What we do in our paper is exactly inference making: the p values determine whether our hypotheses are supported. Given that the majority of our dependent variable observations are identically zero, is our data set large enough to rely on the central limit theorem? What about heteroskedasticity and serial correlation? How do they play into the OLS estimates of the coefficients and standard errors, and into one's resort to the central limit theorem? I honestly don't know.
The first paper points out some ways to deal with the issues we have. I'll read it carefully.
As I suspected, the Wooldridge excerpt is from the chapter that discusses asymptotic properties of OLS, i.e., when the sample size goes to infinity. So the fundamental question seems to be: given our data set, with the majority of dependent-variable observations taking identical values, heteroskedasticity, and autocorrelated errors, are we (and would a reader be) willing to take a leap of faith and trust that our sample size is large enough that all our inferences based on the normal distribution are valid?
Here is an excerpt from the same Wooldridge intro text that discusses statistical inference in finite samples:
I agree the fundamental question now is whether or not we are willing to accept that CLT applies with our sample size. As you mentioned, the answer is subjective and there's no one-size-fits-all approach. We can see what the consensus is during our meeting.
I am sharing my thoughts on the issue:
On page 176, they also briefly discuss the role of sample size:
Unfortunately, there are no general prescriptions on how big the sample size must be before the approximation is good enough. Some econometricians think that n=30 is satisfactory, but this cannot be sufficient for all possible distributions of u. Depending on the distribution of u, more observations may be necessary before the central limit theorem delivers a useful approximation. Further, the quality of the approximation depends not just on n, but on the df, n-k-1: With more independent variables in the model, a larger sample size is usually needed to use the t approximation.
As an extreme case, in the most basic DID model with just 3 regressors, we definitely have more than enough observations to justify invoking the CLT. I am pretty sure that even with all the controls and fixed effects, we have degrees of freedom well above the thresholds discussed in these references.
I may be missing something but personally I haven't come across any mainstream applied empirical work that attempts to justify normality in the data.
As for heteroskedasticity and serial correlation, we address them by using clustered standard errors in the regression. I am sure the underlying methods are not 100% accurate either, but (as far as I know) this is the standard way of doing things in the literature, and I am not aware of alternative approaches.
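Concretely, I believe this is the kind of specification we are running (a sketch; the formula and column names are placeholders, and 'project_id' stands for whatever our cluster variable is):

```python
import statsmodels.formula.api as smf

# DID-style OLS with standard errors clustered at the project level
ols_res = smf.ols("delay_days ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["project_id"]}
)
print(ols_res.summary())
print("residual degrees of freedom:", ols_res.df_resid)  # the n - k - 1 from above
```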
Perhaps it is my ignorance (which is bliss), but I lean towards not being too worried about non-normality. Our sample is large, and we are using robust, clustered standard errors. So, hopefully, the estimates are unbiased and the significance calculations are robust.
But, if we can find ways to test robustness, that would help. Perhaps use a different outcome variable that is more normal.
As for the literature on such zero-inflated data set, the first paper posted by Vlad provides a useful lead.
The Wooldridge examples are not directly comparable to what we have: they have no heteroskedasticity or serial correlation. The book claims that its sample size of ~2000 is enough but, unlike the public health paper, provides no evidence for it. Even if one believes it, how would heteroskedasticity or serial correlation, among other features of our data, affect that threshold? After controlling for heteroskedasticity and serial correlation, do we truly have iid residuals, as required by the central limit theorem?
Aside from those two examples, Wooldridge does not provide a definitive threshold for asymptotics in general settings; the discussion is mostly qualitative. Realistically, no one can provide such a definitive threshold. In fact, the first paper posted by Vlad shows how a Tobit regression yields significantly different results from a linear regression, with a sample size that arguably satisfies the heuristics in Wooldridge.
So we are back to the leap-of-faith question. But I sense that I am probably the only person who is reluctant to take it. Personally, I would like to at least test the robustness of the results using different methods, as in the public health paper. But I am not sure if I/we have time for that. So I am fine with continuing with linear regression if everyone else is. We are in a democracy after all, aren't we?
Hi everyone,
I ran a simple Tobit regression to see if our baseline results are robust to it -- please see here. The estimates are still positive and statistically significant at the 1% level. I specified the model as left-censored at zero, with standard errors clustered at the project level. There was an option for specifying the distribution used to fit the data, and I tried both "gaussian" and "logistic"; the estimates are similar in the two cases. I tried adding fixed effects but it didn't go through, not sure why.
That said, I don't (yet) know how to interpret the results from this model. Chapter 16 of Wooldridge's book on panel data discusses this in great detail, but it's complex to say the least. Some broad takeaways from my (very) preliminary understanding:
Interpretation of regression coefficients in the Tobit model also differs from the linear regression case. In linear regression, coefficients represent the amount of change in the dependent variable when the independent variable changes (holding other model variables constant). In the Tobit model, three conceptualizations of the dependent variable may be of interest: (a) the latent variable (y*), (b) the uncensored (i.e., nonzero) observed y values, and (c) the probability of obtaining noncensored positive values for y.
One consequence of the different approach to estimation is that the Tobit model is not as robust to violations of its assumptions (McBee, 2010) as is linear regression.
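For reference, the latent-variable formulation behind (a)-(c) above is the usual left-censored-at-zero Tobit:

$$ y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad y_i = \max(0,\ y_i^*), $$

so the coefficients describe the latent $y_i^*$ rather than the observed $y_i$ directly.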
Hopefully this helps assuage at least some concerns about robustness, but I am not sure if/why this would be better than OLS.
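In case anyone wants to poke at this independently, here is a minimal sketch of the Gaussian left-censored-at-zero Tobit likelihood in Python (statsmodels has no built-in Tobit, so this is a hand-rolled illustration; it omits the fixed effects and the project-level clustering, and the variable names are placeholders):

```python
import numpy as np
from scipy import stats
from statsmodels.base.model import GenericLikelihoodModel


class TobitLeftZero(GenericLikelihoodModel):
    """Gaussian Tobit, left-censored at zero (illustrative sketch only)."""

    def nloglikeobs(self, params):
        beta, log_sigma = params[:-1], params[-1]
        sigma = np.exp(log_sigma)
        xb = self.exog @ beta
        y = self.endog
        # observations at or below zero contribute P(y* <= 0);
        # uncensored observations contribute the normal density
        ll = np.where(
            y <= 0,
            stats.norm.logcdf(-xb / sigma),
            stats.norm.logpdf((y - xb) / sigma) - log_sigma,
        )
        return -ll

    def fit(self, start_params=None, **kwargs):
        if "log_sigma" not in self.exog_names:
            self.exog_names.append("log_sigma")  # extra scale parameter
        if start_params is None:
            start_params = np.append(np.zeros(self.exog.shape[1]), 0.0)
        return super().fit(start_params=start_params, **kwargs)


# usage (placeholder names): X should contain a constant and the DID regressors
# res = TobitLeftZero(df["delay_days"].to_numpy(), X.to_numpy()).fit(method="bfgs")
# print(res.summary())
```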
Thanks, Vibhuti! Just finished my teaching this week (exam week). Let me go over it carefully and provide my feedback.
I looked into the quarterly delays with nonzero values and plotted their histogram after truncating the data at the 2.5th and 97.5th percentiles to remove outliers (somehow I cannot install the package that does winsorization).
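For reference, the truncation can be done without any extra package; a minimal sketch with placeholder names:

```python
import matplotlib.pyplot as plt

# nonzero quarterly delays only (placeholder DataFrame/column names)
delays = df.loc[df["delay_days"] != 0, "delay_days"]

lo, hi = delays.quantile([0.025, 0.975])
truncated = delays[(delays >= lo) & (delays <= hi)]  # drop the tails...
# winsorized = delays.clip(lower=lo, upper=hi)       # ...or cap them instead

fig, ax = plt.subplots()
ax.hist(truncated, bins=100)
ax.set_xlabel("quarterly delay in days (nonzero observations)")
ax.set_ylabel("count")
plt.show()
```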
As you can see from the histogram (and I verified in the data), the quarterly delay is fairly continuous. It has negative values (days expedited) that run continuously up to -1, and positive values (days delayed) that run continuously up from 1. This means there is no censoring, e.g., contractors reporting a delay or expedition only when it exceeds some threshold. The abundance of zeros in our data simply reflects the (intuitive) fact that delays and expeditions do not happen on a regular basis.
Since we don't have censoring, I don't think the Tobit model is a good fit for our data. Also, we have negative delays, so we cannot use models for count data such as zero-inflated Poisson.
Given the uniqueness of our data set, here are my thoughts for some preliminary robustness checks:
I will work on this and share the results in our next meeting.
Closing the issue as it has been resolved.