Question on delay - Githubissues

JNing0 commented 3 years ago

I have been working on the regression and found that we have lots of zeros for delay. That is, over our observation horizon, the quarterly delay of most projects at most quarters is zero. I am not sure we can do a linear regression with this data, as the Gaussian assumption is violated...

vibhuti6 commented 3 years ago

Hi Jie, I am not sure why this would be an issue? Do we need the dependent variable to be normally distributed? My understanding is that the normality assumption is usually applied to the error term.

Moreover, we are not really interested in predicting individual Y_i's (delays in our setting), but rather in understanding how their distribution changes because of the treatment (quickpay). Section 3.1.2 of Mostly Harmless Econometrics discusses this in detail. See, for example, the following excerpt from page 29:

Regression-CEF theorem tells us that even if the CEF is nonlinear, regression provides the best linear approximation to it. The regression-CEF theorem is our favorite way to motivate regression. The statement that regression approximates the CEF lines up with our view of empirical work as an effort to describe the essential features of statistical relationships, without necessarily trying to pin them down exactly.

JNing0 commented 3 years ago

Given that the majority of the observations of the dependent variable (>70%) are identically zero, it is quite unlikely that the error terms are normally distributed. In the linear regressions that you've run, are the error terms normally distributed?

We rely on regression results to test our hypotheses. If our data violates the most basic assumption of linear regression, then I don't know how much I can trust the results. The regression results may be highly affected by the few nonzero observations.

This is not the end of the world. We just need to find the right statistical model for our data. One possibility is to do a multinomial logistic regression and consider the probabilities that a project is not delayed, is delayed, or is expedited in a quarter.

I'll think more about it and let's discuss during our meeting on Tuesday.

vibhuti6 commented 3 years ago

I referred to some of my old notes and textbooks, and am sharing a few thoughts:

Normality of error term is not required for OLS to be the "best linear unbiased estimator" (see Gauss-Markov theorem).
It does, however, require homoskedasticity and no serial correlation. These two assumptions are clearly violated in our setting (as discussed in this seminal paper). We adjust for them by using cluster-robust standard errors.
In simple OLS, normality of error terms is only used for inference, together with the assumptions of homoskedasticity and no serial correlation. Based on my reading so far, the methods that adjust for heteroskedasticity and serial correlation don't require normality. I am still looking into it though.
I am, of course, more than happy to run alternative specifications for robustness. But if we are thinking of changing the main model itself, I would be cautious about whether it would give us unbiased and efficient estimates.
If the alternative model also has issues with homoskedasticity and serial correlation, is there a way to adjust the standard errors? Linear models are most widely used in empirical work so there are packages to address these issues in a straightforward way.
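As a minimal sketch of why clustering matters for the standard errors discussed above: the snippet below fits OLS with an intercept and one regressor on synthetic panel data and compares the classical standard error of the slope with a cluster-robust (sandwich) one. The data, the effect size of 0.5, the 60 clusters, and the function name are all made up for illustration; no small-sample correction is applied, and this is not the code used in the project.

```python
import random


def ols_with_cluster_se(x, y, cluster):
    """OLS of y on an intercept and x.

    Returns the slope, its classical SE, and its cluster-robust SE
    (sandwich estimator, no finite-sample correction).
    """
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical SE: assumes homoskedastic, uncorrelated errors.
    sigma2 = sum(u * u for u in resid) / (n - 2)
    se_classical = (sigma2 / sxx) ** 0.5

    # Cluster-robust SE: sum the scores (x - mean_x) * u within each
    # cluster first, then add up the squared cluster totals.
    score = {}
    for xi, ui, g in zip(x, resid, cluster):
        score[g] = score.get(g, 0.0) + (xi - mean_x) * ui
    meat = sum(s * s for s in score.values())
    se_cluster = meat ** 0.5 / sxx
    return slope, se_classical, se_cluster


# Synthetic panel (an assumption, not the project data): 60 clusters,
# 8 periods each, with a cluster-level shock that induces serial correlation.
rng = random.Random(0)
x, y, cluster = [], [], []
for g in range(60):
    treated = g % 2
    cluster_shock = rng.gauss(0, 1)
    for _ in range(8):
        x.append(treated)
        y.append(0.5 * treated + cluster_shock + rng.gauss(0, 1))
        cluster.append(g)

slope, se_classical, se_cluster = ols_with_cluster_se(x, y, cluster)
print(f"slope={slope:.3f} classical SE={se_classical:.3f} cluster SE={se_cluster:.3f}")
```

When errors are correlated within a cluster, the classical formula treats the repeated observations as independent information, so it tends to understate the uncertainty; the clustered version does not rely on that independence.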

JNing0 commented 3 years ago

I take it from your reply that the residuals in the linear regressions are not normally distributed?

One can use OLS to find the mean and variance of coefficient estimates. But to compute p values, a distribution assumption is needed.

vob2 commented 3 years ago

Excellent work, Jie! In retrospect, we should have expected this, because changes in delays should not be happening frequently. But seeing the data, testing the assumptions is really motivating. Shall we plot a histogram of residuals to see how non-normal they look?

What should we do about this? I am not an econometrics expert. All I know is from the textbooks and this issue did not come up in my previous projects. From textbooks, what Vibhuti is saying resonates with me. Normality confers some desirable properties and using robust errors and clustering errors in the case of serial correlation is typically all that is needed when residuals are not exactly normal.

But I do not know how non-normal the residuals must become before this stops working. For example, when the outcome variable is binary, OLS is not the best approach (even though it is the best linear estimate).

I googled OLS with many zero observations and this paper came up: https://www.journals.uchicago.edu/doi/10.1086/701235
Might be helpful?

vibhuti6 commented 3 years ago

> I take it from your reply that the residuals in the linear regressions are not normally distributed?

My understanding is that the normality assumption applies to the unobserved error term and not to the observed residuals.

I also found this well-cited paper that discusses the issue in a relatively non-technical way. I haven't read it carefully yet but might be useful: https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546

vibhuti6 commented 3 years ago

I read the above paper and other references in more detail. The main message, as I understand it, is that violation of the normality assumption is not an issue for inference as long as the sample size is large enough (because the CLT kicks in for any distribution). Here's an excerpt from Wooldridge's introductory text (taken from page 182 here):

[Screenshot: Wooldridge excerpt, page 182 (Screen Shot 2021-10-04 at 7 35 57 PM)]

JNing0 commented 3 years ago

Thank you for the papers, Vlad!

As for the residuals and errors, you are right, there is a difference. In a nutshell, residuals are estimates of the errors. When are they good estimates? In the extreme case where we know the data-generating mechanism, e.g., in a simulation, our residuals are exactly the errors, since we know the true model. In general, if our model is correct, e.g., it controls for all important factors and does not leave out important confounding factors, then our residuals are good estimates. One way to check is to examine the association between each covariate and the residuals. There should not be any dependence of the residuals on a covariate.
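A minimal sketch of the residual check described above, on synthetic data. OLS residuals are linearly uncorrelated with every included covariate by construction, so the check has to look for nonlinear patterns, for example by comparing mean residuals across bins of a covariate. The quadratic data-generating process and the bin cut-offs below are assumptions chosen only to make the pattern visible.

```python
import random

rng = random.Random(2)
n = 3000
x = [rng.uniform(-2, 2) for _ in range(n)]
# Assumed true model is quadratic in x; the fit below wrongly omits x**2.
y = [1 + 0.5 * xi + xi ** 2 + rng.gauss(0, 1) for xi in x]

# Fit the misspecified linear model y = a + b * x by OLS.
mean_x = sum(x) / n
mean_y = sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]


def mean_residual(condition):
    """Average residual over observations whose x satisfies `condition`."""
    picked = [r for xi, r in zip(x, residuals) if condition(xi)]
    return sum(picked) / len(picked)


low = mean_residual(lambda v: v < -1)
middle = mean_residual(lambda v: -1 <= v <= 1)
high = mean_residual(lambda v: v > 1)
print(f"mean residual by bin: low={low:.2f} middle={middle:.2f} high={high:.2f}")
```

If the model were correctly specified, the three bin means would all be close to zero; a systematic pattern across bins suggests the residuals still depend on the covariate.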

As for the public health paper and the Wooldridge introductory text, the key points are "central limit theorem" and "approximately valid". How good is the approximation? What sample size is large enough for the central limit theorem to kick in? The answer varies from data set to data set. The public health paper provides one example where linear regression provides a good approximation. It is not a proof. Our data is vastly different from the data set in that paper. If we want to establish that linear regression is a good approximation, I think we'll need to do the same thing as in the public health paper: fit several models and show that linear regression is indeed a good approximation.

In fact, the public health paper quotes on p. 157 the statement that "[Normality] is not necessary for least-squares fitting of the regression model but it is required in general for inference making..." What we do in our paper is exactly inference making. The p values carry substantial weight in how we test our hypotheses. Given that the majority of the observations of our dependent variable are identical, is our data set large enough to rely on the central limit theorem? What about the heteroskedasticity and serial correlation? How do they play into the OLS estimates for the coefficients and standard errors? How do they affect one's recourse to the central limit theorem? I honestly don't know.

The first paper points out some ways to deal with the issues we have. I'll read it carefully.

JNing0 commented 3 years ago

As I suspected, the Wooldridge excerpt is from the chapter that discusses asymptotic properties of OLS, i.e., when the sample size goes to infinity. So the fundamental question seems to be: given our data set, with the majority of the dependent-variable observations taking identical values, heteroskedasticity, and auto-correlated errors, are we (or would a reader be) willing to take a leap of faith and trust that our sample size is large enough for all our inferences based on the normal distribution to be correct?
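One way to put numbers on the "large enough" question is a small Monte Carlo experiment: simulate a zero-heavy outcome with no true treatment effect, run a normal-approximation test many times, and see how often it falsely rejects at the nominal 5% level. The sketch below is only illustrative. The 70% share of zeros follows the thread, but the distribution of nonzero values, the group sizes, and the function names are assumptions, and the draws are i.i.d., so it says nothing about heteroskedasticity or serial correlation.

```python
import random


def draw_delay(rng, p_zero=0.7):
    """One synthetic quarterly delay: zero with probability p_zero, else skewed."""
    if rng.random() < p_zero:
        return 0.0
    magnitude = rng.expovariate(1 / 30)  # assumed mean of 30 days
    return magnitude if rng.random() < 0.8 else -magnitude


def mean_and_variance(values):
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def false_rejection_rate(n_per_group, n_reps=400, seed=1):
    """Share of replications with |t| > 1.96 when there is no true effect."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        group_a = [draw_delay(rng) for _ in range(n_per_group)]
        group_b = [draw_delay(rng) for _ in range(n_per_group)]
        mean_a, var_a = mean_and_variance(group_a)
        mean_b, var_b = mean_and_variance(group_b)
        se = ((var_a + var_b) / n_per_group) ** 0.5
        # se == 0 happens only if every draw is zero; count it as no rejection.
        if se > 0 and abs(mean_a - mean_b) / se > 1.96:
            rejections += 1
    return rejections / n_reps


rates = {n: false_rejection_rate(n) for n in (20, 500)}
for n, rate in rates.items():
    print(f"n per group = {n}: false rejection rate = {rate:.3f}")
```

A rate close to 0.05 suggests the normal approximation is working at that sample size for this particular synthetic distribution; a rate far from 0.05 suggests it is not. Repeating the exercise with a data-generating process closer to the real panel (clusters, serial correlation) would be the more relevant test.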

JNing0 commented 3 years ago

Here is an excerpt from the same Wooldridge intro text that discusses statistical inference in finite samples:

vibhuti6 commented 3 years ago

I agree the fundamental question now is whether or not we are willing to accept that CLT applies with our sample size. As you mentioned, the answer is subjective and there's no one-size-fits-all approach. We can see what the consensus is during our meeting.

I am sharing my thoughts on the issue:

In the Wooldridge chapter on Asymptotic properties of OLS, they give examples with sample sizes of ~2,000 to argue that CLT would hold. I have included a screenshot below, and pages 174-176 have the full discussion. Our data is less skewed than this, and our sample size is orders of magnitude higher.

[Screenshot: Wooldridge example, pages 174-176 (Screen Shot 2021-10-04 at 9 53 14 PM)]

On page 176, they also briefly discuss the role of sample size:

Unfortunately, there are no general prescriptions on how big the sample size must be before the approximation is good enough. Some econometricians think that n=30 is satisfactory, but this cannot be sufficient for all possible distributions of u. Depending on the distribution of u, more observations may be necessary before the central limit theorem delivers a useful approximation. Further, the quality of the approximation depends not just on n, but on the df, n-k-1: With more independent variables in the model, a larger sample size is usually needed to use the t approximation.

As an extreme case, in the most basic DID model with just 3 regressors, we definitely have more than enough observations to justify CLT. I am pretty sure even with all the controls and fixed effects, we have degrees of freedom well above the thresholds discussed in these references.
I may be missing something but personally I haven't come across any mainstream applied empirical work that attempts to justify normality in the data.

As for heteroskedasticity and serial correlation, we address them by using clustered standard errors in the regression. I am sure the underlying methods are not 100% accurate either, but (as far as I know) it is the standard way of doing things in the literature and I am not aware of alternative approaches.

vob2 commented 3 years ago

Perhaps from my ignorance (which is bliss), but I do lean towards not being too worried about non-normality. Our sample is large, we are using robust errors, we are clustering errors. So, hopefully the estimates are unbiased and significance calculations are robust.

But, if we can find ways to test robustness, that would help. Perhaps use a different outcome variable that is more normal.

JNing0 commented 3 years ago

As for the literature on such zero-inflated data sets, the first paper posted by Vlad provides a useful lead.

The Wooldridge examples are not directly comparable to what we have. They have no heteroskedasticity or serial correlation. The book claims that its sample size of ~2000 is enough but, unlike the public health paper, provides no evidence for it. Even if one believes it, how would heteroskedasticity or serial correlation, among other things in our data, affect the threshold? After controlling for heteroskedasticity and serial correlation, do we truly have iid residuals, as is required in the central limit theorem?

Aside from those two examples, Wooldridge does not provide a definitive threshold for asymptotics in general settings. The discussion is mostly qualitative. Realistically, no one can provide such a definitive threshold. In fact, the first paper posted by Vlad shows how a Tobit regression yields significantly different results from a linear regression, with a sample size that arguably satisfies the heuristics in Wooldridge.

So we are back to the leap-of-faith question. But I sense that I am probably the only person who is reluctant to take it. Personally, I would like to at least test the robustness of the results using different methods, as in the public health paper. But I am not sure if I/we have time for that. So I am fine with continuing with linear regression, if everyone else is. We are in a democracy after all, aren't we?

vibhuti6 commented 3 years ago

Hi everyone,

I ran a simple Tobit regression to see if our baseline results are robust to it -- please see here. The estimates are still positive and statistically significant at the 1% level. I set the parameters to be "left censored" at zero, and also specified the cluster at project level. There was an option for specifying distribution to fit the data, and I used "gaussian" and "logistic". The estimates are similar in the two cases. I tried adding fixed effects but it didn't go through, not sure why.

As such, I don't (yet) know how to interpret the results from this model. Chapter 16 of Wooldridge's book on panel data discusses this in great detail, but it's complex to say the least. Some broad takeaways from my (very) preliminary understanding:

Tobit is used when there's censoring in the data. That is, true values are never observed for part of the population. I don't think that happens in our setting because all deadlines (whether extended or moved up) get recorded.
Estimates from OLS and Tobit are not directly comparable. Here's an excerpt from the paper that Vlad shared:

Interpretation of regression coefficients in the Tobit model also differs from the linear regression case. In linear regression, coefficients represent the amount of change in the dependent variable when the independent variable changes (holding other model variables constant). In the Tobit model, three conceptualizations of the dependent variable may be of interest: (a) the latent variable (y*), (b) the uncensored (i.e., nonzero) observed y values, and (c) the probability of obtaining noncensored positive values for y.
Tobit also assumes normality and homoskedasticity, and is estimated by maximum likelihood estimation. More importantly, the estimates are not consistent if either condition is violated, which is stronger than what OLS requires:

One consequence of the different approach to estimation is that the Tobit model is not as robust to violations of its assumptions (McBee, 2010) as is linear regression.
There are also apparently some fundamental issues with fixed effects Tobit model, e.g. see this paper
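For reference, the points above about censoring and normality can be read off the textbook Tobit log-likelihood for an outcome left-censored at zero: censored observations contribute the normal probability that the latent value is at or below zero, and uncensored ones contribute the normal density. The sketch below is a generic illustration on synthetic data with one regressor; the parameter values and function names are assumptions, and it is not the routine used for the results reported in this thread.

```python
import math
import random


def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def normal_logpdf(z):
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)


def tobit_loglik(y, x, intercept, slope, sigma):
    """Log-likelihood of a Tobit model left-censored at zero.

    Latent model: y* = intercept + slope * x + e, e ~ N(0, sigma^2),
    observed y = max(0, y*).
    """
    total = 0.0
    for yi, xi in zip(y, x):
        mu = intercept + slope * xi
        if yi <= 0:
            # Censored: only know that the latent value is <= 0.
            total += math.log(normal_cdf(-mu / sigma))
        else:
            # Uncensored: ordinary normal density of the observed value.
            total += normal_logpdf((yi - mu) / sigma) - math.log(sigma)
    return total


# Synthetic censored data (assumed parameters: intercept -1, slope 2, sigma 1).
rng = random.Random(3)
x = [rng.uniform(0, 2) for _ in range(500)]
latent = [-1 + 2 * xi + rng.gauss(0, 1) for xi in x]
y = [max(0.0, v) for v in latent]

ll_true = tobit_loglik(y, x, intercept=-1, slope=2, sigma=1)
ll_flat = tobit_loglik(y, x, intercept=-1, slope=0, sigma=1)
print(f"log-likelihood at true parameters: {ll_true:.1f}")
print(f"log-likelihood with slope set to 0: {ll_flat:.1f}")
```

Both branches of the likelihood are built from the normal distribution with a single sigma, which is where the model's dependence on normality and homoskedasticity comes from.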

Hopefully this helps assuage at least some concerns about robustness, but I am not sure if/why this would be better than OLS.

JNing0 commented 3 years ago

Thanks, Vibhuti! Just finished my teaching this week (exam week). Let me go over it carefully and provide my feedback.

JNing0 commented 3 years ago

I looked into the quarterly delays with nonzero values and plotted the histogram after trimming the data below the 2.5th percentile and above the 97.5th percentile to remove outliers (somehow I cannot install the package that does winsorization).
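For anyone reproducing this step without an extra package, here is a minimal sketch of both trimming (dropping values outside the 2.5th and 97.5th percentiles) and winsorizing (clamping them to those percentiles) using only the standard library. The linear-interpolation percentile rule and the toy data are assumptions; statistical packages may use slightly different percentile definitions.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def trim(values, lo_q=2.5, hi_q=97.5):
    """Drop observations outside the [lo_q, hi_q] percentile range."""
    s = sorted(values)
    lo, hi = percentile(s, lo_q), percentile(s, hi_q)
    return [v for v in values if lo <= v <= hi]


def winsorize(values, lo_q=2.5, hi_q=97.5):
    """Clamp observations to the [lo_q, hi_q] percentile range."""
    s = sorted(values)
    lo, hi = percentile(s, lo_q), percentile(s, hi_q)
    return [min(max(v, lo), hi) for v in values]


# Toy stand-in for nonzero quarterly delays in days (not the project data).
nonzero_delays = [d for d in range(-100, 101) if d != 0]
trimmed = trim(nonzero_delays)
winsorized = winsorize(nonzero_delays)
print(len(nonzero_delays), len(trimmed), len(winsorized))
```

Trimming changes the sample size, while winsorizing keeps every observation and only caps the extremes, which matters if the cleaned series is later merged back into a panel.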

As you can see from the histogram (and I verified in the data), the quarterly delay is pretty continuous. It has negative numbers (meaning expedition in days) that increase continuously to -1, and positive numbers (meaning delay in days) that grow continuously from one. This means that there is no censoring of the kind where, e.g., a contractor does not report a delay/expedition unless it is over a threshold. The abundant number of zeros in our data simply reflects the (intuitive) fact that delay/expedition does not happen on a regular basis.

Since we don't have censoring, I don't think Tobit model is a good fit for our data. Also we have negative delays, so we cannot use models for count data such as zero-inflated Poisson.

Given the uniqueness of our data set, here are my thoughts for some preliminary robustness checks:

Run linear regressions on nonzero quarterly delays. The results would tell us the effect of QP on nonzero quarterly delays. That is, on average, does QP increase/decrease the extent of quarterly delay when a project does delay?
Run logistic regressions on the probabilities that a project delays or expedites. Here we'll use the full sample and divide the quarterly delay into three groups: zero delay, positive delay, and negative delay. The results would tell us the effect of QP on the probability that a project delays or expedites in a quarter.
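Before fitting the two models proposed above, a model-free version of the same split can serve as a sanity check: classify each quarterly delay as unchanged, delayed, or expedited, compare the category shares across groups, and compare the mean delay among nonzero observations. The sketch below uses made-up toy numbers and hypothetical group labels; the actual regressions would add controls, fixed effects, and clustered standard errors.

```python
from collections import Counter

CATEGORIES = ("no change", "delayed", "expedited")


def delay_category(delay):
    """Map a quarterly delay in days to one of three outcome groups."""
    if delay > 0:
        return "delayed"
    if delay < 0:
        return "expedited"
    return "no change"


def summarize(delays):
    """Return category shares and the mean delay among nonzero observations."""
    counts = Counter(delay_category(d) for d in delays)
    shares = {c: counts.get(c, 0) / len(delays) for c in CATEGORIES}
    nonzero = [d for d in delays if d != 0]
    mean_nonzero = sum(nonzero) / len(nonzero) if nonzero else None
    return shares, mean_nonzero


# Toy data (hypothetical): quarterly delays in days for two groups.
treated = [0, 0, 0, 12, 0, -5, 0, 30, 0, 0]
control = [0, 0, 0, 0, 7, 0, 0, -3, 0, 0]

shares_treated, mean_treated = summarize(treated)
shares_control, mean_control = summarize(control)
print("treated:", shares_treated, mean_treated)
print("control:", shares_control, mean_control)
```

The shares correspond to the outcome of the three-category logistic regression, and the conditional mean corresponds to the outcome of the linear regression on nonzero delays.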

I will work on this and share the results in our next meeting.

JNing0 commented 3 years ago

closing the issue as it has been resolved.

QuickPay-Operational-Performance / Data-and-code

Question on delay #74