QuickPay-Operational-Performance / Data-and-code

Data and code for econometric analysis

Analysis on delay/expedition probabilities and delay magnitudes #75

Closed. JNing0 closed this issue 1 year ago.

JNing0 commented 3 years ago

Hi all, please see some preliminary analysis based on our earlier discussion here.

Please leave your comments, and let's discuss at our meeting this Wednesday. Thanks!

vob2 commented 3 years ago

Great! Thank you, Jie.

Question: in Figure 1, what is the most common delay value, around 360 days? Is it 365?

Everyone, I wonder how we should treat this. Is the delay really 365 days, or does the program officer put down 1 year whenever the actual delay is ambiguous and adjust it later?

JNing0 commented 3 years ago

Good point, Vlad. As you guessed, the highest bar in the histogram corresponds to quarterly delays between 360 and 370 days.

Our regression model is able to fully capture that abnormal behavior in [360, 370], and the histogram of the residuals is pretty normal (see figure below). This means that the high frequency of quarterly delays within the [360, 370] bracket is picked up by our model estimates. The question is whether it is picked up by the coefficient of interest, namely the treatment effect of QP. If so, then we need to interpret the results on delay magnitudes with caution, as they may be inflated by this "rounding" behavior. In other words, the treatment effect of QP on delay magnitudes is estimated on the magnitudes of "projected delays" and not necessarily actual delays.

On the bright side, the logistic model that looks at delay probability is unaffected by this behavior. So that result would be trustworthy.

[Figure: histogram of regression residuals]
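A minimal sketch of the two models being discussed (not the repository's actual code); the data frame, file name, and column names `delay_days`, `treat`, `post`, and `quarter` are hypothetical placeholders:

```python
# Sketch of the two-part setup: OLS for delay magnitude, logit for delay probability.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("project_delays.csv")  # hypothetical file name

# Linear model for delay magnitude; C(quarter) plays the role of the fixed time
# effects mu_t, and the coefficient on treat:post is the DiD treatment effect of QP.
ols_fit = smf.ols("delay_days ~ treat:post + treat + C(quarter)", data=df).fit()
print(ols_fit.summary())

# Residual histogram, analogous to the figure above, to check whether the bunching
# around 360-370 days is left in the residuals.
ax = ols_fit.resid.hist(bins=50)
ax.set_xlabel("residual (days)")

# Logistic model for the probability of any delay, which is unaffected by the
# rounding of delay magnitudes.
df["delayed"] = (df["delay_days"] > 0).astype(int)
logit_fit = smf.logit("delayed ~ treat:post + treat + C(quarter)", data=df).fit()
print(logit_fit.summary())
```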

vob2 commented 3 years ago

Great. Yes, this makes sense.

vob2 commented 3 years ago

I rewrote equation (7) slightly differently, combining terms. Here is the result:

[Figure: rewritten version of equation (7)]

This looks more logical to me. But I have several questions:

  • Are we missing \beta_0 terms (marked with ??)?
  • Why do we have both Stage_{it} and dummies for stages? Should we be consistent and write \beta_0 + MS + TS, to make it parallel with other terms?
  • Are we missing a term in line 3 (marked with ??)? Again, logically it should be there.

JNing0 commented 3 years ago

Hi Vlad, your equation is the same as Eq. (7). Note that \mu_t is the fixed time effect for each quarter, so we don't need to have Post_t (the ?? in line 3) or \beta_0 (the ?? in line 1). As for Stage_it, it is the same as having MS and TS; I was being lazy...
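To spell out the absorption argument in generic difference-in-differences notation (this is an illustration, not a transcription of Eq. (7)):

```latex
% Generic two-way DiD specification; notation is illustrative only.
\begin{align*}
y_{it} &= \beta_0 + \beta_1\,\mathit{Treat}_i + \beta_2\,\mathit{Post}_t
        + \beta_3\,(\mathit{Treat}_i \times \mathit{Post}_t) + \mu_t + \varepsilon_{it} \\
\intertext{With a full set of quarter fixed effects $\mu_t$, both $\beta_0$ and
$\beta_2\,\mathit{Post}_t$ are constant within a quarter, so they are absorbed into
$\tilde{\mu}_t \equiv \mu_t + \beta_0 + \beta_2\,\mathit{Post}_t$, leaving}
y_{it} &= \beta_1\,\mathit{Treat}_i + \beta_3\,(\mathit{Treat}_i \times \mathit{Post}_t)
        + \tilde{\mu}_t + \varepsilon_{it}.
\end{align*}
```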


vibhuti6 commented 3 years ago

Thanks, Jie. A couple of minor questions:

JNing0 commented 3 years ago

Hi Vibhuti, the number of observations for the linear regression is as stated in the file: 34,413. The sample size of the logistic regression is around 322,000. (I did not control for many things, such as initial budget, in those regressions, so they have a larger sample size.)

Please see below for the fitting results for the linear models:

[Figures: fitting results for the linear models]

Measuring goodness-of-fit of logistic regressions is more involved and I don't have time for it before our meeting. Probably afterwards.
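For reference, one common goodness-of-fit measure for a logistic regression is McFadden's pseudo R-squared, 1 - llf/llnull. A sketch, continuing with the hypothetical `df` and columns from the earlier snippet:

```python
# Sketch: McFadden's pseudo R^2 for the delay-probability logit.
import statsmodels.formula.api as smf

logit_fit = smf.logit("delayed ~ treat:post + treat + C(quarter)", data=df).fit()
print(f"McFadden pseudo R^2: {1 - logit_fit.llf / logit_fit.llnull:.3f}")
print(f"statsmodels reports this as prsquared: {logit_fit.prsquared:.3f}")
```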

vibhuti6 commented 3 years ago

Great, thank you!

vibhuti6 commented 3 years ago

Hi all,

I have a question about our conversation yesterday regarding the 365-day spike that we see in the data. If I understood correctly, the concern was that large projects could also experience long delays (of, say, more than one year), but these get rounded to 365 days in the data. The same issue is less likely to affect small projects because they are generally of shorter duration. This means that we are underestimating the delays on large projects, which may make the treatment effect appear artificially large.

I am not sure about the last bit on the treatment effect. The baseline difference in delays between large and small projects should be captured by the Treat_i term in the regression. QuickPay only applied to small projects, so I am not clear on why it would bias the (Treat_i x Post_t) term. Just wondering if I am missing something here?

vob2 commented 3 years ago

Hi Vibhuti. Yes, that is exactly the concern. Large projects are more likely to have their observations censored at 1 year. So suppose the QP law caused delays on both large and small projects to increase by 10 days. We will see this for small projects because their delays were below one year (minus the 10-day increase), but we will not see it for large projects because before QP their delay was > 1 year and after QP their delay is still > 1 year, so nothing appears to change.

Our Treat*Post coefficient will be 10 days, but only because we did not observe that post-QP delays on large projects also increased.
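A toy numerical illustration of this censoring mechanism (all numbers made up): QP adds 10 days to both groups, so the true differential effect is zero, yet the estimate comes out at 10 days because large-project delays are capped at one year.

```python
# Toy illustration of censoring at 365 days; all numbers are made up.
CAP = 365  # delays are recorded as at most one year

small_pre, small_post = 200, 210            # small projects: fully observed, +10 days
large_pre_true, large_post_true = 400, 410  # large projects: also +10 days, but above the cap

large_pre_obs = min(large_pre_true, CAP)    # recorded as 365
large_post_obs = min(large_post_true, CAP)  # still recorded as 365

did = (small_post - small_pre) - (large_post_obs - large_pre_obs)
print(did)  # 10, even though the true differential effect is 0
```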

vob2 commented 3 years ago

Hi all,

I chatted with my friend and co-author, who is good at empirical research, about having a significant fraction of zero observations. He had several suggestions. One is to do a conditional analysis, which sounds similar to what Jie did: present the probabilities of delay and of expediting, show that they are significantly non-zero, and then analyze delay magnitudes conditional on a delay occurring (and similarly for expediting).

Another suggestion is to fit a multinomial logit: treat all observations as categorical values (or create bins to aggregate values; in our case we could bin monthly or quarterly, which would also partially address the bunching of observations around 1 quarter, 1/2 year, and 1 year). We give up the ability to interpret delays as numbers, but we can tell whether the non-zero observations are significantly different from the zero observations.
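A rough sketch of this binned multinomial-logit idea, assuming the same hypothetical data frame `df` and column names as in the earlier snippets (the quarterly bin edges are chosen arbitrarily):

```python
# Sketch only: bin delays into quarterly categories and fit a multinomial logit.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# 0 = no delay (or expedited), 1 = up to one quarter late, ..., 5 = more than a year
bins = [-np.inf, 0, 91, 182, 273, 365, np.inf]
df["delay_bin"] = pd.cut(df["delay_days"], bins=bins, labels=False)

exog = sm.add_constant(
    df[["treat", "post"]].assign(treat_post=df["treat"] * df["post"])
)
mnl_fit = sm.MNLogit(df["delay_bin"], exog).fit()
print(mnl_fit.summary())
```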

Just sharing; I am not sure I like the second approach, but perhaps this will spark some other ideas.

JNing0 commented 3 years ago

Our data have answers to most of these conjectures.

The table below breaks down the small vs. non-small business projects with delays from 363 to 368 days. As you can see, the fractions of small and non-small businesses at the peaks, i.e., delays of 365 and 366 days, are comparable to their fractions when the delay is well below or above 365 days. This suggests that small- and non-small-business projects do not exhibit a significant difference in how they report and "round" their delays.

[Table: breakdown of small vs. non-small business projects with delays of 363 to 368 days]

The next table shows the regression results after excluding delay records equal to 365 or 366 days. Comparing it with Table 5 above, we can see that our estimate is quite robust.

[Table: regression results excluding delays of 365 and 366 days]
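A minimal sketch of this robustness check, again assuming the hypothetical `df` and columns from the earlier snippets:

```python
# Sketch: re-estimate the linear model after dropping 365- and 366-day delay records.
import statsmodels.formula.api as smf

df_trimmed = df[~df["delay_days"].isin([365, 366])]
robust_fit = smf.ols("delay_days ~ treat:post + treat + C(quarter)", data=df_trimmed).fit()
print(robust_fit.summary())
```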

JNing0 commented 3 years ago

Thanks for sharing, Vlad. I also prefer the first method.

Frankly, I prefer this method to a simple linear regression that lumps all the data together. I find it more solid statistically. More importantly, it reveals more about the mechanism than the simple average does. We now know that QP increases delays in two ways: (1) by increasing the probability that a project is delayed, and (2) by increasing the magnitude of the delay when a project is delayed. In other words, under QP more projects report a delay, and the reported delays are also longer than without QP. So with this method we can tell a richer story and offer a better understanding of the impact of the QuickPay law, one that exploits the uniqueness of our context.

I may again be in the minority here, but I just thought I'd share it with the group.