Improved performance of arp-nuisance.R

mcaceresb commented 2 years ago

In practice the bisection in .vlo_vup_dual_fn takes a few dozen iterations, which is the main bottleneck of the function as far as I have been able to profile. We want to find c s.t. max_x {a x + b x c} = c. We can instead iterate by setting c' = - a x / (1 - b x). In practice this takes one or two iterations and speeds up computations by an order of magnitude (I think it is also more precise, since there is only one tolerance involved; with the bisection the tolerance for equality cannot be set too low).

EDIT: I had only benchmarked this in Stata when I made the PR; the speedup in R is a lot more modest. See my comment below. Unsure if the smaller speedup merits changing the internals.

mcaceresb commented 2 years ago

@asheshrambachan I originally made this PR after testing out this change in Stata. In R, however, the speed improvements are nowhere as dramatic (it's along the lines of ~20%, though some jobs ran up to 2x faster; however, this is far from the order of magnitude speedup I saw in Stata). So the increase is nowhere as large as I had thought. The upside is the Stata version runs faster than the R version, but unsure how to speed up the R version farther than this.

Unsure if this smaller speedup merits changing up the internals, but LMK what you think.

asheshrambachan commented 2 years ago

Thanks @mcaceresb!

To make sure I understand how these pull requests speed up the run time of arp-nuisance.R,

.compute_least_favorable_cv: rather than using purrr:map_dfr, you switch the vectorization to call apply over a pre-constructed vector of normal draws.
vlo_vup_dual_fn: rather than running the bisection search, you directly set the update step.

I think the even a 20% speed up is worthwhile -- this is potentially called for every run of arp_nuisance.R, and so if you're constructing confidence intervals via test-inversion over a grid, you're calling this function for each test.

mcaceresb commented 2 years ago

@asheshrambachan

.compute_least_favorable_cv: rather than using purrr:map_dfr, you switch the vectorization to call apply over a pre-constructed vector of normal draws.

I do do this, but this is not the source of the speedup. Rather, pre-computing all the normal draws is the main source of the speedup (a loop would have a similar speed, I think, with the pre-computed draws).

vlo_vup_dual_fn: rather than running the bisection search, you directly set the update step.

Yes, this usually (in testing) cuts the number of function evals by a factor of ~10 when that function is called. (The bisection does ~27 iterations and this usually does the trick in 1-3 iterations, as far as I've been able to tell.)

mcaceresb commented 2 years ago

@asheshrambachan I implemented Jon's idea of doing a bisection on the numerical derivative for the FLCI search, relying on convexity with respect to h (it's a particularly large speedup in cases where the minimum would be at the endpoints, and there's a fallback to the grid search in case something goes wrong). I did another quick round of benchmarks and I'm happier with the speedup so I've moved the pull into ready for review. LMK if you'd like me to post actual times for reference. (As with Stata, the CIs from this branch are not identical but they are very close.)

asheshrambachan commented 2 years ago

Thanks @mcaceresb! This is great. Could you post the times of the speed comparison for future reference?

Also, how large are the differences in CIs from this branch relative to the main? Could you post a comparison again for future reference?

mcaceresb commented 2 years ago

@asheshrambachan Running this file under each branch (master and this one) I get something like this:

There are no substantive differences, I don't think. While for the "large" event study some CIs are noticeably different, I'd be more inclined to think the current branch is more accurate (it's closer to the Stata version). Speed summary below (with current Stata for comparison; it's all non-parallel execution, and in one case R in this branch was faster than Stata):

BC

8 periods, 4 pre and 4 post.

Relative Magnitudes, deviation from parallel trends

Version	Master	Branch	%	Stata
(1)	382.2	109.2	28.6	47.7
(2)	46.1	18.8	40.7	10.1
(3)	58.1	12.0	20.6	4.7

(1) Defaults (2) 100 grid points, -1 to 1 bounds. (3) As (2) with method Conditional.

Non-rm

Version	Master	Branch	%	Stata
(1)	254.1	27.5	10.8	0.7
(2) FLCI	101.2	7.1	7.0	0.2
(2) Conditional	57.3	11.2	19.5	4.2
(2) C-F	155.1	18.1	11.6	4.8
(2) C-LF	27.9	7.0	25.1	3.6

(1) Defaults (2) M from 0 to 0.3 by 0.1, with indicated method

LW

32 periods, 9 pre and 23 post.

Non-rm

Version	Master	Branch	%	Stata
(1)	253.1	36.8	14.5	1.2
(2) FLCI	228.4	96.4	42.2	1.6
(2) Conditional	247.5	52.3	21.1	70.1
(2) C-F	483.0	174.4	36.1	70.2
(2) C-LF	165.4	45.8	27.7	46.49

(1) Defaults (2) M from 0 to 0.04 by 0.005, with indicated method

asheshrambachan commented 2 years ago

Thanks @mcaceresb (and apologies for the delay here)!

For the unit tests on comparing CI length across the master vs. this branch, could you add one more comparison as a benchmark? It would be useful to understand how the CIs differ it l_vec is specified such that the parameter of interest is the average causal effect in the post-period -- i.e., l_vec = rep(1, numPostPeriods)/numPostPeriods. Since that's a leading case we discuss in the paper, it would be useful to know if there are differences for that choice.

After we know this comparison, I'll merge this branch in the master.

mcaceresb commented 2 years ago

@asheshrambachan My code doesn't run all the way in either branch. The issue is here; it seems that without a timeout this the program can loop forever (or at least overnight). Hence this compares the current branch to a modified master branch that adds a timeout to that code.

With that note, I modified this code to do a basic check and a check where l_vec gets the average effect.

There are no substantive differences for the first l_vec (sanity check; same as last time) or for the average l_vec (new check). Can also compare Stata, full log if you like for good measure. Times below.

BC

8 periods, 4 pre and 4 post.

Relative Magnitudes, deviation from parallel trends

Version	Master	Branch	%	Stata
(1)	44.0	17.5	39.8	9.8
(2)	3.2	3.1	96.9	1.3

(1) 100 grid points, -1 to 1 bounds. (2) As (2) with method Conditional.

Non-rm

M from 0 to 0.3 by 0.1, with indicated method

Version	Master	Branch	%	Stata
FLCI	104.6	6.8	6.5	0.4
Conditional	2.9	3.0	103.4	1.3
C-F	110.1	11.5	10.4	1.4
C-LF	6.6	5.0	75.8	1.8

LW

32 periods, 9 pre and 23 post.

Non-rm

M from 0 to 0.04 by 0.005, with indicated method

Version	Master	Branch	%	Stata
FLCI	227.2	104.0	45.8	1.3
Conditional	16.4	17.5	106.7	16.2
C-F	243.3	138.6	57.0	23.2
C-LF	330.8	125.6	38.0	26.4

asheshrambachan / HonestDiD