droodman closed this 5 months ago
Yes, multithreading became noticeably slower with Julia 1.10 (even slower than single-threaded). I don't really understand why. I may switch to single-threaded by default.
OK, I switched gather! to single-threaded in https://github.com/FixedEffects/FixedEffects.jl/pull/64 and it seems to improve performance. Could you test it on your computer? I also created special paths for alpha = 1 just in case, though I'm not sure it does anything.
Nice. Here it is with nthreads=1 and nthreads=6. The gap is mostly closed.
julia> @time reg(df, @formula(delay ~ bkshare_l+ bad_l +cap_l+ liq_l +stab_l+ gov_l+ lntot_l + fe(pseudofirm)&fe(date) + fe(pseudofirm)&fe(pseudobank)), Vcov.cluster(:pseudobank), nthreads=1);
8.652250 seconds (67.73 k allocations: 440.059 MiB, 0.14% gc time)
julia> @time reg(df, @formula(delay ~ bkshare_l+ bad_l +cap_l+ liq_l +stab_l+ gov_l+ lntot_l + fe(pseudofirm)&fe(date) + fe(pseudofirm)&fe(pseudobank)), Vcov.cluster(:pseudobank), nthreads=6);
8.237469 seconds (68.44 k allocations: 440.142 MiB, 0.28% gc time)
However, it does somewhat slow the toughest examples in FixedEffectModels.jl's benchmark.jl. My run time for the last example in that file increased from 1.229923 seconds (5.06 k allocations: 1.381 GiB, 2.64% gc time) to 1.517461 seconds (3.63 k allocations: 1.358 GiB, 2.28% gc time).
When I released reghdfejl, a user wrote to me with an example of it being 2-3X slower than reghdfe. In the example, ~90% of the sample is just dropped as singletons, while the rest consists mostly of near-singletons. It seems extreme, but it's in a published article.

I find that setting nthreads=1 halves the run time. But it could be that a more nuanced combination of single- and multi-threading would be even faster, and would serve users by preventing them from stumbling into an inefficient command specification. For example, the program could use a crude formula to estimate run time with and without multithreading in the particular places where multithreading can badly backfire, and then make the (hopefully) best choice automatically.

Example, using this replication data:
I get 18 seconds for the first, on a machine with 6 performance cores, and 9.6 seconds for the second.
In this example, it looks to me like a lot of the time cost when multithreading is in gather!(), but I'm not certain. At the least, when gather!() is single-threaded, the sums would not need to be stored in tmp and then added to fecoef. They could just be added directly to fecoef, I think.

(Also, would it help to drop multiplication by α since it's always 1? I know this would violate the spirit of defining 5-arg mul!(), but it's an extra operation in the costliest part of the code.)
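
To make the tmp/fecoef point concrete, here is a minimal sketch of the two accumulation strategies. It is not the FixedEffects.jl source: the function names gather_direct! and gather_buffered!, the nchunks keyword, and the one-buffer-per-chunk layout are all assumptions made for illustration.

```julia
# Hypothetical sketch, NOT the FixedEffects.jl implementation.
# refs[i] is the group index of observation i; x[i] is its value.

# Single-threaded: add each observation straight into fecoef.
function gather_direct!(fecoef::Vector{Float64}, refs::Vector{Int}, x::Vector{Float64})
    for i in eachindex(refs, x)
        fecoef[refs[i]] += x[i]
    end
    return fecoef
end

# Multithreaded: each chunk of observations sums into its own buffer
# (playing the role of tmp), and the buffers are then added to fecoef.
function gather_buffered!(fecoef::Vector{Float64}, refs::Vector{Int}, x::Vector{Float64};
                          nchunks::Int = Threads.nthreads())
    n = length(refs)
    tmp = [zeros(length(fecoef)) for _ in 1:nchunks]
    Threads.@threads for c in 1:nchunks
        lo = div((c - 1) * n, nchunks) + 1   # first observation of chunk c
        hi = div(c * n, nchunks)             # last observation of chunk c
        buf = tmp[c]
        for i in lo:hi
            buf[refs[i]] += x[i]
        end
    end
    # Reduction step: nchunks * length(fecoef) extra additions.
    for buf in tmp
        fecoef .+= buf
    end
    return fecoef
end

refs = [1, 2, 1, 3, 2]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
a = gather_direct!(zeros(3), refs, x)
b = gather_buffered!(zeros(3), refs, x)
```

In this sketch the buffered version allocates nchunks buffers and does nchunks × ngroups extra additions in the reduction. When most groups are singletons or near-singletons, ngroups is close to the number of observations, so that overhead is comparable to the main loop itself. Whether this is what actually dominates in the example above would need profiling.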