Poor multithreading performance with very small FE groups

droodman commented 6 months ago

When I released reghdfejl, a user wrote to me with an example in which it was 2-3x slower than reghdfe. In the example, ~90% of the sample is dropped as singletons, and most of the remaining observations fall in near-singleton groups. It seems extreme, but it comes from a published article.

I find that setting nthreads=1 halves the run time. But a more nuanced combination of single- and multithreading might be even faster, and would keep users from stumbling into an inefficient command specification. For example, the program could use a crude formula to estimate run time with and without multithreading in the particular places where multithreading can badly backfire, and then make the (hopefully) best choice automatically.
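To make the idea concrete, here is a minimal sketch of such a crude rule. Everything in it is hypothetical: `choose_nthreads`, its `min_obs_per_group` cutoff, and the default value of 4 are illustrative assumptions, not anything in FixedEffectModels.jl, and a real cutoff would need to be tuned by benchmarking.

```julia
# Hypothetical heuristic (not part of FixedEffectModels.jl):
# fall back to one thread when the fixed-effect groups are, on average, tiny.
function choose_nthreads(refs::AbstractVector{<:Integer}, ngroups::Integer;
                         max_threads::Integer = Threads.nthreads(),
                         min_obs_per_group::Real = 4)  # assumed cutoff, untuned
    # Average number of observations per fixed-effect group.
    avg_group_size = length(refs) / max(ngroups, 1)
    # Near-singleton groups: per-thread overhead is unlikely to pay off.
    return avg_group_size < min_obs_per_group ? 1 : max_threads
end
```

With a sample like the one above, where most surviving groups are near-singletons, this rule would pick a single thread; with a few large groups it would keep the requested thread count.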

Example, using this replication data:

using DataFrames, StatFiles, StatsModels, FixedEffectModels
df = DataFrame(load("pseudoreg.dta"))
df.lntot_l = log.(df.tot_size_l)
@time reg(df, @formula(delay ~ bkshare_l+ bad_l +cap_l+ liq_l +stab_l+ gov_l+ lntot_l + fe(pseudofirm)&fe(date) + fe(pseudofirm)&fe(pseudobank)), Vcov.cluster(:pseudobank), nthreads=6);
@time reg(df, @formula(delay ~ bkshare_l+ bad_l +cap_l+ liq_l +stab_l+ gov_l+ lntot_l + fe(pseudofirm)&fe(date) + fe(pseudofirm)&fe(pseudobank)), Vcov.cluster(:pseudobank), nthreads=1);

I get 18 seconds for the first (nthreads=6), on a machine with 6 performance cores, and 9.6 seconds for the second (nthreads=1).

In this example, it looks to me like much of the time cost under multithreading is in gather!(), but I'm not certain. At the least, when gather!() is single-threaded, the sums would not need to be stored in tmp and then added to fecoef. They could just be added directly to fecoef, I think.
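A simplified sketch of the two approaches, to show what I mean. This is not the actual FixedEffects.jl code: `gather_buffered!` and `gather_direct!` are made-up names, and the chunking scheme is an assumption for illustration only.

```julia
# Buffered version: each chunk accumulates into its own column of tmp,
# and the columns are then reduced into fecoef.
function gather_buffered!(fecoef, refs, y, nchunks)
    tmp = zeros(eltype(fecoef), length(fecoef), nchunks)
    chunksize = max(1, cld(length(y), nchunks))
    chunks = collect(Iterators.partition(eachindex(y), chunksize))
    Threads.@threads for c in eachindex(chunks)
        for i in chunks[c]
            tmp[refs[i], c] += y[i]
        end
    end
    fecoef .+= vec(sum(tmp, dims = 2))  # extra pass over ngroups × nchunks entries
    return fecoef
end

# Direct version: single-threaded, no temporary buffer at all.
function gather_direct!(fecoef, refs, y)
    @inbounds for i in eachindex(refs, y)
        fecoef[refs[i]] += y[i]
    end
    return fecoef
end
```

In this sketch the buffer has one entry per group per chunk, so when nearly every group is a singleton, zeroing and reducing tmp costs about as much as the accumulation itself. That may be part of what happens in the real code, but I have not verified it.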

(Also, would it help to drop multiplication by α since it's always 1? I know this would violate the spirit of defining 5-arg mul!(), but it's an extra operation in the costliest part of the code.)
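One way to keep the 5-arg mul!() contract (C = A*B*α + C*β) and still skip the multiplication is to branch once on α outside the hot loop. A minimal sketch, with the hypothetical name `gather_scaled!` standing in for the real kernel:

```julia
# Hypothetical kernel: fecoef[g] += α * (sum of y over group g),
# with a fast path that avoids the multiply when α == 1.
function gather_scaled!(fecoef, refs, α, y)
    if isone(α)
        @inbounds for i in eachindex(refs, y)
            fecoef[refs[i]] += y[i]          # no multiplication in the inner loop
        end
    else
        @inbounds for i in eachindex(refs, y)
            fecoef[refs[i]] += α * y[i]      # general path, still correct for any α
        end
    end
    return fecoef
end
```

The branch is evaluated once per call rather than once per observation, so the general case pays essentially nothing for it. Whether the saved multiply is measurable is something I have not benchmarked.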

matthieugomez commented 6 months ago

Yes, multithreading became noticeably slower with Julia 1.10 (even slower than single-threaded). I don't really understand why. I may switch to single-threaded by default.

matthieugomez commented 6 months ago

OK, I switched to single-threaded for gather! in https://github.com/FixedEffects/FixedEffects.jl/pull/64 and it seems to improve performance. Could you test it on your computer? I also created special paths for alpha = 1 just in case, though I'm not sure it does anything.

droodman commented 6 months ago

Nice. Here it is with nthreads=1 and nthreads=6. The gap is mostly closed.

julia> @time reg(df, @formula(delay ~ bkshare_l+ bad_l +cap_l+ liq_l +stab_l+ gov_l+ lntot_l + fe(pseudofirm)&fe(date) + fe(pseudofirm)&fe(pseudobank)), Vcov.cluster(:pseudobank), nthreads=1);
  8.652250 seconds (67.73 k allocations: 440.059 MiB, 0.14% gc time)

julia> @time reg(df, @formula(delay ~ bkshare_l+ bad_l +cap_l+ liq_l +stab_l+ gov_l+ lntot_l + fe(pseudofirm)&fe(date) + fe(pseudofirm)&fe(pseudobank)), Vcov.cluster(:pseudobank), nthreads=6);
  8.237469 seconds (68.44 k allocations: 440.142 MiB, 0.28% gc time)

However, it does somewhat slow the toughest examples in FixedEffectModels.jl's benchmark.jl. My run time for the last example in that file increased by about 23%, from 1.229923 seconds (5.06 k allocations: 1.381 GiB, 2.64% gc time) to 1.517461 seconds (3.63 k allocations: 1.358 GiB, 2.28% gc time).

FixedEffects / FixedEffectModels.jl, issue #262