This PR tweaks `drop_singletons!()` for speed, mainly by:
- Multi-threading the second for loop, which drops the singletons
- Stopping the count of observations in each FE group at 2, and storing the counters in bytes, which reduces allocations and probably uses CPU caches more efficiently
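
To illustrate the second point, here is a minimal sketch of a capped, byte-sized counter. It is not the package's actual `drop_singletons!` code: the function name `drop_singletons_sketch!` and the arguments `keep`, `refs`, and `ngroups` are hypothetical and chosen only for this example. The second pass is left serial here.

```julia
# Hypothetical sketch, not the actual drop_singletons! implementation.
# keep[i]  : whether observation i is still in the sample
# refs[i]  : group index (in 1:ngroups) of observation i for one fixed effect
function drop_singletons_sketch!(keep::AbstractVector{Bool},
                                 refs::AbstractVector{<:Integer},
                                 ngroups::Integer)
    # One byte per group instead of a machine-word integer.
    counts = zeros(UInt8, ngroups)

    # First pass: count kept observations per group, saturating at 2.
    # Only "0, 1, or more than 1" matters for detecting singletons.
    for i in eachindex(keep, refs)
        if keep[i]
            g = refs[i]
            counts[g] = min(counts[g] + 0x01, 0x02)
        end
    end

    # Second pass: drop every observation whose group has exactly one member.
    ndropped = 0
    for i in eachindex(keep, refs)
        if keep[i] && counts[refs[i]] == 0x01
            keep[i] = false
            ndropped += 1
        end
    end
    return ndropped
end

keep = fill(true, 7)
refs = [1, 2, 2, 3, 3, 3, 4]
ndropped = drop_singletons_sketch!(keep, refs, 4)
```

In this example, groups 1 and 4 each contain a single observation, so the first and last entries of `keep` are set to `false`. Because the counter saturates at 2, it can never overflow a `UInt8`, no matter how large a group is.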
This is my 2nd try at this PR; I removed my Manifest.toml.
On the last example in benchmark.jl, on my late-model Windows laptop with 6 performance cores and Julia's nthreads=6, I'm getting run times of about 1.4s instead of 1.45s with this change (using @btime for timings).
Thanks. I will merge it, but I will remove the multithreaded part. I can't believe that it accelerates anything, since the computation done at each index is so small.