Optimize drop_singletons!()

This PR speeds up drop_singletons!(), mainly by:

- Multi-threading the second for loop, which drops the singletons
- Capping the count of observations in each FE group at 2 and storing the counters as bytes, which reduces allocations and probably uses CPU caches more efficiently
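
To illustrate the two ideas, here is a minimal sketch rather than the actual diff. The function name `drop_singletons_sketch!`, the `refs` vector of group indices, and the `ngroups` argument are hypothetical stand-ins for one fixed effect, and `keep` is assumed to be a `Vector{Bool}` (not a `BitVector`) so that each thread writes to its own byte.

```julia
using Base.Threads: @threads

# Hypothetical sketch: refs[i] is the group index (1..ngroups) of observation i,
# keep[i] says whether observation i is still in the sample.
function drop_singletons_sketch!(keep::Vector{Bool},
                                 refs::AbstractVector{<:Integer},
                                 ngroups::Integer)
    # One byte per group. The count saturates at 2 because we only need to
    # tell apart "0", "1" (a singleton) and "2 or more".
    counts = zeros(UInt8, ngroups)
    for i in eachindex(keep, refs)
        if keep[i]
            r = refs[i]
            counts[r] = min(counts[r] + 0x01, 0x02)
        end
    end

    # Second loop: counts is only read and each iteration writes only keep[i],
    # so iterations are independent and can be split across threads.
    @threads for i in eachindex(keep, refs)
        if keep[i] && counts[refs[i]] == 0x01
            keep[i] = false
        end
    end
    return keep
end

refs = [1, 2, 2, 3, 3, 3, 4]
keep = fill(true, length(refs))
drop_singletons_sketch!(keep, refs, 4)
```

In this toy input, groups 1 and 4 each contain a single observation, so only the first and last entries of `keep` are set to false. The sketch handles one fixed effect in one pass; repeating the pass until no further observations are dropped is left out.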

This is my second attempt at this PR; I have removed my Manifest.toml from it.

On the last example in benchmark.jl, run on my late-model Windows laptop with 6 performance cores and Julia's nthreads=6, run times drop from about 1.45s to about 1.4s with this change (timed with @btime).

FixedEffects / FixedEffectModels.jl

Optimize drop_singletons!() #260