perf: use multiple queues and non-blocking work stealing from them
perf: a few retries in the pool
pool: do some spinning in run, not in workers
This makes a measurable difference on test/t_tree_futs.exe at N=18 and N=19.
hyperfine shows we go from 3.6s to 3.06s for N=18, and from 7.7s to 6.23s for N=19.
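
For readers unfamiliar with the idea, here is a minimal sketch of "several queues plus non-blocking stealing with a few retries". It is not the pool's actual code. Every name in it (`create`, `push`, `try_pop`, `find_work`, the `retries` parameter) is hypothetical, and it assumes a simple CAS-based stack per worker (Stdlib `Atomic`, OCaml >= 4.12) instead of whatever queue type the pool really uses.

```ocaml
(* Hypothetical sketch: one lock-free stack per worker, stored as an
   immutable list inside an Atomic.t. Not the pool's real data structure. *)
type 'a queues = 'a list Atomic.t array

let create (n : int) : 'a queues = Array.init n (fun _ -> Atomic.make [])

(* Push retries its compare-and-set until it succeeds. *)
let rec push (qs : 'a queues) (i : int) (x : 'a) : unit =
  let q = qs.(i) in
  let old = Atomic.get q in
  if not (Atomic.compare_and_set q old (x :: old)) then push qs i x

(* Non-blocking pop: a single compare-and-set attempt. [None] means the
   queue was empty or another thread won the race; the caller moves on
   instead of waiting. *)
let try_pop (q : 'a list Atomic.t) : 'a option =
  match Atomic.get q with
  | [] -> None
  | x :: tl as old -> if Atomic.compare_and_set q old tl then Some x else None

(* Scan every queue, starting from our own, for up to [retries] rounds.
   [retries] is an assumed tuning knob, not a value taken from the commit. *)
let find_work ?(retries = 3) (qs : 'a queues) (self : int) : 'a option =
  let n = Array.length qs in
  let rec scan round k =
    if round >= retries then None
    else if k >= n then scan (round + 1) 0
    else
      match try_pop qs.((self + k) mod n) with
      | Some _ as r -> r
      | None -> scan round (k + 1)
  in
  scan 0 0

let () =
  let qs = create 4 in
  push qs 2 "task";
  (* Worker 0 has nothing locally, so it takes the task from queue 2. *)
  assert (find_work qs 0 = Some "task");
  assert (find_work qs 0 = None)
```

The point of the bounded retry loop is that a caller gives up quickly when there is no work and can fall back to a slower path (for example sleeping on a condition variable) rather than spinning forever.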