Poor CPU utilization running many instances of SNOPT solver concurrently

This is related to my problem discussed in #14782, but I'm finding that even after creating a thread-safe version of my IK problem, I'm still getting poor CPU utilization when running in parallel.

I've set up five benchmarks to compare the differences:

1. Running Solve(ik.prog()) NUM_THREADS times on one thread, without any constraints, in a for loop (https://gist.github.com/brawner/9d1512745feeadffd02944580ccfc549#file-parallel-vs-single-core-comparisons-L68-L82)
2. Running Solve(ik.prog()) NUM_THREADS times, without any constraints, across NUM_THREADS threads (https://gist.github.com/brawner/9d1512745feeadffd02944580ccfc549#file-parallel-vs-single-core-comparisons-L84-L108)
3. Running Solve(ik.prog()) single-threaded with a randomly reachable pose constraint for NUM_THREADS iterations, but with an early break if a successful configuration is found (https://gist.github.com/brawner/9d1512745feeadffd02944580ccfc549#file-parallel-vs-single-core-comparisons-L110-L134)
4. Running Solve(ik.prog()) multi-threaded with a randomly reachable pose constraint across NUM_THREADS threads (https://gist.github.com/brawner/9d1512745feeadffd02944580ccfc549#file-parallel-vs-single-core-comparisons-L136-L171)
5. Same as 3, but with no early break out of the for loop (https://gist.github.com/brawner/9d1512745feeadffd02944580ccfc549#file-parallel-vs-single-core-comparisons-L173-L196)
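For readers who don't want to open the gist, the two harness shapes being compared can be sketched roughly as follows. This is a simplified illustration, not the gist's code: `SolveFn`, `RunSerial`, and `RunParallel` are hypothetical names, the callback stands in for "build an IK problem and call Solve() on it", and one `std::thread` per attempt is an assumption about how the parallel case is structured.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for "build a fresh IK problem and call Solve() on it".
// Returns true when the attempt succeeds. This is not Drake's API.
using SolveFn = std::function<bool(int /*attempt index*/)>;

// Shape of benchmarks 1, 3 and 5: attempts run back to back on the calling
// thread. With early_break set (benchmark 3), stop at the first success.
int RunSerial(const SolveFn& solve, int num_attempts, bool early_break) {
  assert(num_attempts >= 0);
  int num_success = 0;
  for (int i = 0; i < num_attempts; ++i) {
    if (solve(i)) {
      ++num_success;
      if (early_break) break;
    }
  }
  return num_success;
}

// Shape of benchmarks 2 and 4: one thread per attempt, all joined before
// returning. The callback must be safe to call concurrently.
int RunParallel(const SolveFn& solve, int num_threads) {
  assert(num_threads >= 0);
  // Each thread writes only its own slot, so no locking is needed here.
  std::vector<char> success(static_cast<std::size_t>(num_threads), 0);
  std::vector<std::thread> threads;
  threads.reserve(success.size());
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back(
        [&solve, &success, i] { success[i] = solve(i) ? 1 : 0; });
  }
  for (std::thread& t : threads) t.join();
  int num_success = 0;
  for (char s : success) num_success += s;
  return num_success;
}
```

Nothing in this sketch takes a lock, so if the real solver calls were fully independent, the parallel version would be expected to scale with the number of cores.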

Results on an AMD Threadripper 3970X (32 cores, 64 threads) with 64 GB RAM:

----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_solve_ik_no_constraints_single_core                       4802640 ns      4801394 ns          145
BM_solve_ik_no_constraints_parallel                          7472375 ns      7292267 ns           95
BM_solve_ik_random_constraints_single_core_early_break       6226976 ns      6225102 ns          100
BM_solve_ik_random_constraints_parallel                     34802849 ns      9778835 ns           70
BM_solve_ik_random_constraints_single_core_no_early_break   86606205 ns     86579752 ns           12

These results are surprising because they suggest that, without any constraints, running 64 iterations in a single-core for loop is faster than running 64 concurrent threads, and that with a pose constraint the single-core loop is not that much worse.

BM_solve_ik_random_constraints_single_core_early_break is significantly faster than running many threads in parallel to arrive at one solution, which is disappointing because I was hoping switching to a multi-threaded version would yield better performance.

You'll notice that the difference in CPU time between benchmarks 4 and 5 is much more pronounced than the difference in wall time, which I suspect means there is a lot of lock contention happening somewhere.
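One way to check utilization independently of the benchmark framework is to time the same work with both a wall clock and a process-wide CPU clock. The sketch below is only an illustration: `Timing`, `MeasureWallAndCpu`, and `AverageBusyCores` are hypothetical helpers, and it assumes a POSIX system, where `std::clock()` reports CPU time summed over all threads of the process (on Windows it behaves differently). It is also worth keeping in mind that Google Benchmark's CPU column, by default, counts only the main thread's CPU time, so it may not reflect work done on worker threads.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <functional>

// Wall-clock and process CPU time for one run of some work.
struct Timing {
  double wall_seconds;
  double cpu_seconds;  // Process CPU time; covers all threads on POSIX.
};

// Runs `work` once and records elapsed wall time and process CPU time.
Timing MeasureWallAndCpu(const std::function<void()>& work) {
  assert(work);
  const auto wall_start = std::chrono::steady_clock::now();
  const std::clock_t cpu_start = std::clock();
  work();
  const std::clock_t cpu_end = std::clock();
  const auto wall_end = std::chrono::steady_clock::now();
  Timing t;
  t.wall_seconds =
      std::chrono::duration<double>(wall_end - wall_start).count();
  t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
  return t;
}

// Average number of cores kept busy during the run. A value near the thread
// count means good utilization; a value near 1 suggests the threads spent
// most of their time waiting.
double AverageBusyCores(const Timing& t) {
  return t.wall_seconds > 0.0 ? t.cpu_seconds / t.wall_seconds : 0.0;
}
```

Wrapping the body of the parallel benchmark in `MeasureWallAndCpu` and printing `AverageBusyCores` would show directly how many of the 64 hardware threads are doing useful work.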

Some notes about the benchmarks: In typical single-core use cases, you wouldn't recreate the IK problem on every iteration, which is what I've done for benchmarks 1 and 3. For benchmark 5, I try to match the contents of the thread with the contents of the for loop, and I still think it illustrates the issue. You would expect benchmark 4 to be up to 64 times faster than benchmark 5, but I've typically seen only about a 3x improvement, as above.
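The gap between the expected and observed scaling can be made explicit with the usual speedup and parallel-efficiency ratios. The helper below is a small illustrative sketch (`Scaling` and `ComputeScaling` are hypothetical names), applied to the wall times from the results table.

```cpp
#include <cassert>

struct Scaling {
  double speedup;     // serial wall time / parallel wall time
  double efficiency;  // speedup / thread count; 1.0 is ideal linear scaling
};

// Computes speedup and parallel efficiency from two wall-clock times.
Scaling ComputeScaling(double serial_ns, double parallel_ns, int num_threads) {
  assert(serial_ns > 0.0 && parallel_ns > 0.0 && num_threads > 0);
  const double speedup = serial_ns / parallel_ns;
  return Scaling{speedup, speedup / num_threads};
}
```

Calling `ComputeScaling(86606205.0, 34802849.0, 64)` with the wall times of benchmarks 5 and 4 gives a speedup of about 2.49 and an efficiency of about 3.9%, compared with the ideal of 64 and 100%.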

Full code at: https://gist.github.com/brawner/9d1512745feeadffd02944580ccfc549

RobotLocomotion / drake

Poor CPU utilization running many instances of SNOPT solver concurrently #14783