The-OpenROAD-Project / OpenSTA

OpenSTA engine
GNU General Public License v3.0
404 stars 172 forks source link

Fewer locks in `DispatchQueue` #224

Closed kbieganski closed 1 month ago

kbieganski commented 7 months ago

DispatchQueue locks the underlying job queue on basically any operation. The main thread locks the mutex when pushing a new job, the worker threads lock the mutex when popping a job off the queue. Thanks to this, jobs can be executed eagerly. However, all the locking is one of the causes of OpenROAD scaling poorly on more than 4 threads.

This PR changes that behavior so that the jobs are run only after they're all in the queue. This means we don't need to do any locking when pushing a new job. Once all jobs are in the queue, we notify the worker threads, and each one works on a chunk of the queue. The only locking that happens is for notifying threads that there is work available/there is work to be done. The worker threads don't lock anything while iterating over the jobs.

Benchmark for global routing on BlackParrot (10 samples each): black_parrot_boxplot

Global routing on Ibex (20 samples, sorry about the variance): ibex_boxplot

rovinski commented 7 months ago

Have you tried using separate mutexes for enqueue/dequeue? That could allow for simultaneous pushing and popping of jobs. You could also consider multiple queues with a round-robin distribution.

https://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.pdf

kbieganski commented 7 months ago

I initially considered that, but rejected the idea as that would still require some locking. But after your comment, I decided to try it, and the results are pretty good:

black_parrot_boxplot-2 ibex_boxplot-2

(dispatch-multiqueue is the one with multiple queues that starts job execution immediately)

I think the original one is still slightly better (a bit faster and simpler), but I'm okay with doing it this way. At this point there are bigger problems to solve as far as multi-threaded scaling goes, and we can always go back and improve DispatchQueue.

BTW I'm working on redoing the benchmarks with master as a baseline as it no longer crashes in multi-threaded mode. I don't expect a big difference, though. Updated the PR description with new graphs comparing to master.

rovinski commented 7 months ago

Thanks for investigating. Faster and simpler is better. It's unfortunate it doesn't scale beyond 8 right now, but faster is always better.

maliberty commented 1 month ago

Issues or PRs should be filed with https://github.com/parallaxsw/OpenSTA if still relevant. This is effectively a fork (though not strictly for historical reasons).