
Fast lock-free queue for nodegroup messages #3812

Open nilsdeppe opened 1 month ago

nilsdeppe commented 1 month ago

I did a very simple comparison of the current queue used in Charm++ for nodegroup messages, which I believe is a locking queue, against the lock-free queue at https://github.com/cameron314/concurrentqueue. Charm++'s current queue added around 10% overhead to 1-3 ms messages compared to the linked queue. This is quite significant, but there are tradeoffs (a toy sketch of the comparison appears below).

First, the concurrent queue does not provide a FIFO guarantee at the node level, though Charm++ doesn't guarantee message delivery order anyway, so this is likely fine.

Second, the concurrent queue can put the application in a "busy wait": if the application checks whether a message can be evaluated yet and, if not, reinserts it, a consumer with nothing else in its local queue will spin on that one message. If there are other messages in the local queue this is likely fine, but it could cause issues. We likely need something that tracks whether an insert was a "reinsert" to avoid this spinning (see the second sketch below). I don't know whether this comes up in practice.

Several implementations and benchmark comparisons of different queues are available at https://github.com/max0x7ba/atomic_queue. My read from all this is that the atomic_queues may work if their static sizing can be combined with dynamically growing/shrinking the number of queues, but that is likely a bit more work. The Ramalhete queue from https://github.com/mpoeter/xenium seems like it might actually be the best choice, but I haven't familiarized myself with the implementation enough to assess whether there could be subtle edge cases such as spinning from self-requeue.
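As a rough illustration of the comparison being described (not the actual benchmark, which involved 1-3 ms nodegroup messages under real contention), a minimal single-threaded sketch contrasting a mutex-guarded `std::queue` with `moodycamel::ConcurrentQueue` might look like this; the function names and op count are arbitrary:

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>

#include "concurrentqueue.h"  // from https://github.com/cameron314/concurrentqueue

constexpr int kOps = 1'000'000;  // arbitrary op count for the sketch

// Stand-in for a lock-protected message queue.
static double time_locked() {
  std::queue<int> q;
  std::mutex m;
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kOps; ++i) {
    { std::lock_guard<std::mutex> g(m); q.push(i); }
    { std::lock_guard<std::mutex> g(m); q.pop(); }
  }
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0)
      .count();
}

// The same enqueue/dequeue ping-pong through the lock-free queue.
static double time_lockfree() {
  moodycamel::ConcurrentQueue<int> q;
  int out = 0;
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kOps; ++i) {
    q.enqueue(i);
    q.try_dequeue(out);
  }
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0)
      .count();
}

int main() {
  std::printf("locked queue:    %.3f s\n", time_locked());
  std::printf("lock-free queue: %.3f s\n", time_lockfree());
}
```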
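And here is a minimal sketch of the "track reinserts" idea, assuming a wrapper around `moodycamel::ConcurrentQueue`; `NodeMsg`, `NodeQueue`, and the back-off threshold are hypothetical, not Charm++ API:

```cpp
#include <cstdio>
#include <thread>

#include "concurrentqueue.h"  // from https://github.com/cameron314/concurrentqueue

// Hypothetical message type: the requeue counter records how often the
// message was reinserted because it could not be evaluated yet.
struct NodeMsg {
  int payload = 0;
  int requeue_count = 0;
};

class NodeQueue {
 public:
  void push(const NodeMsg& msg) { queue_.enqueue(msg); }

  // Reinsert a not-yet-evaluable message, remembering that this is a
  // requeue rather than a fresh insert.
  void reinsert(NodeMsg msg) {
    ++msg.requeue_count;
    queue_.enqueue(msg);
  }

  // Pop one message; if it has been requeued many times we are likely
  // spinning on it, so yield (a real scheduler would block or switch
  // to other work instead).
  bool pop(NodeMsg& msg) {
    if (!queue_.try_dequeue(msg)) return false;
    if (msg.requeue_count > kSpinThreshold) std::this_thread::yield();
    return true;
  }

 private:
  static constexpr int kSpinThreshold = 8;  // arbitrary for the sketch
  moodycamel::ConcurrentQueue<NodeMsg> queue_;
};

int main() {
  NodeQueue q;
  q.push(NodeMsg{1, 0});  // pretend odd payloads are not evaluable yet
  q.push(NodeMsg{2, 0});

  NodeMsg m;
  while (q.pop(m)) {
    if (m.payload % 2 != 0 && m.requeue_count < 3) {
      q.reinsert(m);  // dependency not satisfied: requeue and move on
    } else {
      std::printf("processed payload %d (requeued %d times)\n", m.payload,
                  m.requeue_count);
    }
  }
}
```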

There is also the SALSA approach ("SALSA: Scalable and Low Synchronization NUMA-aware Algorithm for Producer-Consumer Pools", https://people.csail.mit.edu/idish/ftp/SALSA-full.pdf), but as far as I know there is no public implementation of it.