Araq / malebolgia

Malebolgia creates new spawns. Structured concurrency. Thread pools and related APIs.
MIT License
104 stars 10 forks source link

experiment: spin locks #22

Closed hugosenari closed 11 months ago

hugosenari commented 11 months ago

Disclaimer:

This is another experiment, I would not expect it to be approved

Motivation: Same as from #21

The initial idea was to use spin locks instead of locks, following this post recommendations, I'm using TTAS instead of TAS, I would like to use PAUSE, but is compiler/platform specific.

Unexpected Outcome:

It went pretty bad, worse than current implementation. So I created another version using more spin locks (like I did for #21), one per thread, so they almost won't fight with each other. And that one went very well.

Results

We can't compare it with #21 result because I changed the test to set EPOCH after waitAll and before spawn, is more accurate to the test intent (check how much time we are spending between spawn)

The benchmark results,

Since only spin locks makes results worse, so I focus on second implementation spin_doctors, it speeds up the results between 3~5 times.

But still no perfect:

  1. T1E0 - T0E1 increased because T0E1 decreased 75% percentile from ~19767ns, to ~2724ns
  2. OP increased because spin locks uses 100% (of CPU)
    • For this test 1 run with 3 spawns, means 5 others are just spinning.
    • Maybe is why I get better results with more spawns

Suggestion: Some sleep, no stealing.

I know that threads let us sleep using nanoseconds, we could sleep after N runs without any task, this could also help timeout.

There is no task to steal, because I'm only scheduling the same amount of threads as tasks

Araq commented 11 months ago

There is cpuRelax() and it's mapped to PAUSE or similar, enjoy.

So I think the takeaway is that my stuff is very hard to improve upon? :P

Please note that it is important that eventually there is no busy-loop and the OS is allowed to pause threads as this can be important for energy consumption and Malebolgia is supposed to run well on embedded systems. However, a runtime flag like useBusyLoop: bool could be added to the master object.

hugosenari commented 11 months ago

There is cpuRelax() and it's mapped to PAUSE or similar, enjoy.

Updated, thanks,

So I think the takeaway is that my stuff is very hard to improve upon? :P

You are lazy naming variables, other that, you try to do your best, so there is less room for improvement (in performance) ;-)

However, a runtime flag like useBusyLoop: bool could be added to the master object.

I added suggestions for a wait/signal, but I forgot to review send iterator It is currently round-robin, means no matter if thread01 is free I set task to next one, this would wake up all threads instead of use ones that are already up.

Round-robin:

Start from 0 always

Araq commented 11 months ago

You are lazy naming variables, other that, you try to do your best, so there is less room for improvement (in performance)

I don't really. This stuff is super hard so I tried to write it as simple as possible, though I'm not afraid of atomic instructions.

hugosenari commented 11 months ago

Closing as experiment completed,

Next one, should be work stealing, but you already played with it, and mention as slower. So I am not motivated.

I'm not afraid of atomic instructions.

I couldn't say the same, I'm still on page 183, and parallelism is only in page 253.

Araq commented 11 months ago

Next one, should be work stealing, but you already played with it, and mention as slower. So I am not motivated.

But I didn't use a lockfree queue for it and only tested it on a 8 core M1. Work stealing scales much better for bigger hardware. It shouldn't be hard to beat the existing implementation.

hugosenari commented 11 months ago

But I didn't use a lockfree queue for it and only tested it on a 8 core M1. Work stealing scales much better for bigger hardware. It shouldn't be hard to beat the existing implementation.

Malebolgia as MASTER branch or this PR in malebolgia_spin.nim

image

Malebolgia as this PR in malebolgia_spin_doctors.nim

image

My view of how Work Stealing looks like,

image

... NOPE! :rofl:

Ok, to be fair, this representation is worse than real the worst scenario at some point in time:
Because if a thread is ~shifting~ poping from other it isn't shifting from itself, or all other at the same time. But for other graphs, I represented a superposition of all possible cases. :thinking:

Notes: Simplified to single producer/N consumers, in reality can have N producers/consumers I know that .L isn't the queue, but is what we have to wait to operate the queue.

Araq commented 11 months ago

How did you make these graphs? They are awesome!

However, as I said, work stealing doesn't even have to use locks at all and can be done with a lockfree queue. And more importantly, all the highly concurrent/parallel runtimes ended up doing work stealing so it's reasonable to assume that it simply is a very good mechanism.

hugosenari commented 11 months ago

How did you make these graphs? They are awesome!

Coggle.it

https://coggle.it/diagram/ZQDcj9E4196r1jqQ/t/t0 https://coggle.it/diagram/ZQFMXn170vn6vXB5/t/t0-2023-9-12-22-39 https://coggle.it/diagram/ZPKLGeut86Ir78iB/t/t0