Araq / malebolgia

Malebolgia creates new spawns. Structured concurrency. Thread pools and related APIs.
MIT License

Experiment: Nessie #21

Closed: hugosenari closed this issue 11 months ago

hugosenari commented 11 months ago

Disclaimer:

This is an experiment; I would not expect it to be approved, but I updated src/malebolgia.nim to make the diff clear. Please burn after reading.

Motivation:

Intrigued by the fact that spawn is expensive, I tried to reduce the lock time.

The initial idea was to use the same concept as parMap, that is, lock array ranges per worker, but it ended up as an array of locks.

Unexpected Outcome:

One expensive part of spawn turned out to be the signal call, not the lock, so I tried to reduce the number of signal calls too.

Results

My benchmark was the same as the one I did for #18: parMap with bulkSize 2, 5 tasks, 3 threads, executed 1000 times.

While this isn't always faster than #19, the times are steady.

As expected, T0E0 takes more time, because the pool is notified only after the channel is full or we run waitAll.

This also means T1E0-T0E1 is almost constant, though I was expecting it to be lower, since the pool will look for T0 and T1 at the same time.

Benchmark Flaws

I did not stress the thread pool: the channel can hold 16 tasks and I only added 3.

Suggestion

Revise the signalling strategy.

Araq commented 11 months ago

Tbh I have no idea what this array of locks tries to accomplish. The standard solution to speed things up further is "work stealing". I implemented a variation of it in malebolgia_push, but it was not faster on my benchmarks and was more complex.

Araq commented 11 months ago

"cause 100% of cpu usage with unused threads"

That's bad, but what is worse is that as soon as you remove the possibility of running things directly on the calling thread, you invite deadlocks for recursive fork&join algorithms.
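
For illustration, the classic recursive fork&join case, sketched against malebolgia's createMaster/awaitAll/spawn API (parFib is just an illustration, not code from this repo):

```nim
import malebolgia

proc parFib(n: int): int =
  # Every call blocks in `awaitAll` until both children finish. If all
  # pool threads are blocked here and `spawn` cannot fall back to running
  # a task directly on the calling thread, no worker is left to run the
  # children and the whole pool deadlocks.
  if n < 2: return n
  var a, b: int
  var m = createMaster()
  m.awaitAll:
    m.spawn parFib(n - 1) -> a
    m.spawn parFib(n - 2) -> b
  result = a + b

echo parFib(20)
```

The inline fallback makes the recursion bottom out on the calling thread whenever the pool is saturated, which an array of locks by itself does not give you.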

hugosenari commented 11 months ago

"Tbh I have no idea what this array of locks tries to accomplish."

Let's say Master had another name, like Global. Let's say its lock is Important. Then we name it GlobalImportantLock and, to abbreviate, we call it GIL...

That is obviously a joke, but you get the point.

Instead, let's say we have 16 locks/slots (let's call it a LockPool). We can only feed an emptySlot that is lockable, and we can only work on a taskSlot that is lockable.

Now we ask if lock[i] is lockable (tryAcquire), which costs 20~2000ns, while acquire costs from 25ns to infinity (we never know when or by whom it will be released).
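
A minimal sketch of that idea, so the mechanics are concrete (Task, tryFeed and tryWork are my names for this comment, not code from the patch):

```nim
import std/locks

const SlotCount = 16

type
  Task = proc () {.nimcall.}   # stand-in for the real task representation
  Slot = object
    lock: Lock
    hasTask: bool
    task: Task

var lockPool: array[SlotCount, Slot]

proc initLockPool =
  for s in lockPool.mitems: initLock s.lock

# Producer side: feed the first empty slot we can lock without blocking.
proc tryFeed(t: Task): bool =
  for i in 0 ..< SlotCount:
    if tryAcquire(lockPool[i].lock):   # bounded cost, unlike a blocking acquire
      if not lockPool[i].hasTask:
        lockPool[i].task = t
        lockPool[i].hasTask = true
        release lockPool[i].lock
        return true
      release lockPool[i].lock

# Worker side: take the first filled slot we can lock without blocking.
proc tryWork(t: var Task): bool =
  for i in 0 ..< SlotCount:
    if tryAcquire(lockPool[i].lock):
      if lockPool[i].hasTask:
        t = lockPool[i].task
        lockPool[i].hasTask = false
        release lockPool[i].lock
        return true
      release lockPool[i].lock
```

Both sides pay at most SlotCount tryAcquire calls and never block.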

While we are feeding the slots we could signal (20~2000ns) without holding the lock; I would do it every 4 items.

When we run out of slots to feed, we have to signal under the GIL lock. But even here we could think: if we get the lock, fine, we wait; otherwise, whoever holds it is probably announcing exactly what we are waiting for, so let's retry instead.
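
Continuing the sketch, the feeding side with that signalling strategy might look like this (masterLock/masterCond stand for the "GIL" and its condition variable; the batch size 4 is the one mentioned above):

```nim
import std/locks

var
  masterLock: Lock            # the "GIL" of the joke above
  masterCond: Cond            # idle workers wait on this
  fedCount {.threadvar.}: int

initLock masterLock
initCond masterCond

proc schedule(t: Task) =
  while not tryFeed(t):
    # Out of slots: try to wake the workers. If the lock is taken, someone
    # is already signalling, so retry the feed instead of blocking in acquire.
    if tryAcquire(masterLock):
      signal masterCond
      release masterLock
  inc fedCount
  if fedCount mod 4 == 0:
    signal masterCond         # batched wake-up, every 4 items, no lock taken
```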

From the other side (the workers) it is the same.

We have a ThreadPool because of the cost of waiting for thread creation (52781~76092ns). Now we have a LockPool because of the cost of waiting for acquire.

I tried to implement it in the file malebolgia_gilson.nim, with some optimizations like checking the value before trying the lock, since a false positive and moving to the next slot is less expensive than tryAcquire.
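
That value pre-check turns tryWork from the sketch above into something like:

```nim
proc tryWorkChecked(t: var Task): bool =
  for i in 0 ..< SlotCount:
    # Unsynchronized peek: a stale read only costs a wasted tryAcquire
    # or a skipped slot that the next scan will pick up.
    if not lockPool[i].hasTask: continue
    if tryAcquire(lockPool[i].lock):
      if lockPool[i].hasTask:            # re-check under the lock
        t = lockPool[i].task
        lockPool[i].hasTask = false
        release lockPool[i].lock
        return true
      release lockPool[i].lock
```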

Araq commented 11 months ago

As I said, it seems to me that you want to implement work stealing. It can be done lock-free and has the potential to produce much better performance numbers. My attempts have failed so far, but all the production thread pools in Rust, C#, etc. use it.
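
The shape is roughly the following; production pools use a lock-free Chase-Lev deque per worker, the per-deque lock here is only to keep the sketch short:

```nim
import std/[locks, deques]

const Workers = 4

type
  Task = proc () {.nimcall.}
  WorkQueue = object
    lock: Lock
    jobs: Deque[Task]

var queues: array[Workers, WorkQueue]

proc initQueues =
  for q in queues.mitems: initLock q.lock

# The owner pushes and pops at the back of its own deque (LIFO keeps
# caches warm); thieves take from the front (the oldest, typically
# biggest, piece of work).
proc push(me: int; t: Task) =
  withLock queues[me].lock:
    queues[me].jobs.addLast t

proc popOwn(me: int; t: var Task): bool =
  withLock queues[me].lock:
    if queues[me].jobs.len > 0:
      t = queues[me].jobs.popLast()
      return true

proc trySteal(me: int; t: var Task): bool =
  for victim in 0 ..< Workers:
    if victim == me: continue
    withLock queues[victim].lock:
      if queues[victim].jobs.len > 0:
        t = queues[victim].jobs.popFirst()
        return true
```

The owner touches only its own deque in the common case, so contention shows up only when somebody actually has to steal.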

hugosenari commented 11 months ago

Closing this; it was an experiment anyway.

I spotted a bug in it: since I'm not using a queue, acquireSlot/acquireTask always return the first available slot, which means they could return i=0 even if another task at i=1 was scheduled earlier.
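
A hypothetical fix, staying with the slot array: stamp each slot with a monotonically increasing ticket when it is fed and have workers prefer the smallest ticket (tryOldestWork is invented naming, not from the patch):

```nim
import std/[locks, atomics]

type
  Task = proc () {.nimcall.}
  Slot = object
    lock: Lock
    hasTask: bool
    ticket: int   # the producer stamps this: lockPool[i].ticket = nextTicket.fetchAdd(1)
    task: Task

const SlotCount = 16
var
  lockPool: array[SlotCount, Slot]
  nextTicket: Atomic[int]

proc tryOldestWork(t: var Task): bool =
  # Racy scan for the smallest ticket; the result is verified under the lock.
  var best = -1
  var bestTicket = high(int)
  for i in 0 ..< SlotCount:
    if lockPool[i].hasTask and lockPool[i].ticket < bestTicket:
      best = i
      bestTicket = lockPool[i].ticket
  if best >= 0 and tryAcquire(lockPool[best].lock):
    if lockPool[best].hasTask and lockPool[best].ticket == bestTicket:
      t = lockPool[best].task
      lockPool[best].hasTask = false
      release lockPool[best].lock
      return true
    release lockPool[best].lock
```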

Araq commented 11 months ago

Once again, please give us work stealing! ;-)