alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Make AccCpuThreads parallelize over blocks #1135

Open jkelling opened 4 years ago

jkelling commented 4 years ago

AccCpuThreads is currently a bad showcase of C++11 threads, as it uses the sub-optimal strategy of spawning CPU threads at the thread level instead of the block level, just like the equally useless AccOmp2Threads.

To escape Amdahl's law for as long as possible, you must parallelize at as high a level as possible. AccCpuThreads should do this.

I am aware that AccCpuThreads is, and will remain, only a demonstration, so there will not be many resources to spend on improving it. I am just putting the issue here for reference, based on a discussion we had.
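For reference, a minimal sketch (assumed names, not alpaka internals) of the requested strategy: a pool of OS threads pulls whole blocks from a shared counter and runs each block's threads sequentially, so all parallelism sits at the block level.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical grid executor: parallel over blocks, sequential inside a block.
void executeGrid(int gridBlocks, int blockThreads,
                 std::function<void(int /*block*/, int /*thread*/)> const& kernel)
{
    unsigned const nWorkers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<int> nextBlock{0};
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < nWorkers; ++w)
        workers.emplace_back([&] {
            // Each worker grabs one whole block at a time ...
            for (int b; (b = nextBlock.fetch_add(1)) < gridBlocks;)
                // ... and runs that block's threads as a plain loop. This is
                // only valid for kernels that never synchronize block-threads.
                for (int t = 0; t < blockThreads; ++t)
                    kernel(b, t);
        });
    for (auto& w : workers)
        w.join();
}

int main()
{
    executeGrid(8, 4, [](int b, int t) { std::printf("block %d thread %d\n", b, t); });
}
```

This is essentially what AccOmp2Blocks does with an OpenMP parallel loop over the grid blocks.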

BenjaminW3 commented 4 years ago

Similar to AccOmp2Threads and AccOmp2Blocks, we could call the current back-end AccCpuStdThreadThreads and add a new one, AccCpuStdThreadBlocks.

However, it is not yet clear to me how this can improve things. Yes, we could simply execute multiple blocks in parallel. However, this restricts the block size to a single thread. This is necessary because threads within a block can be explicitly synchronized/interrupted, which is not possible by default with std::threads executing a function (in our case a kernel). We cannot simply execute one kernel invocation after the other for all threads in a block (see the sketch below). I do not see how this would be faster than the current version. We are already using a thread pool, so threads are not recreated for each block anyway.
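To make the barrier problem concrete, here is a small self-contained sketch (illustrative names, not alpaka's API) of a two-phase kernel. With a real block barrier, every thread finishes phase 1 before any thread starts phase 2; serializing the threads instead interleaves the phases, so most neighbour reads see stale data:

```cpp
#include <cstdio>
#include <vector>

// Two-phase "kernel": write own slot, barrier, read the neighbour's slot.
void kernelBody(int tid, int blockDim, std::vector<int>& shared)
{
    shared[tid] = tid; // phase 1: publish own value
    // syncBlockThreads() would go here on a real accelerator
    int const neighbour = shared[(tid + 1) % blockDim]; // phase 2
    std::printf("thread %d read %d\n", tid, neighbour);
}

int main()
{
    int const blockDim = 4;
    std::vector<int> shared(blockDim, -1); // -1 marks "not yet written"
    // Naive serialization of the block's threads: thread 0 reads slot 1
    // before thread 1 ever ran, so it sees -1 instead of 1.
    for (int t = 0; t < blockDim; ++t)
        kernelBody(t, blockDim, shared);
}
```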

What we would need is a mix of preemptive multitasking (std::thread) for the blocks and cooperative multitasking (e.g. fibers) for the threads, so that we get the most benefit from caches because a single std::thread executes a whole block.

| Accelerator Back-end | Execution strategy grid-blocks | Execution strategy block-threads |
| --- | --- | --- |
| AccCpuStdThreadThreads | sequential | parallel (preemptive multitasking) |
| AccCpuStdThreadBlocks | parallel (preemptive multitasking) | sequential (only 1 thread per block) |
| AccCpuFibers | sequential | parallel (cooperative multitasking) |
| Mixed | parallel (preemptive multitasking) | parallel (cooperative multitasking) |

I already proposed this in #22.
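A minimal sketch of what the Mixed row could look like, assuming Boost.Fiber (not existing alpaka code): OS threads provide preemptive parallelism over blocks, while the block-threads inside each OS thread are fibers, so a block barrier can map to a cheap fiber barrier:

```cpp
#include <boost/fiber/all.hpp>
#include <cstdio>
#include <thread>
#include <vector>

// One OS thread executes one whole block; its block-threads are fibers.
void runBlock(int block, int blockThreads)
{
    boost::fibers::barrier barrier(blockThreads); // stands in for syncBlockThreads()
    std::vector<boost::fibers::fiber> fibers;
    for (int t = 0; t < blockThreads; ++t)
        fibers.emplace_back([=, &barrier] {
            std::printf("block %d thread %d: phase 1\n", block, t);
            barrier.wait(); // suspends this fiber; the scheduler resumes the next
            std::printf("block %d thread %d: phase 2\n", block, t);
        });
    for (auto& f : fibers)
        f.join();
}

int main()
{
    std::vector<std::thread> blocks; // preemptive parallelism over blocks
    for (int b = 0; b < 4; ++b)
        blocks.emplace_back(runBlock, b, 2);
    for (auto& b : blocks)
        b.join();
}
```

This needs linking against Boost.Fiber and Boost.Context (-lboost_fiber -lboost_context); whether the fiber scheduler's per-block overhead is acceptable would need measuring.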

jkelling commented 4 years ago

> However, it is not yet clear to me how this can improve things. Yes, we could simply execute multiple blocks in parallel. However, this restricts the block size to a single thread.

Yes, having only one thread per block, like AccOmp2Blocks, is the idea. Alpaka kernels are probably mostly written with GPUs (i.e. the CUDA back-end) in mind, which will often lead to one of the following two properties:

  1. a small workload per thread
  2. synchronization between threads used for collective operations which cater to the GPU memory architecture/SIMT

> I do not see how this would be faster than the current version. We are already using a thread pool, so threads are not recreated for each block anyway.

The two cases above are currently slow because: (1) when the workload per thread is small, the overhead of dispatching and scheduling one OS thread per work-item dominates the useful work, and (2) every block-wide synchronization becomes an expensive OS-level barrier across those threads, instead of the cheap hardware barrier it emulates on a GPU.

> What we would need is a mix of preemptive multitasking (std::thread) for the blocks and cooperative multitasking (e.g. fibers) for the threads, so that we get the most benefit from caches.

With the type of GPU code I described above, one would ideally map blocks to OS threads and threads to SIMD lanes, because this would be closest to the GPU architecture. I know that this is not viable, because SIMD lacks some important bits of SIMT (i.e. __syncthreads() and scattered memory loads/stores). The best we can do is to map OS threads to blocks, ignore threads, and leave SIMD to the element layer.

> I already proposed this in #22.

I do not know Boost.Fiber, but the issue could be the same.

jkelling commented 4 years ago

Btw. OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (causing each other's cache lines to be flushed) as they are to profit from sharing cache lines.
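A quick way to see this effect outside of alpaka (illustrative micro-benchmark, assuming 64-byte cache lines): two OS threads hammer adjacent counters that share one cache line, then counters padded onto separate lines. On common x86 CPUs the padded variant is typically several times faster:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

struct Packed { long a; long b; };                          // one shared cache line
struct Padded { alignas(64) long a; alignas(64) long b; };  // one line per counter

template<typename Counters>
long long runMs()
{
    Counters c{};
    auto work = [](long& x) {
        long volatile* p = &x; // volatile keeps the stores in the loop
        for (long i = 0; i < 100'000'000; ++i)
            *p = *p + 1;
    };
    auto const start = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a));
    std::thread t2(work, std::ref(c.b));
    t1.join();
    t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - start)
        .count();
}

int main()
{
    std::printf("false sharing: %lld ms\n", runMs<Packed>());
    std::printf("padded:        %lld ms\n", runMs<Padded>());
}
```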

BenjaminW3 commented 4 years ago

> Btw. OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (causing each other's cache lines to be flushed) as they are to profit from sharing cache lines.

Yes, this is probably one of the main issues with the current version of AccCpuThreads. The exact opposite would be true for my Mixed proposal.

jkelling commented 4 years ago

> > Btw. OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (causing each other's cache lines to be flushed) as they are to profit from sharing cache lines.
>
> Yes, this is probably one of the main issues with the current version of AccCpuThreads. The exact opposite would be true for my Mixed proposal.

Are you suggesting to implement cooperative multitasking at the thread level in alpaka by hand? Maybe using coroutines?

BenjaminW3 commented 4 years ago

Boost.Fiber is a library with an interface similar to std::thread, but based on Boost.Context, which implements cooperative multitasking by hand.
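For readers unfamiliar with it, a tiny example of the cooperative model Boost.Fiber provides (nothing alpaka-specific here): two fibers share one OS thread and interleave only at explicit yield points:

```cpp
#include <boost/fiber/all.hpp>
#include <cstdio>

int main()
{
    auto const worker = [](char const* name) {
        for (int i = 0; i < 3; ++i) {
            std::printf("%s step %d\n", name, i);
            boost::this_fiber::yield(); // hand control to the other fiber
        }
    };
    boost::fibers::fiber f1(worker, "f1");
    boost::fibers::fiber f2(worker, "f2");
    f1.join();
    f2.join();
}
```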

j-stephan commented 2 years ago

The Intel oneAPI compiler for CPUs solves the issue in the following way:

  1. Each SYCL work-group (= block) is executed by a single CPU thread.
  2. SYCL work-items (= threads) are mapped to SIMD lanes.

So when the compiler encounters a work-group, it applies its auto-vectorizer to the work-item dimension. I wonder if we can do something similar in alpaka, maybe by making better use of the element layer.
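A rough C++ analogue of that mapping (a simplified assumption, not the oneAPI implementation): the work-group body is one function executed by a single OS thread, and the work-item dimension becomes a unit-stride inner loop the auto-vectorizer can turn into SIMD instructions, which is essentially what alpaka's element layer expresses:

```cpp
#include <cstddef>

// One call = one work-group, executed by a single CPU thread.
void workGroup(float* out, float const* in, std::size_t groupStart,
               std::size_t groupSize)
{
    // Work-item loop: unit stride and no cross-iteration dependences, so
    // compilers typically vectorize it at -O2/-O3.
    for (std::size_t item = 0; item < groupSize; ++item)
        out[groupStart + item] = 2.0f * in[groupStart + item];
}
```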