jkelling opened this issue 4 years ago
Similar to `AccOmp2Threads` and `AccOmp2Blocks`, we could call the current backend `AccCpuStdThreadThreads` and add a new one, `AccCpuStdThreadBlocks`.
However, it is not yet clear to me how this can improve things. Yes, we could simply execute multiple blocks in parallel. However, this restricts the block size to only a single thread. This is necessary because threads within a block can be explicitly synchronized/interrupted, which is not possible by default with `std::thread`s executing a function (in our case a kernel). We can not simply execute one kernel invocation after the other for all threads in a block.
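A minimal plain-C++ sketch of that constraint (hypothetical kernel body, not the alpaka API; C++20 `std::barrier` stands in for whatever primitive the backend would actually use): phase 2 reads data that a *different* thread produced in phase 1, so executing complete kernel invocations one after the other would break, while preemptive `std::thread`s plus a barrier work.

```cpp
#include <barrier>
#include <thread>
#include <vector>

// Hypothetical kernel body: phase 1 writes a shared-memory slot, phase 2
// reads a slot written by a *different* thread. Serializing the complete
// body, one thread after the other, would let thread 0 read slot 1 before
// thread 1 ever wrote it; a real barrier between the phases is required.
void blockThread(int tid, int blockSize, std::vector<int>& shm, std::barrier<>& sync)
{
    shm[tid] = tid * tid;                          // phase 1: produce
    sync.arrive_and_wait();                        // the role of syncBlockThreads()
    int fromNeighbor = shm[(tid + 1) % blockSize]; // phase 2: consume
    (void)fromNeighbor;
}

int main()
{
    int const blockSize = 4;
    std::vector<int> shm(blockSize);
    std::barrier<> sync(blockSize);
    std::vector<std::thread> block;
    for(int t = 0; t < blockSize; ++t)
        block.emplace_back(blockThread, t, blockSize, std::ref(shm), std::ref(sync));
    for(auto& th : block)
        th.join();
}
```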
I do not see how this would be faster than the current version. We are already using a thread-pool, so the threads are already not recreated for each block.
What we would need is a mix of preemptive multitasking (`std::thread`s) for the blocks and cooperative multitasking for the threads, so that we get the most benefit from caches because a single `std::thread` executes a whole block.
Accelerator Back-end | Execution strategy grid-blocks | Execution strategy block-threads
---|---|---
`AccCpuStdThreadThreads` | sequential | parallel (preemptive multitasking)
`AccCpuStdThreadBlocks` | parallel (preemptive multitasking) | sequential (only 1 thread per block)
`AccCpuFibers` | sequential | parallel (cooperative multitasking)
`Mixed` | parallel (preemptive multitasking) | parallel (cooperative multitasking)
I already proposed this in #22.
> However, it is not yet clear to me how this can improve things. Yes, we could simply execute multiple blocks in parallel. However, this restricts the block size to only a single thread.
Yes, having only one thread per block, like `AccOmp2Blocks`, is the idea. Alpaka kernels are probably mostly written with GPUs (i.e. the CUDA backend) in mind, which will often lead to one of the following two properties:
> I do not see how this would be faster than the current version. We are already using a thread-pool, so the threads are already not recreated for each block.
The two cases above are currently slow because: 1.
> What we would need is a mix of preemptive multitasking (`std::thread`s) for the blocks and cooperative multitasking for the threads, so that we get the most benefit from caches.
With the type of GPU codes I described above, one would ideally map blocks to OS threads and threads to SIMD lanes, because this would be closest to the GPU architecture. I know that this is not viable, because SIMD lacks some important bits of SIMT (i.e. `__syncthreads()` and scattered memory load/store). The best we can do is to map OS threads to blocks, ignore threads, and leave SIMD to the element layer.
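A hedged sketch of that mapping in plain C++ (the names `axpyBlock` etc. are illustrative, not the alpaka API): one OS thread executes one block, the block-thread level has size 1, and the element level becomes a contiguous inner loop that the auto-vectorizer can turn into SIMD.

```cpp
#include <cstddef>
#include <vector>

// One call == one block, run by one OS thread (the thread pool would
// dispatch these). The element layer is the unit-stride, branch-free
// inner loop below, which is an ideal auto-vectorization target.
void axpyBlock(std::size_t blockIdx, std::size_t elementsPerBlock,
               float alpha, float const* x, float* y)
{
    std::size_t const first = blockIdx * elementsPerBlock;
    for(std::size_t i = first; i < first + elementsPerBlock; ++i) // element layer
        y[i] += alpha * x[i];
}

int main()
{
    std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
    for(std::size_t b = 0; b < 4; ++b) // blocks would run in parallel
        axpyBlock(b, 256, 3.0f, x.data(), y.data());
}
```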
> I already proposed this in #22.

I do not know Boost.Fiber, but the issue could be the same.
Btw. OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (causing each other's caches to be flushed) as they are to profit from sharing cache lines.
> Btw. OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (causing each other's caches to be flushed) as they are to profit from sharing cache lines.
Yes, this is probably one of the main issues with the current version of `AccCpuThreads`. The exact opposite would be true for my `Mixed` proposal.
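To make the false-sharing point concrete, here is a hedged micro-benchmark sketch (iteration counts and the 64-byte cache-line assumption are illustrative):

```cpp
#include <thread>

// Two "block threads" update neighbouring ints. In Unpadded both counters
// live on the same cache line, so every write invalidates the other core's
// copy of that line; Padded separates them (alignas(64) assumes a 64-byte
// cache line). Swap Padded for Unpadded and time it to see the ping-pong.
struct Unpadded { int a = 0; int b = 0; };                       // shares one line
struct Padded { alignas(64) int a = 0; alignas(64) int b = 0; }; // separate lines

int main()
{
    Padded counters;
    std::thread t0([&] { for(int i = 0; i < 10'000'000; ++i) ++counters.a; });
    std::thread t1([&] { for(int i = 0; i < 10'000'000; ++i) ++counters.b; });
    t0.join();
    t1.join();
    return counters.a == counters.b ? 0 : 1;
}
```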
Are you suggesting to implement preemptive multitasking at the thread level in alpaka by hand? Maybe using coroutines?
Boost.Fiber is a library with an interface similar to `std::thread`, but based on Boost.Context, which implements cooperative multitasking by hand.
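Under that premise, a sketch of what the `Mixed` row could look like with Boost.Fiber (illustrative only, not the alpaka implementation): one preemptive OS thread per block, one cooperative fiber per block-thread, and a fiber barrier as the block synchronization point.

```cpp
#include <boost/fiber/all.hpp>
#include <thread>
#include <vector>

// Each grid-block gets one OS thread, so a block stays on one core and its
// data stays hot in that core's cache. Inside the block, each block-thread
// is a fiber; the fiber barrier suspends and resumes fibers without the OS
// scheduler, giving a cheap equivalent of syncBlockThreads().
int main()
{
    int const numBlocks = 2;
    int const blockSize = 4;

    std::vector<std::thread> blocks;
    for(int b = 0; b < numBlocks; ++b)
        blocks.emplace_back([blockSize] {
            boost::fibers::barrier sync(blockSize); // cooperative block sync
            std::vector<boost::fibers::fiber> fibers;
            for(int t = 0; t < blockSize; ++t)
                fibers.emplace_back([&sync, t] {
                    (void)t;     // block-thread index, would select this thread's work
                    // ... kernel code before syncBlockThreads() ...
                    sync.wait(); // suspends this fiber, resumes a sibling
                    // ... kernel code after syncBlockThreads() ...
                });
            for(auto& f : fibers)
                f.join();
        });
    for(auto& th : blocks)
        th.join();
}
```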
The Intel oneAPI compiler for CPUs solves the issue in the following way: when the compiler encounters a work-group, it launches its auto-vectorizer. I wonder if we can do something similar in alpaka, maybe by making better use of the element layer.
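One possible reading of that strategy, sketched in plain C++ (illustrative, not the actual oneAPI code generation): for a barrier-free kernel, the body is wrapped in a loop over the work-items of the group, and that loop is handed to the auto-vectorizer, so one work-item becomes one SIMD lane. The alpaka element layer could play the role of this inner loop.

```cpp
#include <cstddef>

// The loop over local IDs is what the vectorizer consumes: each iteration
// is one work-item, and iterations are independent because the kernel body
// contains no barrier.
void workGroup(std::size_t groupId, std::size_t groupSize, float* out)
{
    for(std::size_t localId = 0; localId < groupSize; ++localId) // vectorized
    {
        std::size_t const gid = groupId * groupSize + localId;
        out[gid] = static_cast<float>(gid) * 2.0f; // kernel body, no barrier
    }
}

int main()
{
    float out[256];
    for(std::size_t g = 0; g < 4; ++g) // one work-group per call
        workGroup(g, 64, out);
}
```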
`AccCpuThreads` is currently a bad showcase of C++11 threads, as it uses the sub-optimal strategy of spawning CPU threads at the thread instead of the block level, just like the equally useless `AccOmp2Threads`.
To escape Amdahl's law for as long as possible, you must parallelize at as high a level as possible. `AccCpuThreads` should do this.
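For reference (the numbers are assumed for illustration): with Amdahl's law, the speedup on n cores is bounded by the parallel fraction p, so per-thread synchronization overhead, which effectively shrinks p, costs far more than overhead inside the parallel part:

```latex
S(n) = \frac{1}{(1-p) + p/n},\qquad
S(16)\big|_{p=0.99} \approx 13.9,\qquad
S(16)\big|_{p=0.9} \approx 6.4
```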
I am aware that `AccCpuThreads` is, and will remain, only a demonstration, thus there will not be many resources to be spent on improving it. Just putting the issue here for reference, based on a discussion we had.