Implement device functions to simplify writing kernel code

alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:

https://alpaka.readthedocs.io

Mozilla Public License 2.0


Implement device functions to simplify writing kernel code #2337

Closed fwyzard closed 3 months ago

fwyzard commented 3 months ago

Implement device functions to simplify writing kernel code:

oncePerGrid(acc);
oncePerBlock(acc);
uniformElements(acc, size) and uniformElements(acc, begin, end);
uniformGroups(acc, size) and uniformGroupElements(acc, group, size);
independentGroups(acc, groups) and independentGroupElements(acc, group_size);
variations of these functions along different dimensions.

Implement tests for the most common functions.
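
For readers unfamiliar with this kind of helper, here is a minimal sketch of the idea behind a one-time guard and an element loop, written as plain host C++ so it runs without alpaka. It assumes the element loop behaves like a grid-strided loop; `ThreadModel`, `oncePerGridModel`, `stridedElements` and `visitCounts` are hypothetical names made up for this illustration, not the functions added by this PR, and the real implementations may differ (for example in how they handle several elements per thread or multiple dimensions).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the accelerator object; NOT an alpaka type.
struct ThreadModel {
    std::size_t globalThreadIdx;   // linear index of this thread in the grid
    std::size_t gridThreadExtent;  // total number of threads in the grid
};

// Model of a one-time guard: true for exactly one thread in the grid.
inline bool oncePerGridModel(ThreadModel const& t) {
    return t.globalThreadIdx == 0;
}

// Model of an element loop (assumed grid-strided): the indices in
// [begin, end) that a single thread would visit.
inline std::vector<std::size_t> stridedElements(ThreadModel const& t,
                                                std::size_t begin,
                                                std::size_t end) {
    std::vector<std::size_t> indices;
    for (std::size_t i = begin + t.globalThreadIdx; i < end; i += t.gridThreadExtent) {
        indices.push_back(i);
    }
    return indices;
}

// Run the loop for every thread of the grid (sequentially, on the host)
// and count how many times each element is visited.
inline std::vector<int> visitCounts(std::size_t gridThreadExtent, std::size_t size) {
    assert(gridThreadExtent > 0);
    std::vector<int> counts(size, 0);
    for (std::size_t tid = 0; tid < gridThreadExtent; ++tid) {
        for (std::size_t i : stridedElements({tid, gridThreadExtent}, 0, size)) {
            ++counts[i];
        }
    }
    return counts;
}
```

In this model every element of the range is visited exactly once regardless of how the grid size relates to the problem size, which is the property that lets kernel code drop hand-written index arithmetic and bounds checks.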

fwyzard commented 3 months ago

let's see if the 23rd time's a charm...

SimeonEhrig commented 3 months ago

I checked your code locally. The problem is the AccCpuThreads backend. On our workstation with a 32-core Epyc and a Clang 17 debug build, the test already runs for 100 s. I think the thread sanitizer annotations also add a big overhead, and the CI runners are weaker than our workstation.

I suggest disabling the tests with the std::thread backend or, if it makes sense to test the thread backend, decreasing the problem size.

fwyzard commented 3 months ago

I suggest disabling the tests with the std::thread backend or, if it makes sense to test the thread backend, decreasing the problem size.

I agree... I've tried reducing the block size in all tests to 32 threads (or elements) per block.
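
As a rough illustration of what a block size of 32 "threads (or elements)" means for the launch configuration, the sketch below computes the number of blocks with a ceiling division and models two ways to fill a block. `divideUp`, `WorkDivModel`, `threadsModel` and `elementsModel` are hypothetical helpers written for this example; they are not alpaka functions, and the choice between the two layouts per backend is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an alpaka function): ceiling division.
constexpr std::size_t divideUp(std::size_t value, std::size_t divisor) {
    return (value + divisor - 1) / divisor;
}

// Hypothetical record describing a 1D launch configuration.
struct WorkDivModel {
    std::size_t blocks;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Layout with blockSize threads per block and one element per thread.
constexpr WorkDivModel threadsModel(std::size_t size, std::size_t blockSize) {
    return {divideUp(size, blockSize), blockSize, 1};
}

// Layout with one thread per block that handles blockSize elements.
constexpr WorkDivModel elementsModel(std::size_t size, std::size_t blockSize) {
    return {divideUp(size, blockSize), 1, blockSize};
}

// Elements covered by a configuration; must be at least the problem size.
inline std::size_t capacity(WorkDivModel const& w) {
    assert(w.threadsPerBlock > 0 && w.elementsPerThread > 0);
    return w.blocks * w.threadsPerBlock * w.elementsPerThread;
}
```

Both layouts cover the same range with the same number of blocks; only the threads-based one creates 32 concurrent threads per block, so a smaller block size directly reduces how many threads a thread-based CPU backend has to manage at once.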

fwyzard commented 3 months ago

Closed in favour of https://github.com/alpaka-group/alpaka/pull/2369, which includes further fixes and cleanup.