[WIP] Add a conditional definition of AccThreadSeq for the Omp4 backend

sbastrakov commented 4 years ago

This is now consistent with the Omp2Blocks backend, fixes #156.

fwyzard commented 4 years ago

This doesn't harm, but it doesn't help my test case either.

Before:

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 123.52 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 42.24 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 52493 us

After:

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128.16 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 42.3 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 52445.2 us

While using alpaka directly:

Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 132.66 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 41.14 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 391.15 us

sbastrakov commented 4 years ago

Thanks for testing @fwyzard . @psychocoderHPC do you have an idea? I feel the change fixes an actual shortcoming, but perhaps this is just another thing unrelated to the issue.

sbastrakov commented 4 years ago

Ah, the IsThreadSeqAcc trait was also not specialized for Omp4, will add it to the PR.

sbastrakov commented 4 years ago

Updated the PR, maybe now it works.

fwyzard commented 4 years ago

It does:

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 137.34 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 42.27 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 354.39 us

Thank you.

psychocoderHPC commented 4 years ago

I will check this PR soon. I need to check if swapping is allowed for the OMP4 backend. Swapping block size and elements is more than less a workaround to be allowed to use alpaka backends which restrict the block to one thread. The OMP4 backend do not have this restriction. Currently the swapping is only providing compatibility and not aims to increase the performance. In general there should be no argument to use this feature to deliver better performance.

fwyzard commented 4 years ago

I see... then I guess there is an underlying problem in the OpenMP 4 backend in alpaka, that leads to very poor performance.

sbastrakov commented 4 years ago

I will close the PR for now as, although it provided a workaround for the issue initiating it, there are other ways to achieve it and the PR does not align well with the behavior for other accelerators. The reasoning is given in the discussion of #156.

alpaka-group / cupla

[WIP] Add a conditional definition of AccThreadSeq for the Omp4 backend #157