[stdpar] Port SiPixelClusterizer plugin

esseivaju commented 1 year ago

Changes

Port Kernels in the SiPixelClusterizer plugin to stdpar
Update gpuClustering_t and added test for atomic operations

esseivaju commented 1 year ago

@makortel the test gpuClustering_t currently fails. The test is set up to go through the sequence of kernels countModules --> findClus --> clusterChargeCut multiple times with different clusters data. The first iteration of the 3 kernels work and returns the exact same result as the cudauvm implementation, however, on the second iteration the kernel, findClus will fail:

[esseivaj@zeus pixeltrack-standalone]$ test/stdpar/gpuClustering_t 
created 175031 digis in 8960 clusters
CUDA countModules kernel launch
CUDA findModules kernel launch with 897 modules
before charge cut found 8960 clusters
last module is 1799 10
found 897 Modules active
first clusters 0 11000 1 10
last cluster 1799009 10
found 8959 8959 clusters
last module is 1799 10
created 176031 digis in 9960 clusters
CUDA countModules kernel launch
CUDA findModules kernel launch with 898 modules
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

esseivaju commented 1 year ago

Attaching cuda-gdb on exception will output:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x103e8c0 (gpuClustering.h:179)

Thread 1 "gpuClustering_t" received signal CUDA_EXCEPTION_14, Warp Illegal Address.

We assert that the indices k and l accessed at line gpuClustering.h:179 are within bound of the array; similarly, we loop over p until we reach the last valid memory location so this line should not result in an out-of-bound memory access. cuda-memcheck will output the following error:

========= Out-of-range Shared or Local Address
=========     at 0x000000c0 in __cuda_syscall_mc_dyn_globallock_check

The problem seems to be independent of the data processed. Regardless of which index we start with (see dd6b0fd), which defines the data generated, the first iteration always succeeds and the second always fails. Arrays nn and nnn as well as hist are allocated within the std::for_each lambda on GPU stack so each thread (handling one module, which we have an order of ~900) is using more memory than in the cudauvm implementation where the hist was shared per thread block (nn and nnn were still allocated for each thread). If this is the case, I am not sure why it only fails after the second iteration.

makortel commented 1 year ago

About the assertions, in cuda those are disabled by default for the device code via https://github.com/cms-patatrack/pixeltrack-standalone/blob/7c9c98943fa078632431607ac6700c97ee6c20c4/src/stdpar/CUDACore/cuda_assert.h#L4-L16

I'm not sure how that gets mapped with nvc++. Maybe try to define GPU_DEBUG (e.g. along make stdpar USER_CXXFLAGS="-DGPU_DEBUG") to see if the assertions really pass?

esseivaju commented 1 year ago

@makortel Assertions are enabled; cuda_assert.h is not included in the test case and trying to inverse the assert condition (e.g. assert(k >= maxiter);) will trigger an exception:

/bld4/home/esseivaj/devel/pixeltrack-standalone/src/stdpar/plugin-SiPixelClusterizer/gpuClustering.h:161: _ZZN13gpuClustering8findClusEPKtS1_S1_PKjPjS4_PiiENKUlT_E_clIjEEDaS6_: block: [1,0,0], thread: [229,0,0] Assertion `k >= maxiter` failed.

makortel commented 1 year ago

Ok, there went that easy option. Maybe worth of merging regardless? (I didn't find anything to comment)

esseivaju commented 1 year ago

@makortel I think we can merge this for now. I'll focus on finishing porting the rest of the plugins first then fix the bugs that are still there.

cms-patatrack / pixeltrack-standalone

[stdpar] Port SiPixelClusterizer plugin #379

Changes