Propagate the substitutions defined in cudaCompat.h directly into the source code of the serial variant. This makes the code easier to read as "serial" (non-CUDA) code.
Most of the changes were made programmatically using comby (https://comby.dev) and cleaned up with clang-format.
The CUDA-specific code (everything under the __CUDA_ARCH__ preprocessor define, and all of radixSort.h) was removed manually.
The comby match/rewrite patterns are stored in configuration files (.toml suffix). Each configuration file is applied individually with a command such as:
comby -i -config remove_syncthreads.toml -matcher .c -d pixeltrack-standalone/src/serial
Replace gridDim, blockDim, threadIdx, blockIdx. Some of the patterns are more involved so that the resulting expressions come out simplified. Some of these expressions were assigned to an auto variable, which needs a more specific type once the expression has been simplified to a number.
Remove __syncthreads, __threadfence, etc.
Remove __ldg
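The index replacement and the auto issue can be illustrated with a minimal sketch. It assumes the serial compatibility definitions amount to one block with one thread (threadIdx and blockIdx equal to 0, blockDim and gridDim equal to 1). The compat namespace and the scaleBefore/scaleAfter functions are hypothetical and are not taken from the branch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the CUDA built-ins, assuming a single block
// with a single thread (an assumption, not copied from cudaCompat.h).
namespace compat {
  struct dim3 {
    uint32_t x, y, z;
  };
  const dim3 threadIdx = {0, 0, 0};
  const dim3 blockIdx = {0, 0, 0};
  const dim3 blockDim = {1, 1, 1};
  const dim3 gridDim = {1, 1, 1};
}  // namespace compat

// Before: the loop is still written in terms of the CUDA index variables.
void scaleBefore(const std::vector<float>& in, std::vector<float>& out, uint32_t n) {
  using namespace compat;
  assert(in.size() >= n && out.size() >= n);
  // 'auto' deduces uint32_t here because every operand is a uint32_t.
  auto first = blockIdx.x * blockDim.x + threadIdx.x;
  for (auto i = first; i < n; i += gridDim.x * blockDim.x) {
    out[i] = 2.0f * in[i];
  }
}

// After: the index expressions are replaced by their serial values.
void scaleAfter(const std::vector<float>& in, std::vector<float>& out, uint32_t n) {
  assert(in.size() >= n && out.size() >= n);
  // 'auto first = 0;' would deduce int and make 'i < n' a signed/unsigned
  // comparison, so the type is spelled out explicitly.
  uint32_t first = 0;
  for (uint32_t i = first; i < n; ++i) {
    out[i] = 2.0f * in[i];
  }
}
```

Both functions produce the same output; the second reads as an ordinary serial loop.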
A few remaining CUDA index references (to the z component) were removed manually.
The atomic calls were left in place as markers showing where atomic operations are needed in any parallel code based on this branch.
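As an illustration of why the atomic calls are useful markers, here is a minimal sketch of a serial stand-in for atomicAdd used in a histogram fill. The serial_sketch namespace, the atomicAdd signature, and fillHistogram are assumptions made for this example and are not copied from the branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace serial_sketch {

  // Hypothetical serial replacement for CUDA's atomicAdd: a plain
  // read-modify-write that returns the previous value, as the CUDA call does.
  template <typename T>
  T atomicAdd(T* address, T value) {
    T old = *address;
    *address += value;
    return old;
  }

  // Count how many entries fall into each bin.
  std::vector<uint32_t> fillHistogram(const std::vector<uint32_t>& bins, uint32_t nbins) {
    std::vector<uint32_t> counts(nbins, 0);
    for (std::size_t i = 0; i < bins.size(); ++i) {
      uint32_t bin = bins[i];
      assert(bin < nbins);
      // In serial code this is an ordinary increment. In a parallel port,
      // this call marks the update that has to become a real atomic.
      atomicAdd(&counts[bin], uint32_t{1});
    }
    return counts;
  }

}  // namespace serial_sketch
```

Keeping the call, rather than rewriting it to `++counts[bin]`, preserves the information about which updates can race once the loop is parallelized.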
The configuration files are attached as serial_rewrite.tar.gz.
In addition to the changes listed above, the __host__, __device__, __global__, __shared__, and __forceinline__ attributes were removed.