Added support for 128 residual (R) channels with S=A=256. Persistent mode now uses a constant number of blocks for softmax, which reduces the total number of blocks. Added a CUDA device selection option.
A major change to note is that the "launch_" family of functions has been changed to functors in order to support partial template specialization.
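For context, C++ allows partial specialization of class templates but not of function templates, which is why a function would be turned into a functor here. The sketch below illustrates that pattern only; the names `launch_example` and `run_example`, the template parameters, and the returned strings are hypothetical and are not the project's actual "launch_" interface.

```cpp
#include <string>

// Hypothetical stand-in for a launcher; not the project's real API.
// Primary template: generic path for any (R, S, A) combination.
template <typename T, int R, int S, int A>
struct launch_example {
    std::string operator()() const { return "generic"; }
};

// Partial specialization: R, S and A are fixed while T stays generic.
// A function template could not be specialized this way.
template <typename T>
struct launch_example<T, 128, 256, 256> {
    std::string operator()() const { return "specialized R=128, S=A=256"; }
};

// Thin wrapper that constructs the functor and calls it.
template <typename T, int R, int S, int A>
std::string run_example() {
    return launch_example<T, R, S, A>()();
}
```

Calling `run_example<float, 128, 256, 256>()` selects the partial specialization for any element type `T`, while every other channel configuration falls back to the primary template.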