Currently we use TeamThreadRange to distinguish elements inside elemental kernels, which limits the number of threads per block to 32 for 5 x 5 quadrature in 2D. A better approach would be to use ThreadVectorRange to distinguish elements, which would allow an arbitrary number of threads per block.
Theoretical occupancy would increase from 50% to 100% (not taking register usage into account).
It might also allow me to break the kernel up to reduce register usage.
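
For reference, here is a rough sketch of where the 50% and 100% figures could come from. The per-SM limits (2048 resident threads, 32 resident blocks) are my assumption about the target GPU and are not stated above; the names `SmLimits`, `kAssumedLimits` and `theoretical_occupancy` are made up for illustration. Register and shared memory limits are ignored, as in the estimate above.

```cpp
#include <algorithm>

// Hypothetical per-SM hardware limits; the values below are assumptions,
// adjust them for the actual device.
struct SmLimits {
    int max_threads_per_sm;  // maximum resident threads per SM
    int max_blocks_per_sm;   // maximum resident blocks per SM
};

constexpr SmLimits kAssumedLimits{2048, 32};

// Theoretical occupancy from the thread and block limits only.
// Assumes threads_per_block is a positive multiple of the warp size.
constexpr double theoretical_occupancy(int threads_per_block, SmLimits lim) {
    // Blocks that fit if only the thread limit mattered.
    const int blocks_by_threads = lim.max_threads_per_sm / threads_per_block;
    // The block limit can cap residency before the thread limit does.
    const int resident_blocks = std::min(blocks_by_threads, lim.max_blocks_per_sm);
    return static_cast<double>(resident_blocks * threads_per_block) /
           lim.max_threads_per_sm;
}

// 32 threads per block: capped at 32 blocks, so 1024 of 2048 threads.
static_assert(theoretical_occupancy(32, kAssumedLimits) == 0.5, "50%");
// 256 threads per block: 8 blocks fill all 2048 thread slots.
static_assert(theoretical_occupancy(256, kAssumedLimits) == 1.0, "100%");
```

Under these assumed limits, a block of 32 threads hits the block-count cap before the thread cap, which is why larger blocks would be needed to reach full theoretical occupancy.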