facebookresearch / TensorComprehensions

A domain specific language to express machine learning workloads.
https://facebookresearch.github.io/TensorComprehensions/

Properly evaluate the scratchpad size necessary for CUB reductions #300

Status: Open · opened by ftynse 6 years ago

ftynse commented 6 years ago

For parallel reductions, CUB requires a block of shared memory, which is wrapped in an opaque object. The shared memory promotion mechanism needs to know the amount of shared memory that remains available. Currently, we bluntly subtract 4096 bytes from the GPU's shared memory capacity for each reduction we use. Some reductions apparently require more space than that, which makes our kernels request more shared memory than is physically available. This manifests as either "invalid ptx" or "launch out of resources" errors. #299 allows the autotuner to survive these errors, but they could be avoided entirely if the size of the reduction scratchpad were known.
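For context, a minimal sketch (illustrative only, not TC's actual generated code) of how the opaque scratchpad enters a kernel:

```cuda
#include <cub/cub.cuh>

// TempStorage is an opaque union whose size depends on the reduction
// algorithm CUB selects for the target architecture and block size,
// so a fixed 4096-byte estimate can be wrong in either direction.
template <int BLOCK_THREADS>
__global__ void reduceKernel(const float* in, float* out) {
  using BlockReduce = cub::BlockReduce<float, BLOCK_THREADS>;
  // The shared memory block we would like to size precisely.
  __shared__ typename BlockReduce::TempStorage tempStorage;

  float v = in[blockIdx.x * BLOCK_THREADS + threadIdx.x];
  float sum = BlockReduce(tempStorage).Sum(v);  // aggregate valid in thread 0
  if (threadIdx.x == 0) out[blockIdx.x] = sum;
}
```

Note that `sizeof(typename BlockReduce::TempStorage)` gives the exact size, but only once nvcc instantiates the template for a concrete architecture and block size; the promotion heuristic needs the number before that compilation happens, which is exactly the gap described above.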

The logic for shared memory allocation inside CUB is incredibly complicated because it selects different algorithms depending on the available architectural features. It is implemented entirely in templates, which makes it impossible to extract the size computation into our runtime without reimplementing it, and keeping such a reimplementation on par with CUB as it evolves seems hard. One possible solution is to emit the code (or only the reduction part) without shared memory promotion, compile it to PTX, and parse the PTX to collect the required amount of shared memory. This may add noticeable compilation-time overhead, though.
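A rough sketch of the PTX-parsing idea, assuming the common form nvcc emits for statically allocated shared memory, e.g. `.shared .align 8 .b8 name[4224];` (the function name here and the restriction to single-line `.b8` byte arrays are simplifications for illustration):

```cuda
#include <cstddef>
#include <cstdio>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>

// Sum the sizes of all statically declared .shared arrays in a PTX module.
// Simplifying assumptions: declarations are .b8 byte arrays on one line, as
// nvcc typically emits them; dynamically sized ".extern .shared" arrays and
// other element types are not handled.
size_t staticSharedBytesInPTX(const std::string& ptx) {
  static const std::regex decl(
      R"(\.shared\s+(?:\.align\s+\d+\s+)?\.b8\s+[\w$]+\[(\d+)\];)");
  size_t total = 0;
  for (std::sregex_iterator it(ptx.begin(), ptx.end(), decl), end;
       it != end; ++it) {
    total += std::stoul((*it)[1].str());
  }
  return total;
}

int main() {
  // Read a PTX module from stdin and report its static shared memory usage.
  std::stringstream ss;
  ss << std::cin.rdbuf();
  std::printf("static .shared bytes: %zu\n", staticSharedBytesInPTX(ss.str()));
  return 0;
}
```

An alternative that avoids text parsing would be to JIT-load the PTX with `cuModuleLoadData` and query `cuFuncGetAttribute` with `CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES`, which reports the kernel's statically allocated shared memory; either way, the extra compilation pass is where the overhead mentioned above would come from.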

nicolasvasilache commented 6 years ago

cc @vinodgro @jdemouth @jdemouth-nvidia