denisalevi opened this issue 3 years ago.
See my explanations in #266. We use 36 registers per thread, which means we can't run 2048 threads per SM because of the register limit per SM: the 65536 registers of an SM only cover 2048 threads if each thread uses at most 32 registers. Hence we use fewer than 1024 threads per block, which leads to a lower theoretical occupancy.
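A rough sketch of that register arithmetic. The per-SM limits below are assumed values for compute capability 6.1 (the MX150), and the per-partition register model is a simplification modeled on NVIDIA's occupancy calculator (`cuda_occupancy.h`), not the code Brian2CUDA runs. `regs_per_warp`, `max_warps_by_regs` and `theoretical_occupancy` are hypothetical helpers that ignore shared memory and per-block register limits; for exact numbers, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the authoritative source.

```python
import math

# Assumed per-SM limits for compute capability 6.1 (e.g. GeForce MX150).
REGS_PER_SM = 65536       # 32-bit registers per SM
MAX_WARPS_PER_SM = 64     # 2048 resident threads per SM
MAX_BLOCKS_PER_SM = 32
WARP_SIZE = 32
REG_ALLOC_UNIT = 256      # registers are reserved per warp in chunks of 256
NUM_SUBPARTITIONS = 4     # assumed: register file split into 4 equal partitions


def regs_per_warp(regs_per_thread: int) -> int:
    """Registers one warp reserves, rounded up to the allocation unit."""
    return math.ceil(regs_per_thread * WARP_SIZE / REG_ALLOC_UNIT) * REG_ALLOC_UNIT


def max_warps_by_regs(regs_per_thread: int) -> int:
    """Warps per SM that fit into the register file (per-partition model)."""
    regs_per_partition = REGS_PER_SM // NUM_SUBPARTITIONS
    warps_per_partition = regs_per_partition // regs_per_warp(regs_per_thread)
    return warps_per_partition * NUM_SUBPARTITIONS


def theoretical_occupancy(block_size: int, regs_per_thread: int) -> float:
    """Active warps per SM divided by the maximum number of warps per SM."""
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    blocks = min(
        max_warps_by_regs(regs_per_thread) // warps_per_block,  # register limit
        MAX_WARPS_PER_SM // warps_per_block,                    # thread limit
        MAX_BLOCKS_PER_SM,                                      # block limit
    )
    return blocks * warps_per_block / MAX_WARPS_PER_SM
```

Under these assumptions, `theoretical_occupancy(1024, 32)` is 1.0, while with 36 registers 1024-thread blocks only reach 0.5 (one block per SM), and the best any block size reaches is 0.75 (48 of 64 warps, e.g. two 768-thread blocks).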
The occupancy value is the theoretical occupancy per SM, so it is completely independent of the number of blocks. But to actually use all SMs fully, one would need 6 blocks here, since the MX150 has 3 SMs that can run 2 blocks each.
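To see why the per-SM number says nothing about the grid, here is a small sketch that counts how many block slots a grid fills on the whole device. It assumes equally sized blocks that take equally long and a scheduler that spreads them evenly over the SMs (CUDA does not guarantee a placement order). `GridFill` and `grid_fill` are hypothetical helpers; the MX150 figures (3 SMs, 2 blocks per SM) come from the comment above.

```python
import math
from dataclasses import dataclass


@dataclass
class GridFill:
    slots: int             # blocks that can be resident on the whole device at once
    waves: int             # rounds of resident blocks needed to run the whole grid
    last_wave_blocks: int  # blocks running in the final round
    busy_fraction: float   # fraction of slots busy in the first round


def grid_fill(num_blocks: int, num_sms: int, blocks_per_sm: int) -> GridFill:
    """Device-wide slot usage; blocks_per_sm comes from the per-SM occupancy."""
    slots = num_sms * blocks_per_sm
    waves = math.ceil(num_blocks / slots)
    last_wave_blocks = num_blocks - (waves - 1) * slots
    busy_fraction = min(num_blocks, slots) / slots
    return GridFill(slots, waves, last_wave_blocks, busy_fraction)


# MX150 numbers from the comment above: 3 SMs, 2 blocks per SM.
for n in (5, 6, 7):
    print(n, grid_fill(n, num_sms=3, blocks_per_sm=2))
```

With 5 blocks only 5 of the 6 slots are busy, so one SM runs a single block; with 7 blocks the grid needs a second round that holds just one block. In both cases the per-SM theoretical occupancy stays the same, because `blocks_per_sm` is an input here and does not depend on `num_blocks`.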
TODO: Modify the info message to say "theoretical occupancy per SM", to make this distinction clearer.
For the following example, the stateupdater doesn't achieve full occupancy on my laptop GPU (MX150). Why? Is this a GPU resource limitation, or is something going wrong in the occupancy calculation?
This gives
Why do we use 7 blocks for the stateupdater? How do we get 100% occupancy with only 5 blocks for the thresholder and resetter if the occupancy calculation says that we need 6 blocks?

To get the `(need 6 blocks for 1.000)` output, I printed the `min_num_threads` variables (which should be called `min_num_blocks`...).
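If `min_num_threads` holds the first output of a call like CUDA's `cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel, ...)` (an assumption; the issue doesn't show the call), then it is indeed a block count: the minimum grid size needed to reach the best occupancy. The sketch below is a simplified imitation of that search: it tries block sizes from large to small, keeps the one with the most active warps per SM, and reports blocks per SM times the number of SMs as the minimum grid size. `max_potential_block_size` and the `blocks_per_sm` callables in the usage example are hypothetical, and the limits are assumed values for compute capability 6.1.

```python
import math
from typing import Callable, Tuple

WARP_SIZE = 32
MAX_WARPS_PER_SM = 64  # assumed: 2048 threads per SM (compute capability 6.1)


def max_potential_block_size(
    blocks_per_sm: Callable[[int], int],
    num_sms: int,
    max_block_size: int = 1024,
) -> Tuple[int, int]:
    """Return (min_grid_size, block_size), mirroring the order of CUDA's outputs."""
    best_block, best_warps, best_blocks = 0, -1, 0
    # Descending order plus a strict ">" keeps the largest block size among ties.
    for block_size in range(max_block_size, 0, -WARP_SIZE):
        nblocks = blocks_per_sm(block_size)
        active_warps = nblocks * math.ceil(block_size / WARP_SIZE)
        if active_warps > best_warps:
            best_block, best_warps, best_blocks = block_size, active_warps, nblocks
        if best_warps == MAX_WARPS_PER_SM:
            break  # full occupancy reached, no smaller block size can do better
    # Minimum grid size: enough blocks to fill every SM at the best occupancy.
    return best_blocks * num_sms, best_block


# Hypothetical kernel limited only by threads per SM and blocks per SM.
light_kernel = lambda bs: min(MAX_WARPS_PER_SM // math.ceil(bs / WARP_SIZE), 32)
print(max_potential_block_size(light_kernel, num_sms=3))
```

For such a light kernel on 3 SMs this yields a minimum grid size of 6 with 1024-thread blocks, i.e. "6 blocks for 1.000", which is a number of blocks, not threads.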