This is important for Hopper MMA (see #3278), where we parallelize TIDx only on the allocation domain of the MmaOp output. Currently we generate a usable kernel but cannot launch it properly, because we can't infer the x dimension of the block size. This PR fixes that by replacing `tv->getLoopDomain()` with `tv->domain()->allIDs()`, which inspects the root, logical, loop, and allocation domains, as well as intermediate IterDomains, to find parallelized dimensions.
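To illustrate why the choice of domain matters, here is a minimal, self-contained sketch. It does not use nvFuser's real classes: `ToyIterDomain`, `ToyTensorView`, `findExtent`, and `makeMmaLikeOutput` are hypothetical stand-ins, and the extents are made up. The toy `allIDs()` simply concatenates the loop and allocation domains, whereas the real one also covers root, logical, and intermediate IterDomains.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Toy stand-ins for illustration only; these are NOT nvFuser types.
enum class ParallelType { Serial, TIDx, TIDy, BIDx };

struct ToyIterDomain {
  ParallelType ptype;
  int64_t extent;
};

struct ToyTensorView {
  std::vector<ToyIterDomain> loop;
  std::vector<ToyIterDomain> allocation;

  // Assumption: in this toy, "all IDs" is just loop + allocation.
  std::vector<ToyIterDomain> allIDs() const {
    std::vector<ToyIterDomain> ids = loop;
    ids.insert(ids.end(), allocation.begin(), allocation.end());
    return ids;
  }
};

// Return the extent of the first IterDomain bound to `ptype`, if any.
std::optional<int64_t> findExtent(
    const std::vector<ToyIterDomain>& ids,
    ParallelType ptype) {
  for (const auto& id : ids) {
    if (id.ptype == ptype) {
      assert(id.extent > 0);
      return id.extent;
    }
  }
  return std::nullopt;
}

// A tensor whose only TIDx binding lives on the allocation domain,
// mimicking the MmaOp output situation described above.
ToyTensorView makeMmaLikeOutput() {
  ToyTensorView tv;
  tv.loop = {{ParallelType::Serial, 64}, {ParallelType::Serial, 2}};
  tv.allocation = {{ParallelType::TIDx, 128}, {ParallelType::Serial, 1}};
  return tv;
}
```

In this sketch, `findExtent(tv.loop, ParallelType::TIDx)` comes back empty for the tensor built by `makeMmaLikeOutput()`, which corresponds to being unable to infer the block's x dimension. Searching `tv.allIDs()` instead finds the TIDx binding and its extent.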