ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.

Prediction model for optimal number of stream-k tiles to run #1934

Closed AlexBrownAMD closed 2 months ago

AlexBrownAMD commented 2 months ago

The number of tiles to run in stream-k mode vs data-parallel mode depends on the kernel's macro tile and the problem size. If the number of tiles is evenly divisible by the number of CUs (grid size), then the kernel will run in a fully data-parallel persistent kernel mode. But in the average case, the problem isn't evenly divisible and some tiles need to be run in stream-k mode.

`Remainder Tiles = Total Tiles % Grid Size`

If the number of remainder tiles is large, it is optimal to run only the remainder tiles in stream-k. But if the number of remainder tiles is really small, it is better to run the remainder tiles + 1 full grid of tiles. For example, say a problem has 1001 tiles and the grid size is 100, leaving 1 remainder tile. It is faster to run 100+1 tiles in stream-k mode than to split the 1 remainder tile across 100 workgroups.
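The decision above can be sketched as follows. This is a minimal illustration, not the PR's implementation; the `threshold` fraction is a hypothetical placeholder for the tuned values in the change.

```python
# Hypothetical sketch of the full-tile vs remainder-tile decision.
# The 0.25 threshold is illustrative only, not the measured value from the PR.

def streamk_tiles(total_tiles: int, grid_size: int, threshold: float = 0.25) -> int:
    """Return how many tiles to run in stream-k mode.

    If the tile count is evenly divisible by the grid, run fully data-parallel
    (0 stream-k tiles). If the remainder is small relative to the grid, fold
    in one extra full grid of tiles; otherwise run only the remainder.
    """
    remainder = total_tiles % grid_size
    if remainder == 0:
        return 0  # fully data-parallel persistent kernel
    if remainder < threshold * grid_size:
        return remainder + grid_size  # remainder + 1 full grid in stream-k
    return remainder  # remainder tiles only

# The example from the text: 1001 tiles on a 100-CU grid leaves 1 remainder
# tile, so it is faster to run 100+1 tiles in stream-k mode.
print(streamk_tiles(1001, 100))  # -> 101
print(streamk_tiles(1000, 100))  # -> 0
```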

FullTile vs RemainderTile was previously a tuning parameter for experimentation. This change turns the feature into a run-time decision by predicting optimal performance. The threshold for running FullTile vs RemainderTile also depends on the problem's K dimension. For larger K, it is better to run only the remainder tiles more often, since the problem is more like one we would solve with a GSU kernel with more splitting. For small K, it is usually better to run FullTile + Remainder. The actual thresholds listed in the change were measured and tested, and there is supporting documentation on the related tickets.
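The K-dependence might be pictured like this. The breakpoints and fractions below are invented for illustration; the tuned thresholds are the ones recorded in the change and its tickets.

```python
# Illustrative only: sketches how the full-tile threshold could shrink as K
# grows, so that large-K problems (which behave more like GSU candidates)
# choose remainder-only more often. Values are hypothetical, not the tuned ones.

def full_tile_threshold(k: int) -> float:
    """Fraction of the grid size below which we add a full grid of
    stream-k tiles. Larger K -> smaller threshold -> remainder-only more often."""
    if k <= 1024:
        return 0.5   # small K: usually prefer FullTile + Remainder
    if k <= 4096:
        return 0.25
    return 0.1       # large K: almost always remainder-only

print(full_tile_threshold(512))   # -> 0.5
print(full_tile_threshold(8192))  # -> 0.1
```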

This change also includes a new environment variable, TENSILE_STREAMK_FULL_TILES, which allows us to override the prediction model for testing purposes. If we find a problem size that has a performance cliff that we suspect is caused by this prediction model, we can override it and retest without having to tune any new kernels, and then fix the problem with a small adjustment to the predictor.
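An override hook of this shape could look like the sketch below. The exact encoding Tensile uses for TENSILE_STREAMK_FULL_TILES may differ; here the variable is assumed to hold a boolean-style flag, which is an assumption for illustration.

```python
# Hedged sketch of an environment-variable override for the predictor.
# Assumption: "0" means force remainder-only, any other value forces
# FullTile + Remainder; the real variable's semantics may differ.

import os

def use_full_tiles(predicted: bool) -> bool:
    """Return the predictor's decision unless the env var forces a value."""
    override = os.environ.get("TENSILE_STREAMK_FULL_TILES")
    if override is not None:
        return override != "0"  # assumed encoding, for testing overrides
    return predicted

# Force FullTile + Remainder regardless of the prediction:
os.environ["TENSILE_STREAMK_FULL_TILES"] = "1"
print(use_full_tiles(False))  # -> True
```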

The current predictor was made and tested on one architecture. As future work, this code needs to be generalized and extended to cover additional device architectures.

nakajee commented 2 months ago

Looks like some tests fail in gfx1101 precheckin. Would you please take a look?

AlexBrownAMD commented 2 months ago

> Looks like some tests fail in gfx1101 precheckin. Would you please take a look?

Yes, the error was related to my change. Just posted a fix and rerunning CI to make sure it fixes all test errors.