Confusion about `block_tile_j` in cudaTensorCoreGemm.cu

I am a little confused about the formula for `block_tile_j` in cudaTensorCoreGemm.cu at line 230: `const unsigned int block_tile_j = (block_pos * BLOCK_COL_TILES) % N_TILES;`

In theory, `block_tile_i` and `block_tile_j` are the row index and column index, respectively, of a tile in matrix C or D. `block_tile_i` is computed as the block position (`block_pos`) multiplied by the number of tiles per row in each thread block (`BLOCK_ROW_TILES`), divided by the total number of tiles along the N direction (`N_TILES`), and then multiplied by the number of tiles per column (`BLOCK_COL_TILES`), i.e. `const unsigned int block_tile_i = ((block_pos * BLOCK_ROW_TILES) / N_TILES) * (BLOCK_COL_TILES);`.

So why is `block_tile_j` equal to `(block_pos * BLOCK_COL_TILES) % N_TILES` rather than `(block_pos * BLOCK_ROW_TILES) % N_TILES`?
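
To make the comparison concrete, here is a minimal host-side sketch (plain C++14, no GPU needed) that evaluates both expressions for a range of `block_pos` values. The helper names `as_written`, `proposed`, and `formulas_agree` are hypothetical, and the constants are passed in as parameters because the values used below are illustrative assumptions, not values read from the sample.

```cpp
// Host-side sketch of the tile-index arithmetic discussed above.
// Helper names and all constant values are hypothetical, for illustration only.

struct TileIndex {
    unsigned int i;  // tile row of the block's first tile
    unsigned int j;  // tile column of the block's first tile
};

// The two formulas as quoted from the sample.
constexpr TileIndex as_written(unsigned int block_pos, unsigned int row_tiles,
                               unsigned int col_tiles, unsigned int n_tiles) {
    return TileIndex{((block_pos * row_tiles) / n_tiles) * col_tiles,
                     (block_pos * col_tiles) % n_tiles};
}

// The variant asked about: j is derived from BLOCK_ROW_TILES instead.
constexpr TileIndex proposed(unsigned int block_pos, unsigned int row_tiles,
                             unsigned int col_tiles, unsigned int n_tiles) {
    return TileIndex{((block_pos * row_tiles) / n_tiles) * col_tiles,
                     (block_pos * row_tiles) % n_tiles};
}

// True if both variants give the same j for every block_pos in [0, num_blocks).
constexpr bool formulas_agree(unsigned int num_blocks, unsigned int row_tiles,
                              unsigned int col_tiles, unsigned int n_tiles) {
    for (unsigned int p = 0; p < num_blocks; ++p) {
        if (as_written(p, row_tiles, col_tiles, n_tiles).j !=
            proposed(p, row_tiles, col_tiles, n_tiles).j) {
            return false;
        }
    }
    return true;
}
```

With the assumed values `row_tiles = 8`, `col_tiles = 8`, `n_tiles = 16`, `formulas_agree(64, 8, 8, 16)` is true, so the two expressions are interchangeable whenever the two constants are equal. With `row_tiles = 8` and `col_tiles = 4` they already differ at `block_pos = 1` (4 versus 8). Whether `BLOCK_ROW_TILES` and `BLOCK_COL_TILES` actually expand to the same value should be checked against the macro definitions in the sample.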

I look forward to someone clearing up my confusion. Thanks.

NVIDIA / cuda-samples
