If we have two or four memory requests by a warp, do they need coalesced access/contiguity?

ocwins commented 3 years ago

Hi,

I'm trying to implement my own iterator for a special scene, I have read the manual and it says:

If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently:
 Two memory requests, one for each half-warp, if the size is 8 bytes,
 Four memory requests, one for each quarter-warp, if the size is 16 bytes.
(see: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0)

Then I have a question: If we have two or four memory requests by a warp, do they need coalesced access/contiguity?

For example, first 16 threads read 8x16 bytes from address A, other 16 threads read another 8x16 bytes from address B, and the beginning of B is not following the end of A, is there a performance penalty in this way ?

Any hints are appreciated.

Aha! Link: https://nvaiinfa.aha.io/features/CUTLASS-11

hwu36 commented 3 years ago

What about just tell me the details of the problem you want to solve and let me provide you suggestions?

ocwins commented 3 years ago

What about just tell me the details of the problem you want to solve and let me provide you suggestions?

In a scene I have an int8 conv2d problem which have 16 channels and 3x3 filters. Currently I use the Interleave-32 ncxhwx layout (because I have used it on other larger layers), and fill the left channels with zero. I want to eliminate this waste.

I have read the codes and found that the iterator for global memory accesses iterates on filter's r and s, but can't use vectors from positions of different r or s (make them together) to satisfy the needs of smem iterator and gemm op. if I understand the codes correctly, a gemm's K of at least 32 is what they need (at warp level) and they don't really care about the conv2d but just need data for gemm. So I think I should implement a global iterator to feed the smem iterator.

If there is another way to do this, I'm happy. But I still want to know the answer to my original question. The knowledge could be useful in other cuda projects as well.

ocwins commented 3 years ago

What about just tell me the details of the problem you want to solve and let me provide you suggestions?

Ouch, forgot to say.

Only GEMMs which employ tensor op instructions are considered. Without this condition, what I said "a gemm's K of at least 32" is nonsense.

hwu36 commented 3 years ago

I have read the codes and found that the iterator for global memory accesses iterates on filter's r and s, but can't use vectors from positions of different r or s (make them together) to satisfy the needs of smem iterator and gemm op.

correct

a gemm's K of at least 32 is what they need (at warp level)

K or the channel C needs to be multiple of 32 for interleaved-32

they don't really care about the conv2d but just need data for gemm. So I think I should implement a global iterator to feed the smem iterator.

I don't really understand what you mean here. conv2d is mapped to gemm in cutlass. The only unique part of conv2d is the global iterator. The rest reuses the gemm components.

Are you running on Turing or Ampere? Do you use 8x8x16 or 16x8x32 tensor cores?

Being honest, I don't have an easy solution for small channel conv without wasting any resources.

Back to your original question,

For example, first 16 threads read 8x16 bytes from address A, other 16 threads read another 8x16 bytes from address B, and the beginning of B is not following the end of A, is there a performance penalty in this way ?

Do you mean every thread loads 8 times and every time loads 16B? If so, it is fine as long as every time t0 loads a, t1 loads a+16, t2 loads a+32, etc...

ocwins commented 3 years ago

Thank you for your reply.

I don't really understand what you mean here. conv2d is mapped to gemm in cutlass. The only unique part of conv2d is the global iterator. The rest reuses the gemm components.

I actually mean this, you have said it here. I'm trying to implement a global iterator also satisfy the gemm components, and load data from two positions.

Are you running on Turing or Ampere? Do you use 8x8x16 or 16x8x32 tensor cores?

For integer type, there are m16n8k16 and m16n8k32 instructions, is there a m8n8k16?

And with k16, I think that the pipeline still needs K doubled to at least 32 (each stage proceeds 16). On the other hand, the smem iterator doesn't have mechanism to work with int8 m8n8k16 and data layout where K is 16. Am I correct ?

Do you mean every thread loads 8 times and every time loads 16B? If so, it is fine as long as every time t0 loads a, t1 loads a+16, t2 loads a+32, etc...

No. I mean the first half-warp (T0...T15) reads total 128B (8B per thread) and the second half-warp(T16...T31) reads another 128B (8B per thread).

T00 loads 'a+0', T01 loads 'a+8'.... T15 load 'a+120', T16 loads 'b+0', T17 loads 'b+8'.... T31 load 'b+120' . But 'a+128' is not 'b+0'.

Thus there are two memory requests. one for first half-warp, another for second half-warp. So I guess that, coalesced access is still needed within each half-warp (16 threads), but each half-warp could access separate 128B memory block without performance penalty.

T0 ..... T15 __ T16 ... T31 A: 128B | data not used __ | B: another 128B not following A request 0 ___ request 1

But I can't find if it's true in the manual, and I don't have enough experience to make proper tests to check if it's true. So I come here for help.

hwu36 commented 3 years ago

For integer type, there are m16n8k16 and m16n8k32 instructions, is there a m8n8k16?

8816 is used on turing, 16832 and 16816 are used on ampere. just depends on which device do you use.

No. I mean the first half-warp (T0...T15) reads total 128B (8B per thread) and the second half-warp(T16...T31) reads another 128B (8B per thread).

Vectorized 16B load is more efficient, but if you only have tiny amount of data. 8B load is okay.

ocwins commented 3 years ago

Vectorized 16B load is more efficient, but if you only have tiny amount of data. 8B load is okay.

So each memory request for half-warp could access different memory location? (in half-warp 16 threads still keep coalesced access)

With 16B loads, there are 4 memory requests, so each quarter-warp(8 threads) totally loads 128B, a full-warp could access 4 different location in such situation ?

128B accessed by one memory request must be coalesced access, but different memory requests may not be contiguous even in same warp, and there is no performance penalty.

Am I correct ?

hwu36 commented 3 years ago

Your limiting factor is that you only have tiny data to load, not memory coalescing. With the tiny memory transaction you have, I actually don't expect how you load would matter.

ocwins commented 3 years ago

Your limiting factor is that you only have tiny data to load, not memory coalescing. With the tiny memory transaction you have, I actually don't expect how you load would matter.

Got it.

I also read some codes of "Conv2dFpropActivationTileAccessIteratorOptimized", I think if kAccessesPerVector > 1, one thread loads contiguous parts in same vector with several accesses, so the full-warp load won't be coalesced. So I guess it's not a big deal in this situation (I'm not sure if my understanding is right).

But the answer to my original question is still unclear:

With 8B/16B per thread, if we consider about memory coalescing, can we organize the accesses by half-warp/quarter-warp (because every half-warps/quarter-warps have their own memory request), but not full-warp?

If each half-warp/quarter-warp visit an contiguous part (128B), but these parts are from different locations in global memory, is there performance penalty?

hwu36 commented 3 years ago

I think if kAccessesPerVector > 1, one thread loads contiguous parts in same vector with several accesses, so the full-warp load won't be coalesced.

You are correct. However, the biggest goal of adding kAccessesPerVector > 1 code path is to maximally reuse kAccessesPerVector = 1 path, the performance is not the top priority. The performance of kAccessesPerVector > 1 suffers from non coalesced memory traffic, shared memory bank conflict, predicate register spill, etc.

If each half-warp/quarter-warp visit an contiguous part (128B), but these parts are from different locations in global memory, is there performance penalty?

Yes, you can. Your biggest issue here is not coalescing, but locality.

ocwins commented 3 years ago

Yes, you can. Your biggest issue here is not coalescing, but locality.

If I don't misunderstand, the answer is:

A warp can access two different memory regions which are not next to each other without performance loss, if each half-warp access same region with 8B per thread. A half-warp is T0...T15 or T16...T31 A warp can access four different memory regions which are not next to each other without performance loss, if each quarter-warp access same region with 16B per thread. A quarter-warp is T0...T7, T8...T15, T16...T23 or T24...T31. A "memory regions" here is contiguous 128 bytes.

I know it may not important for my current problem, but I still want to get it clear. When I meet a suitable situation, I may find out how to use this behavior. For now I will take your advice and think about locality.

Thank you.

ocwins commented 3 years ago

I think if kAccessesPerVector > 1, one thread loads contiguous parts in same vector with several accesses, so the full-warp load won't be coalesced.

You are correct. However, the biggest goal of adding kAccessesPerVector > 1 code path is to maximally reuse kAccessesPerVector = 1 path, the performance is not the top priority. The performance of kAccessesPerVector > 1 suffers from non coalesced memory traffic, shared memory bank conflict, predicate register spill, etc.

If each half-warp/quarter-warp visit an contiguous part (128B), but these parts are from different locations in global memory, is there performance penalty?

Yes, you can. Your biggest issue here is not coalescing, but locality.

Hi,

I have an additional question: if cp.async is used for copying data from global memory to shared memory, shall I care about bank conflict ? or I just give the arguments, and the hardware will do it best ?

Sorry to bother you with so many questions, I can't find enough info on the internet.

hwu36 commented 3 years ago

cp.async is the same as st.shared when considering shared memory bank conflict.

mnicely commented 3 years ago

@ocwins can you close this issue if your questions have been answered?

NVIDIA / cutlass

If we have two or four memory requests by a warp, do they need coalesced access/contiguity? #328