Closed: clarencewxl closed this issue 4 years ago
I will add more examples.
A quick reference for LDS and STS:
you may load data with:
--:-:1:-:2 LDS.32 lds_dst, [ptr];
02:-:-:-:2 MOV dst, lds_dst;
and store data to shared memory with:
--:1:-:-:2 STS.32 [ptr], sts_src;
Note that the ptr is 32-bit wide for shared memory.
LDS and STS also support constant offsets, e.g.,
STS.32 [ptr+0x100], sts_src;
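If you need many stores at consecutive offsets, it may be easier to generate the lines with a small script than to type them by hand. Here is a minimal Python sketch, assuming each STS.32 writes one 4-byte word and reusing the control code from the quick reference above. The function name `sts_lines` and its parameters are made up for illustration and are not part of turingas; `ptr` and `sts_src` are the same placeholders as above.

```python
def sts_lines(base_offset, count, ptr="ptr", src="sts_src"):
    """Build `count` STS.32 lines at consecutive 4-byte offsets from base_offset.

    `ptr` and `src` are placeholder operand names, as in the quick reference.
    The control code "--:1:-:-:2" is copied verbatim from the STS example above.
    """
    lines = []
    for i in range(count):
        offset = base_offset + 4 * i  # a 32-bit store covers 4 bytes
        lines.append(f"--:1:-:-:2 STS.32 [{ptr}+{offset:#x}], {src};")
    return lines


if __name__ == "__main__":
    print("\n".join(sts_lines(0x100, 3)))
```

The script only produces text; you still have to paste the lines into your own kernel source and pick real registers.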
How can I get the "ptr" of the shared memory?
I just want to test the latency and throughput of LDS and STS. Here is my code:
--:-:1:-:2 LDS.32 R100, [0x100];
02:-:-:-:2 MOV R200, R100;
--:-:-:-:2 EXIT;
(I will copy the LDS instruction 10000 times once I know I am using it correctly.)
It fails with: CUDA Error: an illegal memory access was encountered.
I know that "0x100" is a wrong address, but how do I get a valid one?
Many thx.
I have added an example of how to measure the throughput of LDS.32 (https://github.com/daadaada/turingas/tree/master/examples/bench/smem). You can try it.
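For the "copy the LDS instruction 10000 times" approach mentioned in this thread, a short script can write the repeated lines for you. This is only a sketch: it emits the instruction body using the control codes quoted in this thread, and it does not produce the kernel header or set up a valid shared-memory address, so take those from the linked example. The name `lds_bench_body` and its parameters are hypothetical, and `ptr` is a placeholder for a register holding a valid 32-bit shared-memory address.

```python
def lds_bench_body(count, dst="R100", addr="ptr"):
    """Return `count` LDS.32 lines, then a MOV that consumes the result, then EXIT.

    Control codes are copied verbatim from the lines quoted in this thread.
    `addr` is a placeholder; it must name a register with a valid shared address.
    """
    lds = f"--:-:1:-:2 LDS.32 {dst}, [{addr}];"
    lines = [lds] * count
    # The MOV waits on the barrier set by the loads, as in the quick reference.
    lines.append(f"02:-:-:-:2 MOV R200, {dst};")
    lines.append("--:-:-:-:2 EXIT;")
    return "\n".join(lines)


if __name__ == "__main__":
    # Print a small body; use count=10000 for the actual benchmark.
    print(lds_bench_body(4))
```

Redirect the output to a file and paste it into the kernel body of your own source.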
I just read your papers on GEMM optimization (IPDPS) and Winograd optimization (PPoPP), and I am very interested in turingas. But the lack of sample code makes it hard for me to get started.
So, could you please upload more sample code? To begin with, could you upload sample code for LDS and STS (shared memory)?
Thx.