daadaada / turingas

Assembler for NVIDIA Volta and Turing GPUs
MIT License
201 stars, 40 forks

Shared Memory Sample Code #6

clarencewxl closed this issue 4 years ago

clarencewxl commented 4 years ago

I just read your papers on GEMM optimization (IPDPS) and Winograd optimization (PPoPP), and I am very interested in turingas. But the lack of sample code makes it hard for me to get started.

So, could you please upload more sample code? To begin with, could you upload sample code for the LDS and STS shared memory instructions?

Thx.

daadaada commented 4 years ago

I will add more examples.

A quick reference for LDS and STS. You may load data with:

    --:-:1:-:2 LDS.32 lds_dst, [ptr];
    02:-:-:-:2 MOV dst, lds_dst;

and store data to shared memory with:

    --:1:-:-:2 STS.32 [ptr], sts_src;

Note that for shared memory, ptr is 32 bits wide, so it fits in a single register.
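For readers new to the prefix in front of each instruction, here is a minimal Python sketch that splits such a line into its control code and the instruction text. It assumes the five colon-separated fields follow the maxas-style order (wait mask, read barrier, write barrier, yield, stall); that ordering is an assumption here, not something stated in this thread. `ControlCode`, `parse_control_code`, and `split_instruction` are hypothetical helper names, not part of turingas. The wait mask is kept as a raw hex value rather than decoded into barrier numbers.

```python
from typing import NamedTuple, Optional, Tuple


class ControlCode(NamedTuple):
    wait_mask: int                 # raw hex value of field 1, 0 when "--"
    read_barrier: Optional[int]    # field 2, None when "-"
    write_barrier: Optional[int]   # field 3, None when "-"
    yield_flag: bool               # field 4, True when "Y" or "y"
    stall: int                     # field 5, parsed as hex


def parse_control_code(text: str) -> ControlCode:
    """Split 'ww:r:w:y:s' into named fields (assumed maxas-style order)."""
    fields = text.strip().split(":")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {text!r}")
    wait, read, write, yld, stall = fields
    return ControlCode(
        wait_mask=0 if wait == "--" else int(wait, 16),
        read_barrier=None if read == "-" else int(read),
        write_barrier=None if write == "-" else int(write),
        yield_flag=yld.upper() == "Y",
        stall=int(stall, 16),
    )


def split_instruction(line: str) -> Tuple[ControlCode, str]:
    """Separate the leading control code from the instruction text."""
    ctrl, _, rest = line.strip().partition(" ")
    return parse_control_code(ctrl), rest.strip()


if __name__ == "__main__":
    for line in (
        "--:-:1:-:2 LDS.32 lds_dst, [ptr];",
        "02:-:-:-:2 MOV dst, lds_dst;",
        "--:1:-:-:2 STS.32 [ptr], sts_src;",
    ):
        print(split_instruction(line))
```

Under that assumed field order, the LDS line sets a write barrier, the MOV line carries a non-empty wait mask, and the STS line sets a read barrier.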

daadaada commented 4 years ago

LDS and STS also support a constant offset, e.g., STS.32 [ptr+0x100], sts_src;

clarencewxl commented 4 years ago

How can I get the "ptr" of the shared memory?

I just want to test the latency and throughput of LDS and STS, and here is my code:

    --:-:1:-:2 LDS.32 R100, [0x100];
    02:-:-:-:2 MOV R200, R100;
    --:-:-:-:2 EXIT;

(I will copy the LDS instruction 10000 times once I am using it correctly.) The error is: CUDA Error: an illegal memory access was encountered.

I know "0x100" is a wrong address, but how do I get a valid one?

Many thx.

daadaada commented 4 years ago

I have added an example showing how to measure the throughput of LDS.32 (https://github.com/daadaada/turingas/tree/master/examples/bench/smem). You can try it.