ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.
MIT License
200 stars 137 forks

Is Tensile adapted to RDNA2? #1579

Open v01dXYZ opened 1 year ago

v01dXYZ commented 1 year ago

Hello,

As you may know, RDNA2 has a 128 MB L3 cache, which is an important difference from the GCN/CDNA architectures. It allows efficient use of a memory subsystem with a narrower bus, built from 8 Samsung GDDR6 chips (8x32x16Gbps), although its throughput is still higher than that of a Vega 10.

Are Tensile or MISA adapted to a microarchitecture where caching (i.e. spatial/temporal locality) is central to achieving peak performance? Do you think RDNA2 could be as good as, or even better than, a GCN/CDNA architecture for GEMM by keeping blocks in the L3 cache for as long as possible?

We have 128 MB / 160 wavefronts ~= 800 KB per wavefront (160 wavefronts = 80 CU * 2 concurrent 32-lane wavefronts per CU). That is not far from the L2 cache found on CPUs (Ryzen 5xxx series: 512 KB L2 cache).
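The back-of-the-envelope figures in this comment can be checked with a short script. This is only a minimal sketch and is not taken from Tensile: the function names are hypothetical, MB is read as MiB, and the tile-size estimate assumes FP32 data and three equally sized square tiles (one block each of A, B and C), none of which the post states.

```python
import math


def peak_bandwidth_gb_s(chips: int, bits_per_chip: int, gbps_per_pin: float) -> float:
    """Peak DRAM bandwidth in GB/s: chips * interface width (bits) * per-pin rate / 8."""
    return chips * bits_per_chip * gbps_per_pin / 8


def cache_share_kib(cache_mib: int, compute_units: int, waves_per_cu: int) -> float:
    """Cache capacity per concurrent wavefront in KiB, using the post's wavefront count."""
    total_bytes = cache_mib * 1024 * 1024
    return total_bytes / (compute_units * waves_per_cu) / 1024


def max_square_tile(share_kib: float, bytes_per_element: int = 4, tiles: int = 3) -> int:
    """Largest n such that `tiles` n-by-n tiles fit in the share (assumed layout)."""
    elements_per_tile = int(share_kib * 1024) // (bytes_per_element * tiles)
    return math.isqrt(elements_per_tile)


if __name__ == "__main__":
    # 8 GDDR6 chips, 32-bit interface each, 16 Gbps per pin (the "8x32x16Gbps" above)
    print(f"{peak_bandwidth_gb_s(8, 32, 16):.0f} GB/s")  # 512 GB/s

    # 128 MB shared by 80 CU * 2 wavefronts = 160 wavefronts
    share = cache_share_kib(128, 80, 2)
    print(f"{share:.1f} KiB per wavefront")  # 819.2 KiB per wavefront

    # Hypothetical FP32 blocking: one tile each of A, B and C per wavefront
    print(f"{max_square_tile(share)} x {max_square_tile(share)} tile")  # 264 x 264 tile
```

Under these assumptions the "~800 KB per wavefront" estimate holds, and each wavefront's share would be enough for FP32 blocks of roughly 264 x 264 elements. Whether a kernel can actually keep such blocks resident depends on how the cache is shared in practice, which this sketch does not model.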

bragadeesh commented 1 year ago

Yes, Tensile has support for RDNA2. Assigning this to @TonyYHsieh for further support.

ppanchad-amd commented 10 hours ago

@v01dXYZ Do you still need assistance with this ticket? If not, please close the ticket. Thanks!