Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
MIT License
290 stars, 66 forks
issues
Incorrect results with enable_check 1
#12
cokeshao
closed
2 months ago
2
With WMMA, padding matrix A by 8 does not seem to fully resolve the bank conflict problem?
#11
luliyucoordinate
closed
5 months ago
0
Why does matrix B need to be transposed?
#10
luliyucoordinate
closed
5 months ago
0
Question about a code detail in `wmma_async_stage2.cu`
#9
luliyucoordinate
closed
5 months ago
0
About how permute is implemented
#8
feiyuvl
closed
9 months ago
2
About the layout of matrices A and B
#7
feiyuvl
closed
9 months ago
1
Question about the tiling size
#6
macto94
closed
10 months ago
2
Cooperative Async Copies
#5
FabianSchuetze
closed
10 months ago
2
Question: shared memory bank conflict
#4
matrix97317
closed
10 months ago
1
Change to block of 128 by 256
#3
yupei-ms
closed
1 year ago
3
#define CHUNK_K 2 // 32 / WMMA_K
#2
lk137095576
closed
1 year ago
1
mma_naive produces incorrect results
#1
FdyCN
closed
1 year ago
1