NicoNico6 opened this issue 3 months ago
@NicoNico6 The w4a8 GEMM kernel only supports group_size=128 for now. If you want to support other group sizes, you need to design new GEMM kernels, which may be a little complicated.
@NicoNico6 The kernel requires group_size % thread_k == 0. As thread_k is either 128 or 64 (refer to the following code), the group_size should be a multiple of 128, such as 256.
```cpp
if (thread_k == -1 || thread_n == -1) {
  if (prob_m <= 16) {
    // For small batch sizes, better partitioning is slightly more important than better compute utilization
    thread_k = 128;
    thread_n = 128;
  } else {
    thread_k = 64;
    thread_n = 256;
  }
}
```
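To make the constraint concrete, here is a minimal sketch (the helper below is hypothetical, not code from this repo) of the divisibility check the dispatch above implies:

```cpp
#include <cassert>

// Illustration only: the dispatch above may pick thread_k = 128 (prob_m <= 16)
// or thread_k = 64, and the kernel needs group_size % thread_k == 0 in both
// cases, so group_size must be a multiple of 128 (128, 256, 384, ...).
void check_group_size(int group_size) {
  assert(group_size % 128 == 0);  // also implies group_size % 64 == 0
}
```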
```cpp
CALL_IF(1, 8, 8, 16)
CALL_IF(1, 16, 4, 16)
CALL_IF(2, 16, 4, 16)
CALL_IF(3, 16, 4, 16)
CALL_IF(4, 16, 4, 16)
```
Then I re-ran test_w4a8.py with groupsize=256, but it still fails:

```
FAIL: test_groups (__main__.Test)
Traceback (most recent call last):
  File "/workspace/marlin/test_w4a8.py", line 178, in test_groups
    self.run_problem(m, n, k, *thread_shape, groupsize)
  File "/workspace/marlin/test_w4a8.py", line 87, in run_problem
    self.assertLess(torch.mean(torch.abs(D - D_ref)) / torch.mean(torch.abs(D_ref)), 0.003)
AssertionError: tensor(1.4229, device='cuda:0', dtype=torch.float16) not less than 0.003

Ran 6 tests in 53.540s

FAILED (failures=1)
```

What else in the code needs to be modified?
Hi, when I benchmark w4a8 Marlin against cuBLAS FP32 on an A6000, 3090, and 4090, Marlin is 4-6x slower than cuBLAS FP32, but on an A100 I see performance close to what the Marlin paper reports. What could be the reason for this?
@darrenearl I think the memory bandwidth of those cards may be limiting performance; the 4090 has only about half the bandwidth of an A100. I'd suggest profiling the kernel with nsight-compute to find the bottleneck, which will make it easier to analyze.
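For example (assuming test_w4a8.py is your benchmark entry point; adjust for your own script), something like `ncu --set full -o w4a8_report python test_w4a8.py` captures a full per-kernel profile you can open in the Nsight Compute UI to compare achieved memory throughput against the card's peak.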
OK, thanks.
Hi, thanks for your great work and for open-sourcing it. I am trying different quantization group sizes (changing from 128 to 64/32) via the default hyperparameter `group_size`, but the GEMM results are NaN, unlike with group size 128.
Can you share some ideas on this issue? How can I support other group sizes?