NicoNico6 opened this issue 3 months ago
@NicoNico6 The w4a8 GEMM kernel only supports group_size=128 for now. If you want to support other group sizes, you need to design new GEMM kernels, which may be a little complicated.
@NicoNico6 The kernel requires group_size % thread_k == 0. As thread_k is either 128 or 64 (refer to the following code), the group_size should be a multiple of 128, such as 256.
```cpp
if (thread_k == -1 || thread_n == -1) {
  if (prob_m <= 16) {
    // For small batch sizes, better partitioning is slightly more important than better compute utilization
    thread_k = 128;
    thread_n = 128;
  } else {
    thread_k = 64;
    thread_n = 256;
  }
}
```
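To make the constraint concrete, here is a minimal sketch (the helper below is hypothetical, not code from this repo) of the divisibility check the dispatch above implies:

```cpp
#include <cassert>

// Illustration only: the dispatch above may pick thread_k = 128 (prob_m <= 16)
// or thread_k = 64, and the kernel needs group_size % thread_k == 0 in both
// cases, so group_size must be a multiple of 128 (128, 256, 384, ...).
void check_group_size(int group_size) {
  assert(group_size % 128 == 0);  // also implies group_size % 64 == 0
}
```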
```cpp
CALL_IF(1, 8, 8, 16)
CALL_IF(1, 16, 4, 16)
CALL_IF(2, 16, 4, 16)
CALL_IF(3, 16, 4, 16)
CALL_IF(4, 16, 4, 16)
```
Then I re-ran test_w4a8.py with groupsize=256, but it still fails:

```
FAIL: test_groups (__main__.Test)
Traceback (most recent call last):
  File "/workspace/marlin/test_w4a8.py", line 178, in test_groups
    self.run_problem(m, n, k, *thread_shape, groupsize)
  File "/workspace/marlin/test_w4a8.py", line 87, in run_problem
    self.assertLess(torch.mean(torch.abs(D - D_ref)) / torch.mean(torch.abs(D_ref)), 0.003)
AssertionError: tensor(1.4229, device='cuda:0', dtype=torch.float16) not less than 0.003

Ran 6 tests in 53.540s

FAILED (failures=1)
```

What else in the code needs to be modified?
Hi, when I benchmark w4a8 Marlin against cuBLAS FP32 on an A6000, 3090, and 4090, Marlin is 4-6x slower than cuBLAS FP32, but on an A100 I see performance close to what the Marlin paper reports. What could be the reason for this?
@darrenearl I think the memory bandwidth of those cards may be limiting performance; the 4090 has only about half the bandwidth of an A100. I'd suggest profiling the kernel with nsight-compute to find the bottleneck, which will make it easier to analyze.
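For example (assuming test_w4a8.py is your benchmark entry point; adjust for your own script), something like `ncu --set full -o w4a8_report python test_w4a8.py` captures a full per-kernel profile you can open in the Nsight Compute UI to compare achieved memory throughput against the card's peak.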
OK, thanks.
Hi, thanks for your great work and for open-sourcing it. I am trying different quantization group sizes (changing from 128 to 64/32) via the default hyperparameter `group_size`, but the GEMM results are NaN, unlike with group size 128.
Can you share some ideas on this issue? How can I support other group sizes?