IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Apache License 2.0

Support for Hopper H100 #7

Open rosario-purple opened 7 months ago

rosario-purple commented 7 months ago

Hi! You've probably already considered this, but would you be able to add support for Hopper H100 GPUs? A100s don't have nearly as much memory bandwidth. I'm happy to run tests/benchmarks on one if that would help. Thanks!

Ageliss commented 7 months ago

I ran a benchmark on an H800, which may be a bit slower than an H100. Hope it helps.

[image: benchmark results for Llama 7B and 65B on H800]
Ageliss commented 7 months ago

Also, I had another question: how does Marlin perform compared with TRT-LLM's `__device__ void weight_only_batched_gemv()`? https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernel.h#L296

Recently, a NeurIPS paper called QuIP also shared a W2-W4 GEMM implementation. It seems Marlin and QuIP both use a similar mma approach, but one that is very different from TRT-LLM's. QuIP decompression kernel: https://github.com/Cornell-RelaxML/quip-sharp/blob/cd1949525722fa9b201af7a8c96841cbbd046b4c/quiptools/quiptools_e8p_gemv.cu

Any comments on the difference and performance?
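
For context on what the "similar mma" refers to, here is a minimal, hypothetical CUDA sketch. It is not code from Marlin, QuIP, or TRT-LLM; the helper names and the simple shift-based unpacking are illustrative assumptions. It only shows the general two-step pattern such FP16xINT4 tensor-core kernels follow: dequantize packed INT4 weights into FP16 registers, then feed tensor cores via an `mma.sync` PTX instruction, whereas the linked TRT-LLM kernel is a SIMT batched GEMV that multiply-accumulates in ordinary registers.

```cuda
// Illustrative sketch only (assumed names, not the real Marlin/QuIP kernels).
#include <cuda_fp16.h>
#include <cstdint>

// Unpack 8 unsigned 4-bit values from one 32-bit word and convert to FP16,
// applying a per-group scale. Real kernels use faster bit tricks
// (e.g. lop3-based dequantization) instead of this plain shift/mask loop.
__device__ inline void dequant_int4x8(uint32_t packed, half scale, half out[8]) {
    #pragma unroll
    for (int i = 0; i < 8; ++i) {
        int q = (packed >> (4 * i)) & 0xF;            // raw 4-bit value 0..15
        out[i] = __hmul(__int2half_rn(q - 8), scale); // shift to signed, scale
    }
}

// One m16n8k16 tensor-core multiply-accumulate issued via inline PTX.
// Each thread of the warp holds its fragment slice of A (FP16), B (FP16,
// e.g. just dequantized), and the FP32 accumulator C in the register layout
// mandated by the mma instruction.
__device__ inline void mma_m16n8k16(const uint32_t a[4], const uint32_t b[2],
                                     float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));
}
```

Roughly, the trade-off being asked about: a SIMT GEMV like TRT-LLM's can already be memory-bound and fast at batch size 1, while the tensor-core mma path mainly pays off as the batch grows, since the same dequantized weight fragments can be reused across multiple activation rows.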

Qubitium commented 5 months ago

@Ageliss Can you confirm that the benchmark results you posted for Llama 7B and 65B were produced on an H800 with the Marlin kernel? Can you also run the Marlin kernel benchmarks in bench.py and test.py on the H800? Thank you! I don't have an H100, but I would like to test/validate H100/H800 for the AutoGPTQ library.