rosario-purple opened 7 months ago
I ran a benchmark on an H800; it may be a little slower than an H100. Hope it helps.
Also, another question: how does Marlin perform compared with TRT-LLM's `__device__ void weight_only_batched_gemv()`? https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernel.h#L296
Recently, a NeurIPS paper called QuIP also shared a W2~W4 GEMM implementation. It seems Marlin and QuIP both use a similar mma approach, but one quite different from TRT-LLM's. QuIP decompress: https://github.com/Cornell-RelaxML/quip-sharp/blob/cd1949525722fa9b201af7a8c96841cbbd046b4c/quiptools/quiptools_e8p_gemv.cu
Any comments on the difference and performance?
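Whether a gemv-style kernel (like TRT-LLM's weightOnlyBatchedGemv) or an mma/tensor-core kernel (like Marlin's) wins largely comes down to arithmetic intensity at the given batch size: at batch 1 everything is weight-bandwidth bound, while tensor cores only pay off once there is enough compute per byte. A rough back-of-envelope sketch, with illustrative peak numbers I'm assuming (H100-class ~990 FP16 TFLOP/s, ~3.3 TB/s HBM), not measured values:

```python
# Rough arithmetic-intensity estimate for a 4-bit weight-only GEMM
# y[batch, n] = x[batch, k] @ W[k, n], with W stored as INT4 (0.5 byte/elem).
# Illustrative numbers only; real kernels and chips differ.

def arithmetic_intensity(batch, k, n, w_bytes_per_elem=0.5, act_bytes=2):
    flops = 2 * batch * k * n                      # multiply-accumulate count
    bytes_moved = (k * n * w_bytes_per_elem        # quantized weights
                   + batch * k * act_bytes         # FP16 activations in
                   + batch * n * act_bytes)        # FP16 outputs
    return flops / bytes_moved

PEAK_FLOPS = 990e12   # assumed H100-class FP16 tensor-core peak, FLOP/s
PEAK_BW = 3.3e12      # assumed HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW  # FLOP/byte needed to become compute-bound

for b in (1, 16, 64, 256):
    ai = arithmetic_intensity(b, 4096, 4096)
    bound = "compute" if ai > ridge else "bandwidth"
    print(f"batch={b:4d}  AI={ai:7.1f} FLOP/B  -> {bound}-bound")
```

On these assumed numbers the crossover sits somewhere between batch 64 and 256, which is consistent with gemv-style kernels targeting small batches and mma-style kernels scaling better as batch grows.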
@Ageliss Can you confirm that the LLaMA 7B and 65B benchmark results you posted were obtained on an H800 with the Marlin kernel? Could you also run the Marlin kernel benchmarks in bench.py and test.py on the H800? Thank you! I don't have an H100, but would like to test/validate H100/H800 support for the AutoGPTQ library.
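For reference, the usual shape of such a kernel benchmark is warmup iterations followed by timed repeats. A minimal generic sketch of that pattern (this is my own illustrative harness, not Marlin's actual bench.py):

```python
import time

def bench(fn, warmup=10, iters=100):
    """Time a callable: run warmup iterations first, then return mean
    seconds per iteration. For real GPU kernels you would also need to
    synchronize the device (e.g. torch.cuda.synchronize()) before reading
    the clock; that is omitted in this CPU-only sketch."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

# Toy workload standing in for a kernel launch.
mean_s = bench(lambda: sum(range(10000)))
print(f"{mean_s * 1e6:.1f} us/iter")
```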
Hi! You've probably already considered this, but would you be able to add support for Hopper H100 GPUs? A100s don't have nearly as much memory bandwidth. I'm happy to run tests/benchmarks on one if that would help. Thanks!
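To put a number on the bandwidth point: at batch size 1, decode must stream all weights through HBM once per generated token, so bandwidth gives a hard upper bound on tokens/s. A quick estimate using bandwidth figures I'm assuming (~2.0 TB/s for A100, ~3.35 TB/s for H100):

```python
# Bandwidth-bound decode upper bound: tokens/s <= bandwidth / weight_bytes,
# since every token requires one full pass over the weights.
def max_tokens_per_s(n_params, bits_per_weight, bw_bytes_per_s):
    weight_bytes = n_params * bits_per_weight / 8
    return bw_bytes_per_s / weight_bytes

params_7b = 7e9  # LLaMA-7B-scale model
for name, bw in (("A100 ~2.0 TB/s", 2.0e12), ("H100 ~3.35 TB/s", 3.35e12)):
    fp16 = max_tokens_per_s(params_7b, 16, bw)
    int4 = max_tokens_per_s(params_7b, 4, bw)
    print(f"{name}: FP16 <= {fp16:5.0f} tok/s, INT4 <= {int4:5.0f} tok/s")
```

The same arithmetic also shows why 4-bit weights give up to a 4x decode ceiling over FP16 regardless of GPU, and why the H100's extra bandwidth raises both ceilings proportionally.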