IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Apache License 2.0

Support for Hopper H100 #7

Open rosario-purple opened 7 months ago

rosario-purple commented 7 months ago

Hi! You've probably already considered this, but would you be able to add support for Hopper H100 GPUs? A100s don't have nearly as much memory bandwidth. I'm happy to run tests/benchmarks on one if that would help. Thanks!

Ageliss commented 7 months ago

I ran a benchmark on an H800, which may be a bit slower than an H100. Hope it helps.

[image: benchmark results for Llama 7B and 65B on H800]
Ageliss commented 7 months ago

Also, I had another question: how does Marlin perform compared with TRT-LLM's `__device__ void weight_only_batched_gemv()`? https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernel.h#L296

Recently, a NeurIPS paper called QuIP also shared a W2-W4 GEMM implementation. It seems Marlin and QuIP both use a similar mma approach, but one that is very different from TRT-LLM's. QuIP decompression kernel: https://github.com/Cornell-RelaxML/quip-sharp/blob/cd1949525722fa9b201af7a8c96841cbbd046b4c/quiptools/quiptools_e8p_gemv.cu

Any comments on the difference and performance?
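
For context on what the "similar mma" refers to, here is a minimal, hypothetical CUDA sketch. It is not code from Marlin, QuIP, or TRT-LLM; the helper names and the simple shift-based unpacking are illustrative assumptions. It only shows the general two-step pattern such FP16xINT4 tensor-core kernels follow: dequantize packed INT4 weights into FP16 registers, then feed tensor cores via an `mma.sync` PTX instruction, whereas the linked TRT-LLM kernel is a SIMT batched GEMV that multiply-accumulates in ordinary registers.

```cuda
// Illustrative sketch only (assumed names, not the real Marlin/QuIP kernels).
#include <cuda_fp16.h>
#include <cstdint>

// Unpack 8 unsigned 4-bit values from one 32-bit word and convert to FP16,
// applying a per-group scale. Real kernels use faster bit tricks
// (e.g. lop3-based dequantization) instead of this plain shift/mask loop.
__device__ inline void dequant_int4x8(uint32_t packed, half scale, half out[8]) {
    #pragma unroll
    for (int i = 0; i < 8; ++i) {
        int q = (packed >> (4 * i)) & 0xF;            // raw 4-bit value 0..15
        out[i] = __hmul(__int2half_rn(q - 8), scale); // shift to signed, scale
    }
}

// One m16n8k16 tensor-core multiply-accumulate issued via inline PTX.
// Each thread of the warp holds its fragment slice of A (FP16), B (FP16,
// e.g. just dequantized), and the FP32 accumulator C in the register layout
// mandated by the mma instruction.
__device__ inline void mma_m16n8k16(const uint32_t a[4], const uint32_t b[2],
                                     float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));
}
```

Roughly, the trade-off being asked about: a SIMT GEMV like TRT-LLM's can already be memory-bound and fast at batch size 1, while the tensor-core mma path mainly pays off as the batch grows, since the same dequantized weight fragments can be reused across multiple activation rows.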

Qubitium commented 5 months ago

@Ageliss Can you confirm that the benchmark results you posted for Llama 7B and 65B were produced on an H800 with the Marlin kernel? Can you also run the Marlin kernel benchmarks in bench.py and test.py on the H800? Thank you! I don't have an H100, but I would like to test/validate H100/H800 for the AutoGPTQ library.