IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Apache License 2.0

[Bug] H800 run UT failed. #6

Open · Ageliss opened this issue 8 months ago

Ageliss commented 8 months ago
[screenshot: failing unit test output]

This setup cannot pass the unit tests. Could you please check it?

efrantar commented 8 months ago

Hi, unfortunately, I don't have access to any H800s (or any Hopper GPUs for that matter), so it is a bit hard to test. Which of the matrix shapes are failing and by how much? Can you perhaps print the result of this line for all test cases, i.e., what is the relative average error?
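For reference, a minimal sketch of how that relative average error could be printed per test case, assuming the test compares the kernel output `C` against an FP16 reference `C_ref` (the names and the loop are illustrative, not Marlin's exact test code):

```python
import torch

def rel_avg_err(C: torch.Tensor, C_ref: torch.Tensor) -> float:
    # Mean absolute deviation from the FP16 reference, normalized by the
    # mean magnitude of the reference output.
    return (torch.mean(torch.abs(C - C_ref)) / torch.mean(torch.abs(C_ref))).item()

# Hypothetical usage inside the test loop over (m, n, k) shapes:
# print(f"m={m} n={n} k={k} rel_avg_err={rel_avg_err(C, C_ref):.2e}")
```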

Ageliss commented 8 months ago

> Hi, unfortunately, I don't have access to any H800s (or any Hopper GPUs for that matter), so it is a bit hard to test. Which of the matrix shapes are failing and by how much? Can you perhaps print the result of this line for all test cases, i.e., what is the relative average error?

Yes, with thread_shape = [64, 256] I get the correct results:

[screenshot: passing test output with thread_shape = [64, 256]]

However, with [128, 128] I get an error:

[screenshot: failing test output with thread_shape = [128, 128]]
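For context, a minimal sketch of how the two thread shapes above could be swept and compared, assuming a stand-in callable `run_marlin_mul` that invokes the kernel with a given tile shape (this is not Marlin's actual API, just an illustration):

```python
import torch

def sweep_thread_shapes(run_marlin_mul, C_ref, shapes=((64, 256), (128, 128))):
    # run_marlin_mul is a hypothetical wrapper around the kernel call that
    # accepts the per-threadblock tile shape and returns the output tensor.
    for thread_k, thread_n in shapes:
        C = run_marlin_mul(thread_k=thread_k, thread_n=thread_n)
        err = (torch.mean(torch.abs(C - C_ref)) / torch.mean(torch.abs(C_ref))).item()
        print(f"thread_shape=[{thread_k}, {thread_n}] rel_avg_err={err:.2e}")
```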
Qubitium commented 6 months ago

@Ageliss Which CUDA version was the failing test run on? Can you retest on the latest CUDA 12.4 and/or PyTorch 2.2.2?
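A small snippet for reporting the relevant environment details alongside the retest (standard PyTorch APIs only):

```python
import torch

# Print the toolchain and device details relevant to this issue.
print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
```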