fpgaminer / GPTQ-triton

GPTQ inference Triton kernel
Apache License 2.0

Can I use a CUDA kernel with a model quantized using triton & vice-versa? #16

Closed vedantroy closed 1 year ago

vedantroy commented 1 year ago

I'm assuming that it is OK to:

- run a model quantized with this repo's Triton code using a CUDA kernel (e.g. GPTQ-for-LLaMa), and
- run a model quantized with the CUDA code using the Triton kernel.

Is this true?

fpgaminer commented 1 year ago

If by CUDA you mean GPTQ-for-LLaMa or similar, then the algorithm is the same, and the raw data is the same, but the format the quantized model is stored in might be slightly different.

vedantroy commented 1 year ago

> If by CUDA you mean GPTQ-for-LLaMa or similar, then the algorithm is the same, and the raw data is the same, but the format the quantized model is stored in might be slightly different.

I'm assuming that if I store everything in the .safetensors format, it should be fine / interchangeable? Maybe this is not the case?

fpgaminer commented 1 year ago

I meant more that, for example, my quantize script stores values in float16 since that's what the Triton kernels expect, whereas historically GPTQ-for-LLaMa stored and used them as float32.
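For anyone hitting this later, here is a minimal sketch of how one might cast such a checkpoint before loading it with the Triton kernels. It assumes the checkpoint is a `.safetensors` file and that the quantization parameters follow the common `scales` / `zeros` naming; the file paths and name heuristics are illustrative, not part of this repo's format.

```python
# Sketch: cast float32 quantization tensors to float16 so a GPTQ-for-LLaMa-style
# checkpoint matches what the Triton kernels expect. Paths and the
# "scales"/"zeros" name check are assumptions for illustration.
import torch
from safetensors.torch import load_file, save_file

state_dict = load_file("llama-7b-4bit.safetensors")  # hypothetical input path

converted = {}
for name, tensor in state_dict.items():
    if tensor.dtype == torch.float32 and ("scales" in name or "zeros" in name):
        # Older GPTQ-for-LLaMa checkpoints kept these in float32;
        # the Triton kernels here work with float16.
        converted[name] = tensor.to(torch.float16)
    else:
        converted[name] = tensor

save_file(converted, "llama-7b-4bit-fp16.safetensors")  # hypothetical output path
```

Whether this is sufficient depends on the two codebases otherwise agreeing on tensor names and packing, so it's worth diffing the state dict keys as well.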