If by CUDA you mean GPTQ-for-LLaMa or similar, then the algorithm is the same, and the raw data is the same, but the format the quantized model is stored in might be slightly different.
> If by CUDA you mean GPTQ-for-LLaMa or similar, then the algorithm is the same, and the raw data is the same, but the format the quantized model is stored in might be slightly different.
I'm assuming that if I store everything in the .safetensors format, it should be fine / interchangeable? Maybe this is not the case?
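For what it's worth, safetensors itself only stores raw tensors with dtype and shape metadata, so the container format doesn't impose any convention on the values. A minimal sketch of round-tripping some quantized-layer tensors (the tensor names here are illustrative; real GPTQ checkpoints use per-layer keys such as `model.layers.0.self_attn.q_proj.qweight`):

```python
import torch
from safetensors.torch import save_file, load_file

# Hypothetical stand-ins for a GPTQ layer's packed weights,
# scales, and packed zero points.
tensors = {
    "qweight": torch.randint(0, 2**31, (128, 32), dtype=torch.int32),
    "scales": torch.rand(128, 1, dtype=torch.float16),
    "qzeros": torch.randint(0, 2**31, (128, 4), dtype=torch.int32),
}

save_file(tensors, "quantized.safetensors")

# Loading preserves dtype and shape exactly as saved; the format
# itself says nothing about how the values were produced or what
# dtype a given kernel expects them in.
loaded = load_file("quantized.safetensors")
assert loaded["scales"].dtype == torch.float16
```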
I meant more that, for example, my quantize script stores values in float16 since that's what the Triton kernels expect, whereas historically GPTQ-for-LLaMa stored and used them as float32.
I'm assuming that it is OK to store the quantization parameters in float16 and cast them to float32 where an implementation expects that (and vice versa). Is this true?
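A minimal sketch of the conversion in question, assuming a checkpoint whose floating-point parameters (e.g. scales) were stored as float32 and a kernel that expects float16; the file paths are illustrative:

```python
import torch
from safetensors.torch import load_file, save_file

# Load an existing checkpoint (path is illustrative).
state = load_file("gptq-model.safetensors")

# Cast only floating-point tensors. Packed integer tensors such as
# qweight/qzeros must be left untouched, since their bit layout *is*
# the quantized data.
converted = {
    name: t.to(torch.float16) if t.is_floating_point() else t
    for name, t in state.items()
}

save_file(converted, "gptq-model-fp16.safetensors")
```

Note that float32 to float16 is a narrowing conversion, so it is not strictly lossless; whether the precision loss in the scales matters in practice is a separate question from whether the loader accepts the file.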