Closed: siddhsql closed this issue 1 year ago.
The answer is yes; please check the llmtune and falcontune projects on GitHub.
If the quantized model is kept static, e.g. as in QLoRA, and you are only finetuning biases/scales/adapters/etc., you can generally perform the quantization in whatever way you want, i.e. also with GPTQ. This seems to be exactly what the llmtune project mentioned by wyklq in the previous comment is doing.
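To make the distinction concrete, here is a minimal PyTorch sketch (not llmtune's actual code) of the pattern described above: the base weight is quantized once up front, stored frozen, and only a small LoRA adapter is trained. `FrozenQuantLinearWithLoRA` and its round-to-nearest quantizer are hypothetical placeholders; in a real setup the one-off quantization step could just as well be GPTQ.

```python
import torch
import torch.nn as nn

class FrozenQuantLinearWithLoRA(nn.Module):
    """Linear layer whose weight is quantized once (placeholder round-to-nearest;
    could be GPTQ) and then frozen; only the low-rank LoRA adapter is trained."""

    def __init__(self, weight: torch.Tensor, rank: int = 8, n_bits: int = 4):
        super().__init__()
        out_features, in_features = weight.shape
        # One-off quantization of the base weight.
        qmax = 2 ** (n_bits - 1) - 1
        scale = weight.abs().amax(dim=1, keepdim=True) / qmax
        q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
        # Stored as buffers: no gradients, never updated during finetuning.
        self.register_buffer("q_weight", q.to(torch.int8))
        self.register_buffer("scale", scale)
        # Trainable low-rank adapter (LoRA): delta_W = B @ A.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.q_weight.float() * self.scale          # dequantize frozen base
        base = x @ w.t()
        adapter = (x @ self.lora_A.t()) @ self.lora_B.t()
        return base + adapter


# Only the adapter parameters receive gradients; the quantized base stays static.
layer = FrozenQuantLinearWithLoRA(torch.randn(64, 128), rank=4)
opt = torch.optim.AdamW([layer.lora_A, layer.lora_B], lr=1e-3)
loss = layer(torch.randn(2, 128)).pow(2).mean()
loss.backward()
opt.step()
```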
On the other hand, for applications where you want to fully requantize the whole model at each step and thus require an extremely fast quantizer, as in full quantization-aware training, GPTQ will probably be a bit too slow.
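For contrast, here is a minimal sketch of quantization-aware training with a simple straight-through estimator, where the full-precision weight is requantized on every forward pass. `fake_quant` and `QATLinear` are illustrative names, not a real library API; the point is only that the quantizer sits inside every training step, which is why a slow solver-style quantizer like GPTQ would become the bottleneck here.

```python
import torch
import torch.nn as nn

def fake_quant(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Requantize the weight with a cheap round-to-nearest quantizer and pass
    gradients through unchanged (straight-through estimator)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()   # forward uses w_q, backward sees identity

class QATLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The weight is requantized on *every* forward pass, so the quantizer
        # is on the critical path of each training step.
        return x @ fake_quant(self.weight).t()

layer = QATLinear(128, 64)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
for _ in range(3):
    opt.zero_grad()
    loss = layer(torch.randn(8, 128)).pow(2).mean()
    loss.backward()
    opt.step()
```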
I think the answer is no, but I wanted to check. Can some expert let me know? Thanks.