Short-term. We need to monkeypatch Transformers so that the AutoModelForCausalLM.from_pretrained() hook into AutoGPTQ is routed to GPTQModel instead.
For the monkeypatch there are two paths:
1. Directly monkeypatch the Transformers code.
2. Monkeypatch the AutoGPTQ.from_quantized() class method so it is routed to GPTQModel.from_quantized() instead when Transformers makes the hook call.
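Path 2 can be sketched roughly as below. The classes here are stand-ins for the real auto_gptq and gptqmodel packages (an assumption for illustration, not the actual package internals); the point is only the patching pattern: rebind the class method Transformers already calls so it transparently lands in GPTQModel.

```python
# Stand-in for auto_gptq.AutoGPTQForCausalLM (illustrative, not the real class).
class AutoGPTQForCausalLM:
    @classmethod
    def from_quantized(cls, model_path, **kwargs):
        return f"auto_gptq loaded {model_path}"

# Stand-in for gptqmodel.GPTQModel (illustrative, not the real class).
class GPTQModel:
    @classmethod
    def from_quantized(cls, model_path, **kwargs):
        return f"gptqmodel loaded {model_path}"

# The monkeypatch: rebind AutoGPTQ's class method so Transformers'
# existing hook call is routed to GPTQModel without touching
# Transformers itself.
AutoGPTQForCausalLM.from_quantized = classmethod(
    lambda cls, model_path, **kwargs: GPTQModel.from_quantized(model_path, **kwargs)
)

# Transformers would still call AutoGPTQForCausalLM.from_quantized(),
# but the load is now serviced by GPTQModel.
print(AutoGPTQForCausalLM.from_quantized("some/model"))
```

The advantage of path 2 is that no Transformers internals are touched; the patch only needs to be applied before the first from_pretrained() call.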
Mid-term. We should also submit a PR to Transformers so the quant (AutoGPTQ) integration becomes a dynamic hook, not statically bound to any package. For this to happen, we need to design a shared generic api/hook structure so that GPTQModel and AutoGPTQ can co-exist, in addition to any future quant packages that want to hook into the loader/inference.
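One possible shape for that shared hook structure is a simple named registry that Transformers could dispatch through. Everything below is a hypothetical design sketch; none of these names exist in Transformers today.

```python
# Hypothetical registry-based loader hook (design sketch only; these
# names are assumptions, not an existing Transformers API).
_QUANT_LOADERS = {}

def register_quant_loader(name, loader):
    """Register a quant backend (e.g. 'auto_gptq', 'gptqmodel') by name."""
    _QUANT_LOADERS[name] = loader

def load_quantized(name, model_path, **kwargs):
    """Dispatch a quantized-model load to the selected backend."""
    if name not in _QUANT_LOADERS:
        raise KeyError(f"no quant backend registered under {name!r}")
    return _QUANT_LOADERS[name](model_path, **kwargs)

# Both packages co-exist by registering under distinct names; the model
# config (or user) selects which backend services the load.
register_quant_loader("gptqmodel", lambda p, **kw: f"gptqmodel: {p}")
register_quant_loader("auto_gptq", lambda p, **kw: f"auto_gptq: {p}")
print(load_quantized("gptqmodel", "some/model"))
```

A registry like this keeps Transformers agnostic: future quant packages hook in by registering a loader, with no per-package code paths inside Transformers.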
Target: v0.9.2