I found the code below in quantize.py, and it looks like quantization_code only supports running on a GPU.
Is there any suggestion for deploying the model on an NPU?
Could you provide the source behind quantization_code, so that I can try rewriting it to support running on an NPU device?
import base64
import bz2
from typing import List

# LazyKernelCModule and KernelFunction are provided by the cpm_kernels package.
from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction


class Kernel:
    def __init__(self, code: bytes, function_names: List[str]):
        self.code = code
        self._function_names = function_names
        # Lazily load the decompressed blob as a compiled CUDA module.
        self._cmodule = LazyKernelCModule(self.code)
        # Expose each named kernel as a callable attribute,
        # e.g. kernels.int4WeightExtractionHalf(...).
        for name in self._function_names:
            setattr(self, name, KernelFunction(self._cmodule, name))


quantization_code = "XXXX"  # base64-encoded, bz2-compressed kernel blob (elided)

kernels = Kernel(
    bz2.decompress(base64.b64decode(quantization_code)),
    [
        "int4WeightCompression",
        "int4WeightExtractionFloat",
        "int4WeightExtractionHalf",
        "int8WeightExtractionFloat",
        "int8WeightExtractionHalf",
    ],
)
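For what it's worth, quantization_code is just a base64 string wrapping bz2-compressed bytes, so the embedded payload can be recovered directly from quantize.py without any extra code from the authors. A minimal sketch (assuming quantization_code holds the full string from quantize.py; note the decompressed bytes may be compiled PTX or a fatbin rather than readable CUDA source):

import base64
import bz2

# Paste the full string from quantize.py here (elided in the snippet above).
quantization_code = "XXXX"

blob = bz2.decompress(base64.b64decode(quantization_code))

# Dump the blob for inspection: PTX shows up as readable assembly-like text,
# while a fatbin is mostly binary.
with open("quantization_kernels.bin", "wb") as f:
    f.write(blob)

print(blob[:200])  # peek at the first bytes to guess the format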
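In the meantime, one way to get the model running on an NPU is to replace the CUDA kernels with plain tensor ops, which the torch_npu plugin can dispatch to the device. Below is a rough, untested sketch of what the int8/int4 WeightExtractionHalf kernels appear to compute (dequantizing int8-stored weights, or two packed 4-bit values per byte, back to half precision with per-row scales); the function name and the packing order (high nibble first) are my assumptions, not taken from the shipped kernels:

import torch

def extract_weight_to_half(weight: torch.Tensor,
                           scale_list: torch.Tensor,
                           source_bit_width: int) -> torch.Tensor:
    """Dequantize int8-stored weights on whatever device `weight` lives on.

    weight:     int8 tensor of shape [rows, cols]; for 4-bit, each byte is
                assumed to pack two 4-bit values, high nibble first.
    scale_list: per-row scales of shape [rows], in the target half dtype.
    """
    if source_bit_width == 8:
        return weight.to(scale_list.dtype) * scale_list[:, None]
    elif source_bit_width == 4:
        # Sign-extend the two packed nibbles via arithmetic shifts on int8.
        high = (weight >> 4).to(scale_list.dtype)
        low = ((weight << 4) >> 4).to(scale_list.dtype)
        unpacked = torch.stack((high, low), dim=-1).reshape(weight.shape[0], -1)
        return unpacked * scale_list[:, None]
    else:
        raise ValueError(f"Unsupported bit width: {source_bit_width}")

On an Ascend NPU you would import torch_npu and move weight and scale_list to the "npu" device first; since everything above is elementwise ops and reshapes, it should dispatch without any custom kernels, at the cost of some speed compared to the fused CUDA versions.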