cmdicely closed this issue 1 year ago
It is intended. First, you convert the PyTorch `pth` checkpoint to `bin`, which stores the model weights as-is. Then, optionally, you quantize the `bin` file into some other format.
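A minimal sketch of what the first step might look like, assuming a made-up flat file layout (per-tensor name, shape, then fp16 data). The format and the function name `convert_pth_to_bin` are illustrative assumptions, not the project's actual conversion script:

```python
# Hypothetical sketch: convert a PyTorch .pth checkpoint to a flat .bin file,
# storing weights as-is in fp16 (no quantization). The file layout here is
# made up for illustration only.
import struct
import sys

import torch


def convert_pth_to_bin(pth_path: str, bin_path: str) -> None:
    # torch.load reads the whole checkpoint into RAM; this is why the
    # conversion step needs a machine with enough memory for the full model.
    state_dict = torch.load(pth_path, map_location="cpu")

    with open(bin_path, "wb") as out:
        for name, tensor in state_dict.items():
            if not torch.is_tensor(tensor):
                continue  # skip non-tensor entries in the checkpoint
            data = tensor.to(torch.float16).numpy()
            name_bytes = name.encode("utf-8")
            # Per-tensor header: name length, name, ndim, shape, then raw data.
            out.write(struct.pack("<I", len(name_bytes)))
            out.write(name_bytes)
            out.write(struct.pack("<I", data.ndim))
            out.write(struct.pack(f"<{data.ndim}I", *data.shape))
            out.write(data.tobytes())


if __name__ == "__main__":
    convert_pth_to_bin(sys.argv[1], sys.argv[2])
```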
The conversion and quantization stages are split so that quantization can run on lower-RAM devices. If quantization always required a PyTorch file, it would need to read the whole checkpoint into RAM, which may not be possible for bigger models.
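A corresponding sketch of the second step under the same assumed layout: the `bin` file is read sequentially, one tensor at a time, so peak memory stays around the size of a single tensor rather than the whole model. The int8-plus-per-tensor-scale output format is also just an illustration, not the real quantization scheme:

```python
# Hypothetical sketch: quantize the flat .bin file tensor by tensor.
# Because tensors are streamed one at a time, only one tensor has to be
# resident in RAM at any moment, which is the point of splitting the stages.
import struct
import sys

import numpy as np


def quantize_bin(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            header = src.read(4)
            if not header:
                break  # end of file
            (name_len,) = struct.unpack("<I", header)
            name = src.read(name_len)
            (ndim,) = struct.unpack("<I", src.read(4))
            shape = struct.unpack(f"<{ndim}I", src.read(4 * ndim))

            count = int(np.prod(shape))
            weights = np.frombuffer(src.read(count * 2), dtype=np.float16)

            # Simple symmetric int8 quantization with one scale per tensor.
            scale = float(np.abs(weights).max()) / 127.0 or 1.0
            q = np.round(weights.astype(np.float32) / scale).astype(np.int8)

            # Same header layout as the input, plus the scale before the data.
            dst.write(struct.pack("<I", name_len))
            dst.write(name)
            dst.write(struct.pack("<I", ndim))
            dst.write(struct.pack(f"<{ndim}I", *shape))
            dst.write(struct.pack("<f", scale))
            dst.write(q.tobytes())


if __name__ == "__main__":
    quantize_bin(sys.argv[1], sys.argv[2])
```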
On Windows 11, installed per the instructions, conversion seems to support only float16/float32, not quantized formats.