Is there any chance of a future integration with CTransformers or something similar to allow for guided generation using quantized models on CPU? If I were to try and hack away at this, what would be the best approach?
It is definitely possible if the library exposes the forward pass of the model to get the logits. You just have to implement the same interface as in this file and it should work out of the box.
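To make the idea concrete, here is a minimal sketch of what "exposing the forward pass to get the logits" buys you. It is not this project's actual interface or the CTransformers API: `LogitsModel`, `guided_greedy_decode`, `allowed_tokens`, and `toy_model` are all hypothetical names made up for illustration. The assumption is only that the quantized backend can be wrapped in a callable that takes the token ids so far and returns one logit per vocabulary entry.

```python
import math
from typing import Callable, List, Optional, Protocol, Sequence, Set


class LogitsModel(Protocol):
    """Hypothetical interface: maps the token ids so far to next-token logits."""

    def __call__(self, token_ids: Sequence[int]) -> List[float]: ...


def guided_greedy_decode(
    model: LogitsModel,
    prompt_ids: Sequence[int],
    allowed_tokens: Callable[[Sequence[int]], Set[int]],
    max_new_tokens: int,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Greedy decoding where disallowed tokens are masked out at every step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)            # forward pass of the wrapped backend
        allowed = allowed_tokens(ids)  # constraint, e.g. derived from a regex or grammar
        if not allowed:
            break                      # nothing is permitted, so stop generating
        # Set the logit of every disallowed token to -inf so it can never be chosen.
        masked = [l if i in allowed else -math.inf for i, l in enumerate(logits)]
        next_id = max(range(len(masked)), key=masked.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_model(token_ids: Sequence[int]) -> List[float]:
    """Stand-in for a real model: vocabulary of 5 tokens, always prefers token 4."""
    return [0.0, 1.0, 2.0, 3.0, 4.0]


# Unconstrained, the toy model would pick token 4; the mask forces token 2 instead.
print(guided_greedy_decode(toy_model, [0], lambda ids: {1, 2}, max_new_tokens=3))
# [0, 2, 2, 2]
```

For a real integration, `toy_model` would be replaced by a thin wrapper around the quantized library's forward pass, and `allowed_tokens` by whatever the guided-generation logic computes for the current state. Sampling instead of greedy selection would work the same way, since the mask is applied before the token is chosen.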