MiscellaneousStuff / openai-whisper-cpu

Improving transcription performance of OpenAI Whisper for CPU based deployment
MIT License

Question about the minimal required changes for CPU improvement #3

Closed albertofernandezvillan closed 2 years ago

albertofernandezvillan commented 2 years ago

So if I have understood correctly, the only necessary change is to replace the custom Linear() layer in the OpenAI Whisper model with nn.Linear() in whisper/whisper/model.py, and then perform dynamic quantization:

quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

Is that correct, or are there additional changes needed to improve the transcription performance of OpenAI Whisper for CPU-based deployment?

Thanks

MiscellaneousStuff commented 2 years ago

That’s correct. Linear() is a custom layer provided by the OpenAI Whisper authors to handle the different precisions (fp32, fp16) the model can be run at. However, the dynamic quantization code traverses the model's layers and quantizes them based on a set of layer types it is told to quantize, and that set must contain built-in torch.nn layer types. When the traversal encounters the custom Linear(), it does not recognise it and skips it. Once the layer is changed to torch.nn.Linear(), the model is in a format the quantization module recognises, and no further modifications are necessary.
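This skipping behaviour is easy to reproduce in isolation: `torch.quantization.quantize_dynamic` matches layer types exactly, so a subclass of `nn.Linear` is not swapped for its quantized counterpart. The sketch below uses a hypothetical `CustomLinear` stand-in (not the actual Whisper code, which lives in whisper/model.py) to illustrate the point.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for Whisper's custom Linear layer; the real one
# casts weights to the input dtype. Only the subclassing matters here.
class CustomLinear(nn.Linear):
    def forward(self, x):
        return super().forward(x)

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.custom = CustomLinear(8, 8)  # type is CustomLinear, not nn.Linear
        self.plain = nn.Linear(8, 8)      # exact match for the quantize set

    def forward(self, x):
        return self.plain(self.custom(x))

model = TinyModel().eval()

# Ask for every torch.nn.Linear to be dynamically quantized to int8.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The custom subclass is left untouched; the plain nn.Linear is replaced
# by a dynamically quantized module.
print(type(qmodel.custom).__name__)
print(type(qmodel.plain).__name__)
```

Because the lookup is by exact type, simply redefining Whisper's Linear as a plain `torch.nn.Linear` is enough for the quantizer to pick it up.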