MiscellaneousStuff / openai-whisper-cpu

Improving transcription performance of OpenAI Whisper for CPU based deployment
MIT License

Is it possible to take advantage of the quantization while using a separate fork? #6

Closed · petiatil closed this issue 1 year ago

petiatil commented 1 year ago

For instance:

The following appears to work with a separate fork (which I need for its own augmentations), but is it taking advantage of the quantization?

For a large file, I could only test the quantized model, as the other fork appears to run into an error while transcribing.

If the following doesn't work, is there a direct way to do it?

import torch
import randomWhisperFork  # placeholder name for the separate Whisper fork

model_fp32 = randomWhisperFork.load_model(
    name="small.en",
    device="cpu",
    # in_memory=True,
)

quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

result = quantized_model.transcribe(audio_file_path)
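
One way to check whether the quantization actually applied is to count the dynamically quantized Linear modules in the converted model. A minimal sketch, assuming the quantized_model from the snippet above (the import path of the quantized Linear class differs across PyTorch versions):

import torch

# If the fork still uses Whisper's custom Linear subclass, quantize_dynamic converts
# nothing and the first count below will be zero.
try:
    from torch.ao.nn.quantized.dynamic import Linear as QuantizedDynamicLinear  # newer PyTorch
except ImportError:
    from torch.nn.quantized.dynamic import Linear as QuantizedDynamicLinear     # older PyTorch

n_quantized = sum(isinstance(m, QuantizedDynamicLinear) for m in quantized_model.modules())
n_float = sum(isinstance(m, torch.nn.Linear) for m in quantized_model.modules())
print(f"dynamically quantized Linear layers: {n_quantized}, remaining fp32 Linear layers: {n_float}")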
MiscellaneousStuff commented 1 year ago

Hello there. As long as your fork of the original “whisper” repo from OpenAI has the Linear() layers replaced with nn.Linear() layers, as in https://github.com/MiscellaneousStuff/whisper/blob/main/whisper/model.py, it will work. That is the only change required to get quantisation working; with it in place, the code snippet you provided will work. This repo (openai-whisper-cpu) is not required at all to get quantisation to work; it only provides an example with benchmarks.
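
For illustration, here is a minimal, self-contained sketch (not Whisper itself; CustomLinear is a stand-in for upstream Whisper's dtype-casting Linear subclass) of why the swap matters: quantize_dynamic matches module types exactly, so a subclass of nn.Linear is left in fp32 while a plain nn.Linear gets converted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomLinear(nn.Linear):
    # Stand-in for upstream Whisper's Linear, which casts weights to the input dtype.
    def forward(self, x):
        return F.linear(
            x,
            self.weight.to(x.dtype),
            None if self.bias is None else self.bias.to(x.dtype),
        )

model = nn.Sequential(
    CustomLinear(16, 16),  # like upstream Whisper's layers: stays fp32
    nn.Linear(16, 16),     # like the fork's layers: becomes a dynamically quantized Linear
)

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)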

baxcster commented 6 months ago

> Hello there. As long as your fork of the original “whisper” repo from OpenAI has the Linear() layers replaced with nn.Linear() layers, as in https://github.com/MiscellaneousStuff/whisper/blob/main/whisper/model.py, it will work. That is the only change required to get quantisation working; with it in place, the code snippet you provided will work. This repo (openai-whisper-cpu) is not required at all to get quantisation to work; it only provides an example with benchmarks.

Thank you for creating this fork! I started using Whisper to translate things locally on my system when I didn't have a compatible GPU, which is what led me to your fork initially. Since then, I've observed that Chinese/Mandarin translations using the large model on your fork of Whisper are often MORE accurate than the results from vanilla Whisper in every official release from OpenAI that I've tried!

I'm guessing these improved results from your Whisper fork are due to the quantization.

I'd like to get quantization working on vanilla Whisper (currently v20231117) to take advantage of the updates to the project and the more modern CUDA-compatible GPU I've upgraded to, but I'm less technically savvy than the OP here and I'm unsure how to do it. Currently I don't even use a Python script to run Whisper; I just run it on a Linux system from a Python environment via the CLI (e.g. whisper [file path] --device cuda --model "large" --language Mandarin --task translate), which works great for me on both vanilla Whisper and your fork (not using cuda with the fork, of course).

Could you possibly give me the "idiot's guide" for how to use quantization on the modern vanilla/official OpenAI Whisper? I'm guessing I'd need a custom Python script to run Whisper to get quantization working, but it seems I'd have to modify things beyond that as well? Thanks for your time!!
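
A minimal sketch of what I imagine such a script might look like (an untested assumption on my part: it needs a Whisper install whose model.py already has the nn.Linear change described above, since vanilla Whisper's custom Linear layers won't be converted; also, PyTorch's dynamic int8 quantization is CPU-only, so it wouldn't use the CUDA GPU, and "audio.mp3" is a placeholder path):

import torch
import whisper  # must be a checkout/fork with the nn.Linear change described above

# Load on CPU, since dynamic int8 quantization does not run on CUDA.
model = whisper.load_model("large", device="cpu")

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Roughly the CLI invocation above, as a script:
#   whisper audio.mp3 --model large --language Mandarin --task translate
result = quantized_model.transcribe("audio.mp3", language="Mandarin", task="translate")
print(result["text"])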