ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

Running `convert-h5-to-ggml.py` on distil-whisper gives a HeaderTooLarge error #1711

Open · PoignardAzur opened this issue 9 months ago

PoignardAzur commented 9 months ago

I followed the steps outlined in models/README.md:

git clone https://github.com/ggerganov/whisper.cpp
git clone https://huggingface.co/distil-whisper/distil-medium.en
git clone https://huggingface.co/distil-whisper/distil-large-v2

# convert to ggml
python3 whisper.cpp/models/convert-h5-to-ggml.py distil-medium.en/ whisper.cpp/ some-output-folder/

Doing so gives me the following stack trace:

Traceback (most recent call last):
  File "/home/olivier-faure/Documents/whisper.cpp/models/convert-h5-to-ggml.py", line 87, in <module>
    model = WhisperForConditionalGeneration.from_pretrained(dir_model)
  File "/home/olivier-faure/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3372, in from_pretrained
    with safe_open(resolved_archive_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

I've confirmed with print-debugging that the file being opened is distil-medium.en/model.safetensors. I have the same problem with distil-large-v2.

bobqianic commented 9 months ago

python3 whisper.cpp/models/convert-h5-to-ggml.py distil-medium.en/ whisper.cpp/ some-output-folder/

Replace whisper.cpp/ with the path to the OpenAI Whisper repository.

See https://github.com/ggerganov/whisper.cpp/discussions/1414#discussioncomment-7461216
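
In other words, the second argument has to be a local clone of the upstream openai/whisper repository, not of whisper.cpp. Following models/README.md, the sequence would look roughly like this (folder names as used above):

git clone https://github.com/openai/whisper
git clone https://github.com/ggerganov/whisper.cpp
git clone https://huggingface.co/distil-whisper/distil-medium.en

# convert to ggml; whisper/ is the openai/whisper clone, not whisper.cpp/
python3 whisper.cpp/models/convert-h5-to-ggml.py distil-medium.en/ whisper/ some-output-folder/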

PoignardAzur commented 9 months ago

Well oops.

I still get the same error though:

> python3 whisper.cpp/models/convert-h5-to-ggml.py distil-medium.en/ whisper/ ~/Documents/models/
(log I added) filepath:  /home/olivier-faure/Documents/distil-medium.en/model.safetensors
Traceback (most recent call last):
  File "/home/olivier-faure/Documents/whisper.cpp/models/convert-h5-to-ggml.py", line 87, in <module>
    model = WhisperForConditionalGeneration.from_pretrained(dir_model)
  File "/home/olivier-faure/Documents/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3372, in from_pretrained
    with safe_open(resolved_archive_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
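
Not something confirmed in this thread, but a common way to end up with HeaderTooLarge is cloning a Hugging Face repo without git-lfs installed: model.safetensors is then a small text pointer file instead of the real weights, and safetensors misreads its first bytes as an enormous header length. A quick sanity check, using the path from the log above as an example:

import os, struct

path = "distil-medium.en/model.safetensors"  # illustrative; use the path printed in the log above
with open(path, "rb") as f:
    head = f.read(8)

if head.startswith(b"version "):
    # A Git LFS pointer file starts with "version https://git-lfs.github.com/spec/v1"
    print("This is a Git LFS pointer, not the weights; run `git lfs pull` inside the model repo")
else:
    # safetensors stores the JSON header length in the first 8 bytes (little-endian u64);
    # HeaderTooLarge means this value is implausibly large
    header_len = struct.unpack("<Q", head)[0]
    print(f"file size: {os.path.getsize(path)} bytes, declared header length: {header_len} bytes")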

samolego commented 7 months ago

Hi, I've fine-tuned Whisper myself for the Slovenian language. I'm running the command:

python ./whisper.cpp/models/convert-h5-to-ggml.py ./whisper-small-sl-mozilla ./whisper .

But I also get the HeaderTooLarge error:

Traceback (most recent call last):
  File "/home/samoh/Documents/school/3.letnik/diploma/./whisper.cpp/models/convert-h5-to-ggml.py", line 87, in <module>
    model = WhisperForConditionalGeneration.from_pretrained(dir_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/samoh/Documents/school/3.letnik/diploma/test_whisper/.venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3284, in from_pretrained
    with safe_open(resolved_archive_file, framework="pt") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
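
One way to separate the two moving parts here is to load the checkpoint with safetensors directly, outside convert-h5-to-ggml.py; that shows whether the script or the file itself is at fault. A minimal sketch, using samolego's local folder name from the command above:

from safetensors import safe_open

# if this also raises HeaderTooLarge, the problem is the model.safetensors file itself,
# not the conversion script
with safe_open("./whisper-small-sl-mozilla/model.safetensors", framework="pt") as f:
    print(list(f.keys())[:5])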