ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Add scripts to convert models from huggingface hub #325

Open R4ZZ3 opened 1 year ago

R4ZZ3 commented 1 year ago

Hi,

I made a small demo here https://huggingface.co/spaces/RASMUS/Whisper-youtube-crosslingual-subtitles which uses these models.

Now I am trying to convert models created during the Hugging Face Whisper fine-tuning event to be used with this implementation. I am not sure I am doing this correctly, but I would like to see a more streamlined implementation directly in this repo as a script.

I started from this model: https://huggingface.co/ales/whisper-small-belarusian

Download the model with HF instructions: [screenshot]

Save to disk and convert to .pt: [screenshot]

Then, when trying to transform: [screenshot]

Trying to add dims to the dict object: [screenshot]

Running the conversion again, model_state_dict error: [screenshot]

Assuming the original checkpoint is actually a state dict and creating a new object: [screenshot]

Run the conversion: [screenshot]

So now it succeeds. I have yet to test whether that actually works, but I would like to have this kind of conversion directly in this repo :)
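For reference, the steps above amount to roughly the following in code (a sketch: WhisperForConditionalGeneration and the wrapping format are assumptions based on the errors above, and small_dims is the hyperparameter dict shown further down in this thread):

import torch
from transformers import WhisperForConditionalGeneration

# download the fine-tuned model from the Hub
model = WhisperForConditionalGeneration.from_pretrained('ales/whisper-small-belarusian')

# convert-pt-to-ggml.py expects an OpenAI-style checkpoint: a dict with
# 'dims' and 'model_state_dict' entries, so wrap the state dict accordingly
checkpoint = {
    'dims': small_dims,                      # hyperparameter dict, defined later in this thread
    'model_state_dict': model.state_dict(),
}
torch.save(checkpoint, 'whisper-small-belarusian.pt')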

R4ZZ3 commented 1 year ago

It seems that my conversion does not work ([screenshot]). Any ideas @ggerganov?

ggerganov commented 1 year ago

You have to match the tensor names to the ones used by whisper.cpp:

https://github.com/ggerganov/whisper.cpp/blob/4e0b2069e7cc93a72cc9446ee27841e77abb927b/whisper.cpp#L823-L844
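For reference, one way to see the expected names is to load an original OpenAI checkpoint and print its state dict keys (a sketch; small.pt stands for any original Whisper model file):

import torch

ckpt = torch.load('small.pt', map_location='cpu')  # original OpenAI checkpoint
for name in ckpt['model_state_dict'].keys():
    print(name)  # e.g. decoder.blocks.0.attn.query.weight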

R4ZZ3 commented 1 year ago

I tried converting the keys with the following script, and the conversion runs smoothly:
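(Setup for the script below, as a sketch: checkpoint is the fine-tuned HF state dict, and key_list_in_whisper holds the tensor names from an original OpenAI checkpoint, used by the verification print at the end; the file names are examples.)

import torch

# fine-tuned HF weights, e.g. as saved by save_pretrained()
checkpoint = torch.load('pytorch_model.bin', map_location='cpu')

# tensor names whisper.cpp expects, taken from an original OpenAI checkpoint
key_list_in_whisper = list(torch.load('small.pt', map_location='cpu')['model_state_dict'].keys())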

new_checkpoint = {}

# decoder keys
for key, value in checkpoint.items():
    old_key = key
    if 'decoder.' in key:
        if 'model' in key:
            key = key.replace('model.','')
        if 'embed_positions' in key:
            key = key.replace('embed_positions','positional_embedding')
            key = key.replace('.weight','')
        elif 'embed_tokens' in key:
            key = key.replace('embed_tokens','token_embedding')     
        elif 'self_attn.k_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.k_proj.weight','attn.key.weight')
        elif 'self_attn.v_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.v_proj.weight','attn.value.weight')
        elif 'self_attn.v_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.v_proj.bias','attn.value.bias')
        elif 'self_attn.q_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.q_proj.weight','attn.query.weight')
        elif 'self_attn.q_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.q_proj.bias', 'attn.query.bias')
        elif 'self_attn.out_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.out_proj.weight','attn.out.weight')
        elif 'self_attn.out_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.out_proj.bias','attn.out.bias')
        elif 'self_attn_layer_norm.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn_layer_norm.weight','attn_ln.weight') 
        elif 'self_attn_layer_norm.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn_layer_norm.bias', 'attn_ln.bias')
        elif 'final_layer_norm.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('final_layer_norm.weight','mlp_ln.weight')
        elif 'final_layer_norm.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('final_layer_norm.bias', 'mlp_ln.bias') 
        elif 'fc1.weight' in key:
            key = key.replace('layers','blocks')
            key = key.replace('fc1.weight', 'mlp.0.weight')
        elif 'fc1.bias' in key:
            key = key.replace('layers','blocks')
            key = key.replace('fc1.bias', 'mlp.0.bias')
        elif 'fc2.weight' in key:
            key = key.replace('layers','blocks')
            key = key.replace('fc2.weight', 'mlp.2.weight')
        elif 'fc2.bias' in key:
            key = key.replace('layers','blocks')
            key = key.replace('fc2.bias', 'mlp.2.bias')
        # cross-attention (encoder_attn in HF -> cross_attn in whisper.cpp)
        elif 'encoder_attn.k_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn.k_proj.weight', 'cross_attn.key.weight')
        elif 'encoder_attn.v_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn.v_proj.weight', 'cross_attn.value.weight')
        elif 'encoder_attn.v_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn.v_proj.bias', 'cross_attn.value.bias')
        elif 'encoder_attn.q_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn.q_proj.weight', 'cross_attn.query.weight')
        elif 'encoder_attn.q_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn.q_proj.bias', 'cross_attn.query.bias')
        elif 'encoder_attn.out_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn.out_proj.weight', 'cross_attn.out.weight')
        elif 'encoder_attn.out_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn.out_proj.bias', 'cross_attn.out.bias')
        elif 'encoder_attn_layer_norm.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn_layer_norm.weight', 'cross_attn_ln.weight')
        elif 'encoder_attn_layer_norm.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('encoder_attn_layer_norm.bias', 'cross_attn_ln.bias')
        elif key.startswith('decoder.layer_norm'):
            key = key.replace('layer_norm', 'ln')

    elif 'encoder.' in key:
        if 'model' in key:
            key = key.replace('model.','')
        if 'embed_positions' in key:
            key = key.replace('embed_positions','positional_embedding')
            key = key.replace('.weight','')
        elif 'embed_tokens' in key:
            key = key.replace('embed_tokens','token_embedding')     
        elif 'self_attn.k_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.k_proj.weight','attn.key.weight')
        elif 'self_attn.v_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.v_proj.weight','attn.value.weight')
        elif 'self_attn.v_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.v_proj.bias','attn.value.bias')
        elif 'self_attn.q_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.q_proj.weight','attn.query.weight')
        elif 'self_attn.q_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.q_proj.bias', 'attn.query.bias')
        elif 'self_attn.out_proj.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.out_proj.weight','attn.out.weight')
        elif 'self_attn.out_proj.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn.out_proj.bias','attn.out.bias')
        elif 'self_attn_layer_norm.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn_layer_norm.weight','attn_ln.weight') 
        elif 'self_attn_layer_norm.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('self_attn_layer_norm.bias', 'attn_ln.bias')
        elif 'final_layer_norm.weight' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('final_layer_norm.weight','mlp_ln.weight')
        elif 'final_layer_norm.bias' in key:
            key = key.replace('layers', 'blocks')
            key = key.replace('final_layer_norm.bias', 'mlp_ln.bias') 
        elif 'fc1.weight' in key:
            key = key.replace('layers','blocks')
            key = key.replace('fc1.weight', 'mlp.0.weight')
        elif 'fc1.bias' in key:
            key = key.replace('layers','blocks')
            key = key.replace('fc1.bias', 'mlp.0.bias')
        elif 'fc2.weight' in key:
            key = key.replace('layers','blocks')
            key = key.replace('fc2.weight', 'mlp.2.weight')
        elif 'fc2.bias' in key:
            key = key.replace('layers','blocks')
            key = key.replace('fc2.bias', 'mlp.2.bias')
        elif key.startswith('encoder.layer_norm'):
            key = key.replace('layer_norm', 'ln_post')  
    # verify the renaming against the names whisper.cpp expects
    print(f'{old_key} --> {key}')
    if key not in key_list_in_whisper:
        print("KEY NOT FOUND")

    new_checkpoint[key] = value
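After the loop, a quick sanity check helps catch renames that collided or never matched (a sketch):

# the renaming must not merge two keys into one
assert len(new_checkpoint) == len(checkpoint)

# any keys whisper.cpp will not recognize
print('unmatched:', [k for k in new_checkpoint if k not in key_list_in_whisper])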


small_dims = {
    'n_mels': 80,
    'n_vocab': 51865,
    'n_audio_ctx': 1500,
    'n_audio_state': 768,
    'n_audio_head': 12,
    'n_audio_layer': 12,
    'n_text_ctx': 448,
    'n_text_state': 768,
    'n_text_head': 12,
    'n_text_layer': 12,
}

object_with_dims_and_state_dict = {}
object_with_dims_and_state_dict['model_state_dict'] = new_checkpoint
object_with_dims_and_state_dict['dims'] = small_dims

torch.save(object_with_dims_and_state_dict, 'testaa.pt')
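Instead of hardcoding small_dims, the values can also be read from the model's config.json (a sketch; the field names are those of transformers' WhisperConfig):

import json

with open('config.json') as f:  # from the Hugging Face model directory
    cfg = json.load(f)

small_dims = {
    'n_mels':        cfg['num_mel_bins'],
    'n_vocab':       cfg['vocab_size'],
    'n_audio_ctx':   cfg['max_source_positions'],
    'n_audio_state': cfg['d_model'],
    'n_audio_head':  cfg['encoder_attention_heads'],
    'n_audio_layer': cfg['encoder_layers'],
    'n_text_ctx':    cfg['max_target_positions'],
    'n_text_state':  cfg['d_model'],
    'n_text_head':   cfg['decoder_attention_heads'],
    'n_text_layer':  cfg['decoder_layers'],
}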

Then I run the conversion, and it succeeds.

Running the model on a test sample: [screenshot]

R4ZZ3 commented 1 year ago

Also, @baya created a similar script here: https://colab.research.google.com/github/Vaibhavs10/notebooks/blob/main/transformers_whisper_ckpt_to_OAI.ipynb

I used it, but I get the same result:

My conversion command (arguments: model to convert, path to the OpenAI whisper repo, output path for the new model):

python models/convert-pt-to-ggml.py /mnt/f/Omat_opiskelut/whisper_transformaatio/whisper.cpp/openai_whisper/flozi00_whisper-small-german_OAI /mnt/f/Omat_opiskelut/whisper_transformaatio/whisper.cpp/openai_whisper/whisper ./models/testaa

Run command: ./main -m models/testaa/ggml-model.bin -f samples/jfk.wav

ggerganov commented 1 year ago

Maybe uncomment the following print to get some more info and try to debug:

https://github.com/ggerganov/whisper.cpp/blob/00ea21668b7db98e0530324c0bc1bff53df6995c/whisper.cpp#L1194