ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Running Encoder and Decoder model separately #1801

Open · rojinbakhti opened this issue 7 months ago

rojinbakhti commented 7 months ago

Hi, can I use the APIs to convert and run my Whisper encoder and decoder models separately? If yes, how do I do this?

bobqianic commented 7 months ago

Yes, that's possible. However, you can't run the encoder with a small model and then switch to using a large model for the decoder. This is because the dimensions of the cross-attention tensors won't be compatible between the two different model sizes.

https://github.com/ggerganov/whisper.cpp/blob/1cf679dec4eca99aeaed4fe09a8092803bdecfc1/whisper.h#L259-L293
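For reference, here is a minimal sketch of driving the encoder and decoder separately through the public API in whisper.h. The model path, thread count, and placeholder audio are assumptions for illustration, not taken from this thread:

```c
#include "whisper.h"

int main(void) {
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context * ctx = whisper_init_from_file_with_params("ggml-base.en.bin", cparams);
    if (ctx == NULL) return 1;

    // placeholder: 1 second of 16 kHz mono audio (use your real samples here)
    static float pcm[16000];
    whisper_pcm_to_mel(ctx, pcm, 16000, 4);

    // run only the encoder on the mel spectrogram, starting at offset 0
    whisper_encode(ctx, 0, 4);

    // run only the decoder: feed the start-of-transcript token and reuse the encoder output
    whisper_token tokens[1] = { whisper_token_sot(ctx) };
    whisper_decode(ctx, tokens, 1, 0, 4);

    // logits for the last decoded token: whisper_n_vocab(ctx) floats
    const float * logits = whisper_get_logits(ctx);
    (void) logits;

    whisper_free(ctx);
    return 0;
}
```

Note that, as mentioned above, both calls must go through the same loaded model, since the cross-attention tensor dimensions differ between model sizes.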

rojinbakhti commented 7 months ago

Thanks for the reply. If I have my encoder output (the encoder hidden states) as a float array, say from running the encoder with the TensorFlow APIs, is there a way to feed it into whisper_decode? (i.e. calling whisper_decode without calling whisper_encode first, assuming I already have the encoder output and the tokens I want to feed the decoder.)

rojinbakhti commented 7 months ago

@bobqianic Hi, have you had a chance to look at this question? Thanks.

rojinbakhti commented 7 months ago

@ggerganov

bobqianic commented 7 months ago

> @bobqianic Hi, have you had a chance to look at this question? Thanks.

You'll need to manually modify wstate.kv_cross.

Essentially, this portion of the encoder's code is designed to directly copy the intermediate tensors Kcross and Vcross into wstate.kv_cross.k and wstate.kv_cross.v. Otherwise, the intermediate tensor would be overwritten during subsequent executions.

https://github.com/ggerganov/whisper.cpp/blob/7a74e929c842489010f641156f2a5ac733b17016/whisper.cpp#L2099-L2108

In this section of the decoder's code, wstate.kv_cross.k and wstate.kv_cross.v are used.

https://github.com/ggerganov/whisper.cpp/blob/7a74e929c842489010f641156f2a5ac733b17016/whisper.cpp#L2411-L2434
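A rough sketch of what "manually modify wstate.kv_cross" could look like, assuming you already have the per-layer projected cross-attention keys/values (not the raw encoder hidden states) converted to the cache's element type. Since whisper_state is opaque in whisper.h, this would have to live inside whisper.cpp itself; the names k_proj, v_proj, n_audio_ctx, n_state, and n_layer are placeholders:

```c
// Hypothetical sketch, inside whisper.cpp where wstate is accessible.
// k_proj[il] / v_proj[il]: per-layer projected cross-attention K/V, each holding
// n_audio_ctx * n_state elements already converted to the cache's element type.
// NOTE: this assumes K and V use the same simple per-layer layout; check how kv_cross
// is actually laid out in your whisper.cpp revision before relying on it.
const size_t row_size = ggml_element_size(wstate.kv_cross.k) * n_state * n_audio_ctx;

for (int il = 0; il < n_layer; ++il) {
    ggml_backend_tensor_set(wstate.kv_cross.k, k_proj[il], il*row_size, row_size);
    ggml_backend_tensor_set(wstate.kv_cross.v, v_proj[il], il*row_size, row_size);
}
```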

bobqianic commented 7 months ago

whisper_init_from_file_with_params() -> whisper_init_state() -> modify state -> whisper_decode_with_state()
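Sketched out, that sequence might look roughly like the following. The model path and thread count are placeholders, and the "modify state" step has no public API, so it implies patching whisper.cpp as discussed above:

```c
struct whisper_context_params cparams = whisper_context_default_params();
struct whisper_context * ctx   = whisper_init_from_file_with_params("ggml-base.en.bin", cparams);
struct whisper_state   * state = whisper_init_state(ctx);

// ... modify state here: populate its cross-attention KV cache from your
//     precomputed encoder output (see the kv_cross discussion above) ...

// decode with the externally prepared state
whisper_token tokens[1] = { whisper_token_sot(ctx) };
whisper_decode_with_state(ctx, state, tokens, 1, 0, 4);

const float * logits = whisper_get_logits_with_state(state);
(void) logits;

whisper_free_state(state);
whisper_free(ctx);
```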

ggerganov commented 7 months ago

@rojinbakhti As @bobqianic correctly explained, it's not possible because the cross-attention KV cache will not be populated if you feed the encoder output directly to whisper_decode(). One option to make this possible is to expose a whisper_cross() API that precomputes the cross-attention cache. It shouldn't be a difficult change, but I'm not sure if I'll be able to take a look soon. Hopefully someone from the community can give it a try.

> Otherwise, the intermediate tensor would be overwritten during subsequent executions.

@bobqianic Not only that, but it also allows calling whisper_decode() multiple times and reusing the cross-attention results instead of recomputing them on each call.
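For illustration only, a hypothetical signature for such an API might look like the following. whisper_cross_with_state() does not exist in whisper.h; it is just one way the proposal above could be shaped:

```c
// Hypothetical, not part of whisper.h: takes a precomputed encoder output and fills the
// cross-attention KV cache of the given state by applying each decoder layer's
// cross-attention K/V projections, so whisper_decode_with_state() can then be called
// repeatedly without ever running whisper_encode().
WHISPER_API int whisper_cross_with_state(
        struct whisper_context * ctx,
        struct whisper_state   * state,
        const float            * encoder_embd,  // n_audio_ctx * n_audio_state floats, row-major
              int                n_audio_ctx,
              int                n_threads);
```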