Open rojinbakhti opened 7 months ago
Yes, that's possible. However, you can't run the encoder with a small model and then switch to using a large model for the decoder. This is because the dimensions of the cross-attention tensors won't be compatible between the two different model sizes.
Thanks for the reply. If I have my encoder output (the encoder hidden states) as a float array, say from running the encoder part with the TensorFlow APIs, is there a way to feed it into `whisper_decode()`? (i.e. calling `whisper_decode()` without calling `whisper_encode()` first, assuming I already have the encoder output and the tokens I want to feed the decoder.)
@bobqianic Hi, have you had a chance to look at this question? Thanks.
@ggerganov
You'll need to manually modify `wstate.kv_cross`.
Essentially, this portion of the encoder's code is designed to directly copy the intermediate tensors `Kcross` and `Vcross` into `wstate.kv_cross.k` and `wstate.kv_cross.v`. Otherwise, the intermediate tensor would be overwritten during subsequent executions.
In this section of the decoder's code, `wstate.kv_cross.k` and `wstate.kv_cross.v` are used.
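To make the data flow concrete, here is a minimal self-contained sketch of that pattern. The struct and function names are illustrative stand-ins, not actual whisper.cpp code — in the real implementation `wstate.kv_cross.k` / `.v` are ggml tensors and the copy happens inside the compute graph:

```cpp
#include <vector>

// Simplified stand-in for whisper.cpp's cross-attention KV cache
// (wstate.kv_cross in the real code, where k and v are ggml tensors).
struct kv_cache {
    std::vector<float> k;
    std::vector<float> v;
};

// "Encoder" side: copy the intermediate Kcross/Vcross tensors into the
// cache so they survive subsequent graph executions, which would
// otherwise overwrite the intermediate buffers.
void encoder_store_cross_kv(kv_cache & cache,
                            const std::vector<float> & Kcross,
                            const std::vector<float> & Vcross) {
    cache.k = Kcross;  // a memcpy into the cache tensor in the real code
    cache.v = Vcross;
}

// "Decoder" side: cross-attention only reads the cached K/V; it never
// recomputes them from the encoder output.
float decoder_cross_attend(const kv_cache & cache, std::size_t i) {
    return cache.k[i] + cache.v[i];  // placeholder for the real attention math
}
```

The key point is that the decoder touches only the cache, which is why feeding an external encoder output requires populating `wstate.kv_cross` yourself.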
```
whisper_init_from_file_with_params()
  -> whisper_init_state()
  -> modify state
  -> whisper_decode_with_state()
```
@rojinbakhti As @bobqianic correctly explained, it's not possible because the cross-attention KV cache will not be populated if you feed the encoder output directly to `whisper_decode()`. One option to make this possible is to expose a `whisper_cross()` API that precomputes the cross-attention cache. It shouldn't be a difficult change, but I'm not sure if I'll be able to take a look soon. Hopefully someone from the community can give it a try.
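As a rough sketch of what such an API could look like: `whisper_cross()` does not exist in `whisper.h` today, and the struct definitions below are mock stand-ins for the library's opaque types, so everything here is an assumption about the proposed shape, not the actual interface:

```c
#include <stddef.h>

/* Mock stand-ins for whisper.cpp's opaque context/state types;
 * the real ones are declared (opaquely) in whisper.h. */
struct whisper_context { int unused; };
struct whisper_state   { const float * kv_cross_src; size_t n; };

/* Hypothetical API: take an externally computed encoder output
 * (e.g. from running the encoder with TensorFlow) and populate the
 * cross-attention KV cache in `state`, so that a subsequent
 * whisper_decode_with_state() can run without whisper_encode().
 * Returns 0 on success, -1 on invalid arguments. */
int whisper_cross(struct whisper_context * ctx,
                  struct whisper_state * state,
                  const float * encoder_output,  /* assumed layout: n_audio_ctx x n_state */
                  size_t n_elements) {
    (void)ctx;
    if (!state || !encoder_output) return -1;
    /* The real implementation would run the K/V cross projections here
     * and write the results into state->kv_cross; this mock just records
     * the source buffer. */
    state->kv_cross_src = encoder_output;
    state->n = n_elements;
    return 0;
}
```

The expected layout of `encoder_output` (here assumed `n_audio_ctx x n_state`) would need to match what `whisper_encode()` produces internally.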
> Otherwise, the intermediate tensor would be overwritten during subsequent executions.
@bobqianic Not only that, but it also allows calling `whisper_decode()` multiple times and reusing the cross-attention instead of computing it each time.
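A toy illustration of that reuse pattern, with a counter to make it visible — all names are illustrative, not whisper.cpp internals: the cross-attention K/V are computed once from the encoder output and then shared by every decode call:

```cpp
#include <vector>

// Illustrative cross-attention K/V cache with an instrumentation counter.
struct cross_kv {
    std::vector<float> k, v;
    int compute_count = 0;  // how many times the K/V projection ran
};

// Runs once per audio segment, regardless of how many decode calls follow.
void compute_cross_kv(cross_kv & kv, const std::vector<float> & enc_out) {
    kv.k = enc_out;  // stand-in for the K projection of the encoder output
    kv.v = enc_out;  // stand-in for the V projection
    kv.compute_count++;
}

// One autoregressive decode step: it only *reads* the cached K/V.
float decode_step(const cross_kv & kv, int token) {
    return kv.k[0] + kv.v[0] + float(token);  // stand-in for the decoder math
}
```

Many decode steps (one per generated token) can then run against a single `compute_cross_kv()` call, which is the saving being described.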
Hi, can I use the APIs to convert and run my Whisper encoder and decoder models separately? If yes, how do I do this?