futo-org / whisper-acft

MIT License

how to use this model with whisper.cpp #6

Open stopthinking102 opened 1 month ago

stopthinking102 commented 1 month ago

Can you share which parameter to set to use this model in whisper.cpp? Is it the n_audio_ctx parameter, with 1500 referring to 1.5 seconds? It would have been great if there was an Android sample.

// medium
// hparams: {
// 'n_mels': 80,
// 'n_vocab': 51864,
// 'n_audio_ctx': 1500,
// 'n_audio_state': 1024,
// 'n_audio_head': 16,
// 'n_audio_layer': 24,
// 'n_text_ctx': 448,
// 'n_text_state': 1024,
// 'n_text_head': 16,
// 'n_text_layer': 24
// }
//
// default hparams (Whisper tiny)
struct whisper_hparams {
    int32_t n_vocab = 51864;
    int32_t n_audio_ctx = 1500;
    int32_t n_audio_state = 384;

abb128 commented 1 month ago

You can run whisper.cpp main and set -ac to a number between 1 and 1500. The unit is not milliseconds; it is 50 per second of audio. So if you have a 30 second clip (the maximum), it's 1500. If you have a 5 second clip, it's 5 * 50 = 250. It also usually helps to add a constant of around 64, so you'd do 5 * 50 + 64 = 314.

Programmatically, you can set the audio_ctx field of whisper_full_params.
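
The arithmetic above can be sketched as a small helper. This is only an illustration: audio_ctx_for_seconds and the two constants are hypothetical names, not part of whisper.cpp, and the default padding of 64 is the rough constant suggested above rather than a tuned value.

```cpp
#include <algorithm>
#include <cmath>

// 50 audio-context units per second of audio; a full 30 second clip is 1500.
constexpr int kUnitsPerSecond = 50;
constexpr int kMaxAudioCtx    = 1500;

// Hypothetical helper (not a whisper.cpp function): converts a clip length
// in seconds to an audio context value, adds a safety padding, and keeps
// the result within the valid range 1..1500.
int audio_ctx_for_seconds(double seconds, int padding = 64) {
    const int units = static_cast<int>(std::ceil(seconds * kUnitsPerSecond)) + padding;
    return std::max(1, std::min(units, kMaxAudioCtx));
}
```

For a 5 second clip this gives 5 * 50 + 64 = 314, and a 30 second clip is capped at 1500. The result could be passed on the command line as -ac 314. In code, assuming the field is named audio_ctx as in whisper.h, it would be assigned with something like params.audio_ctx = audio_ctx_for_seconds(5.0); on a whisper_full_params obtained from whisper_full_default_params.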