Open Quackdoc opened 1 year ago
The Whisper model itself expects 16 kHz mono audio.
Ah, that makes sense. I assume burn doesn't do any downsampling of the sample rate or channel downmixing.
This is partially addressed by 4080a33, but if I get the time I plan on looking into resampling and channel downmixing. I do have some work done; however, I was using dasp, which has proven itself to be rather unusable, so I'm looking into different crates.
I looked into fon and it seems like it may work, but I don't like that it hasn't been active since Feb '22.
Currently looking into other crates.
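For reference, the two steps mentioned above (channel downmixing and resampling) can be sketched with the standard library alone. This is only a rough illustration, not what 4080a33 or any of the crates discussed here do: the function names `downmix_to_mono` and `resample_linear` are hypothetical, the input is assumed to be interleaved `f32` samples, and the resampler uses plain linear interpolation with no anti-aliasing filter, so a proper resampling crate should give noticeably better quality.

```rust
/// Average each interleaved frame down to a single channel.
/// Assumes `interleaved` holds f32 samples laid out as L, R, L, R, ...
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels) // any trailing partial frame is dropped
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Naive linear-interpolation resampler (no low-pass filtering, so
/// downsampling can alias). Illustration only.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    assert!(from_hz > 0 && to_hz > 0, "sample rates must be non-zero");
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            // Position of output sample `i` on the input timeline.
            let pos = i as f64 * step;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

/// Convert interleaved audio to the 16 kHz mono layout Whisper expects.
fn to_whisper_input(interleaved: &[f32], channels: usize, sample_rate: u32) -> Vec<f32> {
    let mono = downmix_to_mono(interleaved, channels);
    resample_linear(&mono, sample_rate, 16_000)
}
```

Calling `to_whisper_input(&samples, 2, 48_000)` on one second of stereo 48 kHz audio would yield 16,000 mono samples.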
@Quackdoc have a look at https://github.com/HEnquist/rubato
It does what you need. I've had no success with the synchronous FFT methods yet, but SincFixedIn, which is used in their main example, works well.
Here's how I'm using it. I get a pop at the end, but the main downsampling is very good:
(I had a feeling the synchronous FFT resampling method might be better for wasm, but I haven't tested that and may have misunderstood what it's designed for, as the output is terribly distorted. Still investigating. Hopefully SincInterpolationType::Linear is good enough for real-time use cases.)
It seems the audio decoder is picky about what gets fed into it.
Audio mediainfo
Audio file: https://cdn.discordapp.com/attachments/615105639567589376/1141946730485665893/slap.wav
whisper-ctranslate2:
EDIT: transcoding the audio file using
ffmpeg -i .\slap.wav -ar SAMPLE_RATE -ac 1 slap-edit.wav
seems to make it work. It needs to be both single channel and 41 kHz or less. At 41 kHz the audio output was
At 24 kHz and below it is
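Since the result depends on the channel count and sample rate of the input file, it can help to inspect the WAV header before decoding. The following is a minimal sketch, not part of this project: `WavFormat`, `read_wav_format`, and `needs_conversion` are hypothetical names, and the code assumes a canonical RIFF/WAVE file whose `fmt ` chunk comes immediately after the RIFF header (files with extra chunks before `fmt ` would be rejected).

```rust
/// Sample rate the Whisper model expects, per the discussion above.
const TARGET_SAMPLE_RATE: u32 = 16_000;

#[derive(Debug, PartialEq)]
struct WavFormat {
    channels: u16,
    sample_rate: u32,
}

/// Read channel count and sample rate from a canonical WAV header.
/// Returns None if the layout is not the simple RIFF/WAVE/"fmt " form.
fn read_wav_format(bytes: &[u8]) -> Option<WavFormat> {
    if bytes.len() < 28
        || &bytes[0..4] != b"RIFF"
        || &bytes[8..12] != b"WAVE"
        || &bytes[12..16] != b"fmt "
    {
        return None;
    }
    // Bytes 22..24: channel count, bytes 24..28: sample rate (little-endian).
    let channels = u16::from_le_bytes([bytes[22], bytes[23]]);
    let sample_rate = u32::from_le_bytes([bytes[24], bytes[25], bytes[26], bytes[27]]);
    Some(WavFormat { channels, sample_rate })
}

/// True if the file is not already 16 kHz mono and should be converted
/// first, for example with the ffmpeg command above.
fn needs_conversion(fmt: &WavFormat) -> bool {
    fmt.channels != 1 || fmt.sample_rate != TARGET_SAMPLE_RATE
}
```

Reading the file with `std::fs::read` and passing the bytes to `read_wav_format` would let the caller report an unsupported layout up front instead of producing a bad transcription.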