ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Does it only support wav files, not mp3? #1416

Closed (fniks closed this issue 10 months ago)

fniks commented 10 months ago

Does it only support wav files, not mp3?

jmalfara commented 10 months ago

In my experience, Whisper in general needs a specific encoding. You can use FFmpeg to process the audio before passing it to transcription. I use the Node addon, so here's what I do:

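  // Method of a class that provides `tempPathDir` and `logger`.
  // Presumably relies on imports along these lines (not shown in the original):
  //   import ffmpegCommand from 'fluent-ffmpeg';
  //   import { randomUUID } from 'crypto';
  async encodeForWhisper(
    inputFile: string
  ): Promise<string> {
    // Write the converted audio to a uniquely named temporary WAV file.
    const newFile = `${this.tempPathDir}/${randomUUID()}.wav`;
    await new Promise<void>((resolve, reject) => {
      ffmpegCommand()
        .addInput(inputFile)
        .audioFrequency(16000) // resample to 16000 Hz
        .audioBitrate(16000)
        .audioFilters([
          'lowpass=3000', // attenuate content above 3000 Hz
          'highpass=200', // attenuate content below 200 Hz
          'afftdn=nf=-80', // FFT-based denoising with a -80 dB noise floor
          // strip silent stretches from the audio
          'silenceremove=stop_periods=-1:stop_duration=2:stop_threshold=0.02',
        ])
        .on('error', (err) => {
          this.logger.error(err);
          reject(err);
        })
        .on('end', () => {
          resolve();
        })
        .save(newFile);
    });
    return newFile;
  }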

Using FFmpeg, this sets the bitrate and the sampling frequency to 16000 (that is, a 16 kHz sample rate) and removes any "whitespace" (silence) in the audio file. Hope that helps.

bjnortier commented 10 months ago

Yes, you have to convert to 16-bit, 16 kHz PCM:

  $ ffmpeg -i <input file> -acodec pcm_s16le -ac 1 -ar 16000 <output file>
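
To check whether a file already has that format before converting, one option is to inspect the WAV header. The following is a minimal sketch, assuming a standard RIFF/WAVE layout with a `fmt ` chunk. The names `readWavFormat` and `isMono16k16BitPcm` are hypothetical and are not part of whisper.cpp or FFmpeg.

```typescript
// Hypothetical helper types and functions, for illustration only.
interface WavFormat {
  audioFormat: number;   // 1 means integer PCM
  channels: number;
  sampleRate: number;    // in Hz
  bitsPerSample: number;
}

// Read four bytes as an ASCII tag such as "RIFF" or "fmt ".
function tagAt(bytes: Uint8Array, offset: number): string {
  let s = '';
  for (let i = 0; i < 4; i++) s += String.fromCharCode(bytes[offset + i]);
  return s;
}

// Walk the RIFF chunks and return the fields of the "fmt " chunk,
// or null if the buffer does not look like a WAV file.
function readWavFormat(bytes: Uint8Array): WavFormat | null {
  if (bytes.length < 12) return null;
  if (tagAt(bytes, 0) !== 'RIFF' || tagAt(bytes, 8) !== 'WAVE') return null;

  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let offset = 12;
  while (offset + 8 <= bytes.length) {
    const id = tagAt(bytes, offset);
    const size = view.getUint32(offset + 4, true); // little-endian
    const body = offset + 8;
    if (id === 'fmt ') {
      if (size < 16 || body + 16 > bytes.length) return null;
      return {
        audioFormat: view.getUint16(body, true),
        channels: view.getUint16(body + 2, true),
        sampleRate: view.getUint32(body + 4, true),
        bitsPerSample: view.getUint16(body + 14, true),
      };
    }
    offset = body + size + (size % 2); // chunks are padded to even sizes
  }
  return null;
}

// True when the header matches what the ffmpeg command above produces.
function isMono16k16BitPcm(fmt: WavFormat | null): boolean {
  return (
    fmt !== null &&
    fmt.audioFormat === 1 &&
    fmt.channels === 1 &&
    fmt.sampleRate === 16000 &&
    fmt.bitsPerSample === 16
  );
}
```

In Node, the bytes could come from `fs.readFileSync(path)`, since a `Buffer` is also a `Uint8Array`. Reading only the first few kilobytes is usually enough to reach the `fmt ` chunk, though that is an assumption about typical files rather than a guarantee.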

mikkovedru commented 9 months ago

@bjnortier

Yes, you have to convert to 16-bit, 16 kHz PCM: $ ffmpeg -i <input file> -acodec pcm_s16le -ac 1 -ar 16000 <output file>

Do you know why this trivial conversion is not done automatically, as it is in the original Whisper?

bjnortier commented 9 months ago

The original Whisper also uses ffmpeg, and requires it as an external dependency. It just runs it automatically. Not requiring ffmpeg in whisper.cpp is the right decision, because not all platforms can use it. For example, in my Swift apps I use other libraries or the standard libraries instead.

mikkovedru commented 9 months ago

Thank you for answering.

I understand not requiring any external dependency to get the normal functionality. In fact, I also support it.

But I don't understand why whisper.cpp can't offer a highly sought-after extra feature on top of its normal abilities:

  1. if you give an acceptable input format, whisper.cpp can just use it
  2. if the format is unacceptable, and ffmpeg exists, whisper.cpp can inform/warn about it, but still convert it automatically
  3. if the format is unacceptable, but ffmpeg doesn't exist, whisper.cpp can inform/warn about it and quit.

Right now, in order to use whisper.cpp from the command line, I have to spend time writing, testing, and debugging some kind of shell script as a wrapper for whisper.cpp, which would convert the files, run whisper.cpp, and then delete the extra audio files. This is wasteful for who knows how many other people apart from me (let alone the many people who don't know how to do it).

Does that sound reasonable to you?
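
The three cases listed above can be sketched as a small decision function plus the argument list for the ffmpeg command quoted earlier in this thread. This is only an illustration of the proposed behavior, not code from whisper.cpp. The names `planAction`, `ffmpegArgs`, and `describeAction` are hypothetical, and how to detect an acceptable input or an installed ffmpeg is left to the caller.

```typescript
// Hypothetical names throughout; none of this is part of whisper.cpp.
type Action = 'use-as-is' | 'convert' | 'fail';

// Cases 1 to 3 from the list above, in order.
function planAction(inputIsAcceptable: boolean, ffmpegAvailable: boolean): Action {
  if (inputIsAcceptable) return 'use-as-is'; // case 1
  if (ffmpegAvailable) return 'convert';     // case 2
  return 'fail';                             // case 3
}

// Arguments for the conversion to 16-bit, 16 kHz, mono PCM,
// mirroring the ffmpeg command quoted earlier in the thread.
function ffmpegArgs(inputFile: string, outputFile: string): string[] {
  return ['-i', inputFile, '-acodec', 'pcm_s16le', '-ac', '1', '-ar', '16000', outputFile];
}

// Message to show the user for each case.
function describeAction(action: Action, inputFile: string): string {
  switch (action) {
    case 'use-as-is':
      return `${inputFile}: format accepted`;
    case 'convert':
      return `${inputFile}: unsupported format, converting with ffmpeg`;
    case 'fail':
      return `${inputFile}: unsupported format and ffmpeg was not found`;
  }
}
```

In a Node wrapper, the array from `ffmpegArgs` could be passed to `child_process.spawnSync('ffmpeg', args)`, and the temporary WAV file deleted once transcription finishes.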

bobqianic commented 9 months ago

Does that sound reasonable to you?

Yes, absolutely. However, we're not full-time developers, and there are some more urgent tasks that need our attention first. But if anyone has spare time and would like to contribute their code, please don't hesitate to open a pull request.