ar1st0crat / NWaves

.NET DSP library with a lot of audio processing functions
MIT License

FFT compatible with OpenAI Whisper features #70

Closed: roxima closed this issue 6 months ago

roxima commented 1 year ago

I'm having trouble computing the FFT the way Whisper does. Whisper uses an FFT size of 400, but NWaves requires the FFT size to be a power of 2. Can Whisper-style feature extraction be done with NWaves?

ar1st0crat commented 1 year ago

Currently, only the radix-2 FFT algorithm is implemented. In fact, this is the first time I've seen an FFT of non-power-of-2 size used in a DSP/AI framework. In general, it's possible to implement a slower FFT/log-mel filterbank with an FFT size of 400, but that will take some time. Meanwhile, the following extractor produces results as close to Whisper's as possible (although slightly different). It uses FrameSize=400 and FFTSize=512, so each 400-sample frame is zero-padded to 512 (which is how it's usually done, actually):

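using NWaves.FeatureExtractors;
using NWaves.FeatureExtractors.Options;
using NWaves.Filters.Fda;
using NWaves.Windows;

var samplingRate = 16000;

// 80 Slaney-style mel bands (the band edges themselves are not used below)
var bands = FilterBanks.MelBandsSlaney(80, samplingRate);
// 80 mel filters built for FFT size 512
var filterbank = FilterBanks.MelBankSlaney(80, 512, samplingRate);

var options = new FilterbankOptions
{
    SamplingRate = samplingRate,
    FilterBank = filterbank,
    FrameSize = 400,    // 25 ms at 16 kHz
    HopSize = 160,      // 10 ms at 16 kHz
    Window = WindowType.Hann,
    SpectrumType = SpectrumType.Power,
    NonLinearity = NonLinearityType.Log10,
    LogFloor = 1e-10f,
};

// signal: the input audio, sampled at 16 kHz
var extractor = new FilterbankExtractor(options);
var vectors = extractor.ComputeFrom(signal);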

After this, you'll need to post-process the vectors the same way Whisper's code does (its last two steps):

log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0
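
For readers who don't have torch at hand, here is a minimal pure-Python sketch of those two steps. It assumes the extractor output has been exported as a list of frames, each a list of log10 mel energies; the function name `whisper_postprocess` and the sample values are hypothetical and only for illustration. The point to notice is that the maximum is taken over the whole spectrogram, not per frame.

```python
def whisper_postprocess(log_mel):
    """Apply the two normalization steps quoted above to log10-mel frames.

    log_mel: list of frames, each a list of log10 mel energies
    (a hypothetical stand-in for the `vectors` produced by the extractor).
    """
    # Step 1: clamp everything to at most 8 log10 units below the global
    # maximum, which is taken over ALL frames and ALL mel bands.
    global_max = max(max(frame) for frame in log_mel)
    floor = global_max - 8.0

    # Step 2: shift and scale, (x + 4) / 4.
    return [[(max(v, floor) + 4.0) / 4.0 for v in frame] for frame in log_mel]


# Made-up values: two frames with three mel bands each.
frames = [[-10.0, -3.0, 0.0], [-9.0, -1.0, -0.5]]
normalized = whisper_postprocess(frames)
print(normalized)
```

The same loop translates directly to C# over the extractor's output; only the data layout shown here is an assumption.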