Currently, only the radix-2 FFT algorithm is implemented, so the FFT size must be a power of 2. In fact, this is the first time I've seen a non-power-of-2 FFT size used in a DSP/AI framework. In principle, a slower FFT (and log-mel filterbank) of size 400 could be implemented, but that will take some time. Meanwhile, the following extractor produces results as close to Whisper's as possible (though slightly different). It uses FrameSize=400 and FFTSize=512, as is usually done:
var samplingRate = 16000;
var bands = FilterBanks.MelBandsSlaney(80, samplingRate);
var filterbank = FilterBanks.MelBankSlaney(80, 512, samplingRate);
var options = new FilterbankOptions
{
    SamplingRate = samplingRate,
    FilterBank = filterbank,
    FrameSize = 400,
    HopSize = 160,
    Window = WindowType.Hann,
    SpectrumType = SpectrumType.Power,
    NonLinearity = NonLinearityType.Log10,
    LogFloor = 1e-10f,
};
var extractor = new FilterbankExtractor(options);
var vectors = extractor.ComputeFrom(signal);
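To see why the output is close but not identical, here is a rough sketch in plain Python (not NWaves code). It assumes each 400-sample frame is zero-padded to 512 samples before the FFT. `hann` and `power_spectrum` are illustrative helpers, and the direct DFT is used only because it works for any size:

```python
import cmath
import math

def hann(n):
    # Periodic Hann window of length n
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def power_spectrum(frame, fft_size):
    """Direct O(N^2) DFT of a frame zero-padded to fft_size.

    Returns |X[k]|^2 for k = 0 .. fft_size / 2.
    """
    spectrum = []
    for k in range(fft_size // 2 + 1):
        acc = 0j
        # The padded zeros contribute nothing, so summing over the frame is enough
        for n, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * n / fft_size)
        spectrum.append(abs(acc) ** 2)
    return spectrum

sampling_rate = 16000
window = hann(400)
# One windowed 400-sample frame of a 1 kHz test tone
frame = [w * math.sin(2 * math.pi * 1000 * i / sampling_rate)
         for i, w in enumerate(window)]

p400 = power_spectrum(frame, 400)  # frame transformed at its own size
p512 = power_spectrum(frame, 512)  # frame zero-padded to 512
print(len(p400), len(p512))
```

A 400-point transform gives 201 bins spaced 40 Hz apart, while the zero-padded 512-point one gives 257 bins spaced 31.25 Hz apart. Both sample the same underlying spectrum of the windowed frame, so the mel energies should come out similar, just not bit-for-bit equal.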
After this, you'll need to post-process vectors the same way Whisper's code does (its last two steps):
log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0
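If torch isn't at hand, the same two steps can be sketched in plain Python. This is only an illustration: `whisper_postprocess` is a hypothetical helper name, and it assumes the features arrive as a list of frames, each holding log10 mel energies (the layout of the extractor output). The maximum is taken over the whole spectrogram, as `log_spec.max()` does:

```python
def whisper_postprocess(log_mel):
    """Clamp to (global max - 8) and rescale, mirroring the two torch lines.

    log_mel: list of frames, each a list of log10 mel energies.
    """
    # Global maximum over all frames and all bands
    global_max = max(max(frame) for frame in log_mel)
    floor = global_max - 8.0
    # Equivalent of torch.maximum(...) followed by (x + 4.0) / 4.0
    return [[(max(v, floor) + 4.0) / 4.0 for v in frame] for frame in log_mel]

# Tiny example: two frames with two bands each
features = whisper_postprocess([[-10.0, 0.0], [-3.0, 1.0]])
print(features)
```

As far as I know, Whisper keeps its spectrogram as mel bands by frames, so a transpose may be needed before passing the result to the model.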
I'm having trouble doing the FFT the way Whisper does. Whisper uses an FFT size of 400, but NWaves requires the size to be a power of 2. Can Whisper-style feature extraction be done with NWaves?