Closed by yangppde 1 month ago
I also encountered the same problem.
whisper, print(log_spec):

    tensor([[-0.5389, -0.5389, -0.3746, ..., -0.5389, -0.5389, -0.5389],
            [-0.5389, -0.5389, -0.2457, ..., -0.5389, -0.5389, -0.5389],
            [-0.5389, -0.5047, -0.2503, ..., -0.5389, -0.5389, -0.5389],
            ...,
            [-0.5389, -0.5389, -0.5389, ..., -0.5389, -0.5389, -0.5389],
            [-0.5389, -0.5389, -0.5389, ..., -0.5389, -0.5389, -0.5389],
            [-0.5389, -0.5389, -0.5389, ..., -0.5389, -0.5389, -0.5389]])

kaldifeat:

    tensor([[[-0.5360, -0.3759, -0.5360, ..., -0.2331, -0.1518, -0.3079],
             [-0.5360, -0.1716, -0.1987, ..., -0.1566, -0.0230, -0.1632],
             [-0.5104, -0.1350, -0.0655, ..., -0.2010, -0.0498, -0.0667],
             ...,
             [-0.5360, -0.5360, -0.5360, ..., -0.5360, -0.5360, -0.5360],
             [-0.5360, -0.5360, -0.5360, ..., -0.5360, -0.5360, -0.5360],
             [-0.5360, -0.5360, -0.5360, ..., -0.5360, -0.5360, -0.5360]]])
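
A note for reading the two printouts: the repeated values (-0.5389 and -0.5360) look like the clamping floor of Whisper-style normalization. In openai/whisper's `log_mel_spectrogram`, the log10 mel spectrogram is clamped to within 8 of its global maximum and then mapped through `(x + 4) / 4`, so the floor equals `(max - 4) / 4` and moves whenever the maximum moves. Below is a minimal NumPy sketch of that step only; the function name `whisper_normalize` and the toy input are made up for illustration, and it assumes kaldifeat applies the same normalization.

```python
import numpy as np


def whisper_normalize(mel_power: np.ndarray) -> np.ndarray:
    """Whisper-style log-mel normalization (hypothetical helper name).

    mel_power: non-negative mel filterbank energies, any shape.
    """
    log_spec = np.log10(np.maximum(mel_power, 1e-10))
    # Clamp every bin to at most 8 (in log10 units) below the global maximum.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0


# Toy input: one loud bin, one moderate bin, and several silent bins.
toy = np.array([[100.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
normalized = whisper_normalize(toy)
floor = normalized.min()  # equals (log10(100) - 4) / 4
```

Under this formula, a floor of -0.5389 would correspond to a log10 maximum of about 1.844 and a floor of -0.5360 to about 1.856, so the two pipelines may be seeing slightly different peak energies.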
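
Code used for the kaldifeat features:

    import kaldifeat
    import librosa
    import torch

    wave, sample_rate = librosa.load(wav_filename, sr=16000, mono=True)
    wave = torch.tensor(wave)

    opts = kaldifeat.WhisperFbankOptions()
    opts.device = torch.device('cpu', 0)
    opts.num_mels = 80
    fbank = kaldifeat.WhisperFbank(opts)
    features = fbank(wave)
    # Transpose and add a batch axis so the layout matches the printout above.
    print(features.t().unsqueeze(0))

Is this the correct way to do it?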
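
Comparing truncated printouts by eye is error-prone, so it may help to compare the two feature matrices numerically. The sketch below is one way to do that with NumPy; `compare_features` and its tolerance are hypothetical, and it assumes both inputs are laid out as (n_mels, T), optionally with a leading batch axis of size 1 as in the kaldifeat printout.

```python
import numpy as np


def compare_features(a, b, atol=1e-3):
    """Compare two feature matrices after aligning their layout (hypothetical helper).

    Accepts array-likes of shape (n_mels, T) or (1, n_mels, T).
    Returns (max_abs_diff, all_close) over the frames both inputs share.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Drop a leading batch axis of size 1, e.g. (1, 80, T) -> (80, T).
    if a.ndim == 3 and a.shape[0] == 1:
        a = a[0]
    if b.ndim == 3 and b.shape[0] == 1:
        b = b[0]
    if a.ndim != 2 or b.ndim != 2 or a.shape[0] != b.shape[0]:
        raise ValueError(f"incompatible shapes: {a.shape} vs {b.shape}")
    # The two pipelines may yield different frame counts; compare the overlap only.
    n = min(a.shape[1], b.shape[1])
    max_diff = float(np.abs(a[:, :n] - b[:, :n]).max())
    return max_diff, bool(max_diff <= atol)


# Toy data standing in for the two feature matrices.
ref = np.arange(12, dtype=np.float64).reshape(3, 4)
other = ref[np.newaxis, :, :3] + 0.0005
max_diff, close = compare_features(ref, other)
```

CPU torch tensors can be passed in after calling `.numpy()` on them.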
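Does recognition have problems when the features are computed with the kaldifeat Whisper fbank?
Yes, the fbank produced by kaldifeat causes problems: no text was recognized successfully.
I modified log_mel_spectrogram in whisper.audio as follows: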

    def log_mel_spectrogram(
        audio: Union[str, np.ndarray, torch.Tensor],
        n_mels: int = 80,
        padding: int = 0,
        device: Optional[Union[str, torch.device]] = None,
    ):
        if not torch.is_tensor(audio):
            if isinstance(audio, str):
                audio = load_audio(audio)
            audio = torch.from_numpy(audio)
        opts = kaldifeat.WhisperFbankOptions()
        opts.device = torch.device('cpu', 0)
        opts.num_mels = 80
        fbank = kaldifeat.WhisperFbank(opts)
        features = fbank(torch.tensor(audio))
        return features.t()  # --> Is this correct?
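
One difference worth checking in the modified function: it accepts `padding` and `device` but never uses them. In openai/whisper, `log_mel_spectrogram` appends `padding` zero samples to the waveform before computing the spectrogram, and `transcribe` calls it with `padding=N_SAMPLES` (30 seconds at 16 kHz). The sketch below shows that padding step in NumPy so it could be applied to the waveform before feature extraction; `pad_waveform` is a hypothetical name, and whether the missing padding explains the failed recognition is not established in this thread.

```python
import numpy as np

SAMPLE_RATE = 16000
N_SAMPLES = 30 * SAMPLE_RATE  # 480000 samples, one 30-second Whisper window


def pad_waveform(audio: np.ndarray, padding: int = 0) -> np.ndarray:
    """Append `padding` zero samples to a 1-D waveform (hypothetical helper)."""
    if padding < 0:
        raise ValueError("padding must be non-negative")
    if padding == 0:
        return audio
    # np.pad defaults to constant mode with zeros, i.e. trailing silence.
    return np.pad(audio, (0, padding))


wave = np.ones(SAMPLE_RATE, dtype=np.float32)  # 1 second of dummy audio
padded = pad_waveform(wave, padding=N_SAMPLES)
```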
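Could you prepare a Colab notebook so that we can reproduce this?
Some people have used kaldifeat to extract Whisper features and then run recognition with them.
In kaldifeat/kaldifeat/python/tests/test_whisper_fbank.py, the two extraction methods give inconsistent results, while the first one matches Whisper's output.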