csukuangfj / kaldifeat

Kaldi-compatible online & offline feature extraction with PyTorch, supporting CUDA, batch processing, chunk processing, and autograd - Provide C++ & Python API
https://csukuangfj.github.io/kaldifeat

kaldifeat feature extraction is inconsistent with Whisper's feature extraction #101

Closed yangppde closed 1 month ago

yangppde commented 3 months ago

In kaldifeat/kaldifeat/python/tests/test_whisper_fbank.py, the two extraction methods produce different results, while the first method matches Whisper's output.

johnchienbronci commented 2 months ago

I also encountered the same problem.

whisper print(log_spec)

tensor([[-0.5389, -0.5389, -0.3746,  ..., -0.5389, -0.5389, -0.5389],
        [-0.5389, -0.5389, -0.2457,  ..., -0.5389, -0.5389, -0.5389],
        [-0.5389, -0.5047, -0.2503,  ..., -0.5389, -0.5389, -0.5389],
        ...,
        [-0.5389, -0.5389, -0.5389,  ..., -0.5389, -0.5389, -0.5389],
        [-0.5389, -0.5389, -0.5389,  ..., -0.5389, -0.5389, -0.5389],
        [-0.5389, -0.5389, -0.5389,  ..., -0.5389, -0.5389, -0.5389]])

kaldifeat

tensor([[[-0.5360, -0.3759, -0.5360,  ..., -0.2331, -0.1518, -0.3079],
         [-0.5360, -0.1716, -0.1987,  ..., -0.1566, -0.0230, -0.1632],
         [-0.5104, -0.1350, -0.0655,  ..., -0.2010, -0.0498, -0.0667],
         ...,
         [-0.5360, -0.5360, -0.5360,  ..., -0.5360, -0.5360, -0.5360],
         [-0.5360, -0.5360, -0.5360,  ..., -0.5360, -0.5360, -0.5360],
         [-0.5360, -0.5360, -0.5360,  ..., -0.5360, -0.5360, -0.5360]]])

wave, sample_rate = librosa.load(wav_filename, sr=16000, mono=True)
wave = torch.tensor(wave)

opts = kaldifeat.WhisperFbankOptions()
opts.device = torch.device('cpu', 0)
opts.num_mels = 80
fbank = kaldifeat.WhisperFbank(opts)
features = fbank(wave)
print(features.t().unsqueeze(0))

Is this the correct way to use it?
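One way to read the two printouts above: the last steps of Whisper's log_mel_spectrogram clamp the mel energies at 1e-10, take log10, clip everything to within 8.0 of the global peak, and then map x to (x + 4) / 4. Every "floor" entry therefore equals (peak - 4) / 4. The sketch below reproduces only that arithmetic in plain Python. The function names are hypothetical, and it assumes kaldifeat applies the same normalization, which the thread does not confirm.

```python
import math


def whisper_style_normalize(mel_power):
    """Apply Whisper's final log-mel normalization to a list of rows.

    mel_power holds non-negative mel filterbank energies (rows are mel bins).
    """
    # Clamp to 1e-10 before log10 so that silence does not produce -inf.
    log_spec = [[math.log10(max(v, 1e-10)) for v in row] for row in mel_power]
    # Limit the dynamic range to 8.0 (in log10 units) below the global peak.
    peak = max(max(row) for row in log_spec)
    floor = peak - 8.0
    # Shift and scale: x -> (x + 4) / 4.
    return [[(max(v, floor) + 4.0) / 4.0 for v in row] for row in log_spec]


def implied_peak(floor_value):
    """Invert floor_value = (peak - 8 + 4) / 4 to recover the global peak."""
    return 4.0 * floor_value + 4.0


# Floors taken from the two printouts in this thread.
whisper_peak = implied_peak(-0.5389)
kaldifeat_peak = implied_peak(-0.5360)
print(whisper_peak, kaldifeat_peak)
```

Under that assumption, the differing floors would mean the two pipelines saw different global maxima before normalization, so the mismatch would already be present in the spectrogram itself (for example in framing, padding, or input scaling) rather than in the final scaling step.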

csukuangfj commented 2 months ago

Do the features computed with kaldifeat's Whisper fbank cause any problems during recognition?

johnchienbronci commented 2 months ago

Yes, the fbank produced by kaldifeat causes problems: no text was recognized successfully.

I modified log_mel_spectrogram in whisper.audio as follows:

def log_mel_spectrogram(
    audio: Union[str, np.ndarray, torch.Tensor],
    n_mels: int = 80,
    padding: int = 0,
    device: Optional[Union[str, torch.device]] = None,
):
    if not torch.is_tensor(audio):
        if isinstance(audio, str):
            audio = load_audio(audio)
        audio = torch.from_numpy(audio)

    opts = kaldifeat.WhisperFbankOptions()
    opts.device = torch.device('cpu', 0)
    opts.num_mels = 80
    fbank = kaldifeat.WhisperFbank(opts)
    features = fbank(torch.tensor(audio))

    return features.t()  # Is this correct?

csukuangfj commented 2 months ago

Could you prepare a Colab notebook so that we can reproduce this?

Other users have extracted Whisper features with kaldifeat and then run recognition on them.
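A reproduction along these lines could start from a direct numerical comparison of the two feature matrices. The helper below is a minimal sketch, not part of kaldifeat or Whisper: compare_features is a hypothetical name, and it assumes that one input is laid out as (n_mels, frames), as in the Whisper printout, while the other may be (frames, n_mels) with an optional batch dimension of size 1. It compares only the frames both inputs have in common, since padding can make the frame counts differ.

```python
import torch


def compare_features(a: torch.Tensor, b: torch.Tensor, n_mels: int = 80):
    """Return (max_abs_diff, frames_compared) for two log-mel matrices."""

    def to_mels_by_frames(x: torch.Tensor) -> torch.Tensor:
        x = x.squeeze()  # drop batch dimensions of size 1
        if x.dim() != 2:
            raise ValueError(f"expected a 2-D feature matrix, got {tuple(x.shape)}")
        if x.shape[0] != n_mels:
            x = x.t()  # assume the layout was (frames, n_mels)
        if x.shape[0] != n_mels:
            raise ValueError(f"no axis of size {n_mels} in {tuple(x.shape)}")
        return x.to(torch.float32)

    a2 = to_mels_by_frames(a)
    b2 = to_mels_by_frames(b)
    # Compare only the frames present in both inputs.
    n = min(a2.shape[1], b2.shape[1])
    diff = (a2[:, :n] - b2[:, :n]).abs().max().item()
    return diff, n
```

With log_spec from Whisper and features from kaldifeat.WhisperFbank, calling compare_features(log_spec, features) gives a single number to report alongside the audio file. A layout with exactly n_mels frames is ambiguous for this helper and would need to be transposed by hand.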