csukuangfj / kaldifeat

Kaldi-compatible online & offline feature extraction with PyTorch, supporting CUDA, batch processing, chunk processing, and autograd - Provide C++ & Python API
https://csukuangfj.github.io/kaldifeat

Does kaldifeat's Python output deviate noticeably from the kaldi-asr C++ output? #92

Closed Consulting4J closed 7 months ago

Consulting4J commented 8 months ago

The code seems to be diverging more and more. The most important and useful thing is really to keep torchaudio.compliance.kaldi and kaldifeat 99.9% compatible.

csukuangfj commented 8 months ago

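"The code seems to be diverging more and more. The most important and useful thing is really to keep torchaudio.compliance.kaldi and kaldifeat 99.9% compatible."

That is supported. Never mind 99.9%; it should be 99.9999% compatible.

"Is there a plan to sync the feature extraction with the latest Kaldi code?"

What do you mean by the latest Kaldi code?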

Consulting4J commented 8 months ago

I mean this part of the code: https://github.com/kaldi-asr/kaldi/tree/master/src/feat

csukuangfj commented 8 months ago

This repo uses exactly Kaldi's C++ code plus libtorch.

If you don't want libtorch, you can use https://github.com/csukuangfj/kaldi-native-fbank

Consulting4J commented 8 months ago

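May I ask why my results deviate? Is something wrong with my settings? I would appreciate any pointers.

First, the comparison with torchaudio.compliance.kaldi. torchaudio matches kaldifeat only when use_energy=True is set and the data returned by read_wave is passed in. Reading the file directly with soundfile does not match unless the samples are multiplied by the constant 32768. That part all seems normal. The code is as follows: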

    import torch
    import torchaudio.compliance.kaldi as Kaldi
    from torch.nn.functional import cosine_similarity
    import soundfile
    import numpy as np

    import kaldifeat
    from kaldiutils import get_devices, read_ark_txt, read_wave

    wave = read_wave('dataset/test.wav')
    opts = kaldifeat.MfccOptions()  # default
    mfcc = kaldifeat.Mfcc(opts)
    features = mfcc(wave.to('cpu'))

    # samples, sample_rate = soundfile.read(filename, dtype='float32')

    samples = wave
    if len(samples.shape) == 1:
        samples = samples.unsqueeze(0)
    features2 = Kaldi.mfcc(waveform=samples, use_energy=True, num_mel_bins=23, num_ceps=13, window_type='povey')
    normalized_tensor_1 = features / features.norm(dim=-1, keepdim=True)
    normalized_tensor_2 = features2 / features2.norm(dim=-1, keepdim=True)
    normalized_tensor_2 = normalized_tensor_2.to(normalized_tensor_1.device)

    # Compute the cosine similarity per frame; 1 means the two frames are identical
    cos_value = cosine_similarity(normalized_tensor_1, normalized_tensor_2)
    print(cos_value)

But once I compare kaldifeat's Python output with the kaldi-asr C++ output, the deviation is fairly large, and I cannot tell which setting is wrong. The Python code is as follows:

    def compute_mfcc():
        wave = read_wave('dataset/test.wav')
        opts = kaldifeat.MfccOptions()  # default
        mfcc = kaldifeat.Mfcc(opts)
        features = mfcc(wave.to('cpu'))
        features = to_numpy(features)
        # Print one value per line, frame by frame
        for row in features:
            for col in row:
                print(col)

The C++ code is as follows:

    float *native_kaldi_mfcc(const char *wave_file) {
      Matrix<BaseFloat> mfcc_features;

      // Configuration
      MfccOptions mfcc_opts;
      // mfcc_opts.frame_opts.samp_freq = 16000;  // Modify as needed

      // Instantiate the MFCC extractor
      Mfcc mfcc(mfcc_opts);

      // Read the wave file
      WaveData wave_data;
      {
        Input ki(wave_file);  // "dataset/test.wav"
        wave_data.Read(ki.Stream());
      }

      // Extract the features
      mfcc.ComputeFeatures(wave_data.Data().Row(0), wave_data.SampFreq(), 1.0, &mfcc_features);
      // cout << "mfcc_feature rows: " << mfcc_features.NumRows() << ", cols:" << mfcc_features.NumCols() << std::endl;
      std::vector<float> audio_feature;

      // Copy the matrix into a one-dimensional array
      float *audio_mfcc = (float *)malloc(sizeof(float) * mfcc_features.NumCols() * mfcc_features.NumRows());
      int index = 0;
      for (int i = 0; i < mfcc_features.NumRows(); i++) {
        for (int j = 0; j < mfcc_features.NumCols(); j++) {
          audio_mfcc[index] = mfcc_features.Index(i, j);
          cout << audio_mfcc[index] << endl;
          index++;
        }
      }
      return audio_mfcc;
    }

Consulting4J commented 8 months ago

Here is the test output. The input file is kaldifeat/python/tests/test_data/test.wav from this repo.

mfcc_python.txt mfcc_c++.txt
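Since both programs above print one value per line, the two dumps can be compared numerically instead of by eye. The following is a minimal sketch under that assumption, with 13 values per frame (the num_ceps used in this thread); `parse_dump` and `compare_dumps` are hypothetical helper names, not part of kaldifeat or Kaldi.

```python
def parse_dump(text, num_cols=13):
    """Parse a one-value-per-line dump into frames of num_cols floats."""
    values = [float(token) for token in text.split()]
    if len(values) % num_cols != 0:
        raise ValueError("value count is not a multiple of num_cols")
    return [values[i:i + num_cols] for i in range(0, len(values), num_cols)]


def compare_dumps(text_a, text_b, num_cols=13, tol=1e-3):
    """Return (max absolute difference, index of first frame above tol or None)."""
    rows_a = parse_dump(text_a, num_cols)
    rows_b = parse_dump(text_b, num_cols)
    if len(rows_a) != len(rows_b):
        raise ValueError(f"frame counts differ: {len(rows_a)} vs {len(rows_b)}")
    max_diff = 0.0
    first_bad = None
    for idx, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        frame_diff = max(abs(x - y) for x, y in zip(row_a, row_b))
        max_diff = max(max_diff, frame_diff)
        if first_bad is None and frame_diff > tol:
            first_bad = idx
    return max_diff, first_bad


if __name__ == "__main__":
    # Tiny made-up dumps with 2 values per frame, only to show the call.
    dump_a = "1.0\n2.0\n3.0\n4.0\n"
    dump_b = "1.0\n2.0\n3.0\n4.5\n"
    print(compare_dumps(dump_a, dump_b, num_cols=2))
```

With the real files, one would pass the contents of mfcc_python.txt and mfcc_c++.txt (for example via `open(path).read()`). A mismatch in frame count already hints at different framing options, before any value is compared.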

csukuangfj commented 8 months ago

https://github.com/csukuangfj/kaldifeat/blob/master/kaldifeat/python/tests/test_mfcc_options.py

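Please read this test file yourself.

You need to make sure that kaldifeat and torchaudio use the same parameters.

To repeat: you need to make sure that you use the same parameters.

If you still cannot get the same results, please post the parameters you used and I will compare them for you.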

Consulting4J commented 8 months ago

The Python version is actually already quite good. These are the parameters used: Kaldi.mfcc(waveform=samples,use_energy=True, num_mel_bins=23,num_ceps=13, window_type='povey')

The full code is as follows:

    opts = kaldifeat.MfccOptions()  # default
    opts.device = device
    mfcc = kaldifeat.Mfcc(opts)
    features = mfcc(wave.to(device))
    if device.type == "cpu":
        cpu_features = features

    # samples, sample_rate = soundfile.read(filename, dtype='float32')

    samples = wave  # torch.from_numpy(samples)
    if len(samples.shape) == 1:
        samples = samples.unsqueeze(0)
    features2 = Kaldi.mfcc(waveform=samples, use_energy=True, num_mel_bins=23, num_ceps=13, window_type='povey')

    normalized_tensor_1 = features / features.norm(dim=-1, keepdim=True)
    normalized_tensor_2 = features2 / features2.norm(dim=-1, keepdim=True)
    normalized_tensor_2 = normalized_tensor_2.to(normalized_tensor_1.device)
    # Compute the cosine similarity per frame; 1 means the two frames are identical
    cos_value = cosine_similarity(normalized_tensor_1, normalized_tensor_2)
    print(cos_value)

Output: tensor([1.0000, 0.9988, 0.9978, 0.9986, 0.9979, 0.9984, 0.9996, 0.9998, 0.9999, 1.0000, 0.9998, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9997, 0.9900, 0.9974, 0.9947, 0.9966, 0.9999, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9998, 0.9997, 0.9978, 0.9995, 0.9999, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9994, 0.9972, 0.9962, 0.9921, 0.9851, 0.9829, 0.9767, 0.9752, 0.9788, 0.9999, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9999, 1.0000, 1.0000, 1.0000, 0.9999, 0.9999, 0.9995, 0.9998, 0.9972, 0.9986, 0.9972, 0.9968, 0.9957, 0.9927, 0.9959, 0.9955, 0.9855])
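One caveat when reading these numbers: cosine similarity ignores the overall scale of a frame, so values near 1 do not by themselves show that the features are numerically equal. A small pure-Python sketch of that point, using made-up values rather than real MFCCs (`cosine` and `max_abs_diff` are illustrative helpers, not library functions):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def max_abs_diff(u, v):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(u, v))


# Made-up frame and a copy scaled by 2: same direction, different values.
frame = [10.0, -3.0, 2.5]
scaled = [2.0 * x for x in frame]

if __name__ == "__main__":
    print("cosine similarity:", cosine(frame, scaled))
    print("max abs diff     :", max_abs_diff(frame, scaled))
```

An element-wise check (maximum absolute difference, or `torch.allclose` on the tensors) is a stricter complement to the cosine values printed above.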

What puzzles me is why the output differs so much from the C++ version.

Consulting4J commented 8 months ago

I aligned the MfccOptions in the C++ code with test_default() in test_mfcc_options.py, but the results seem to differ even more. The results are in mfcc_c++.txt.

    float *native_kaldi_mfcc(const char *wave_file) {
      MfccOptions opts;
      opts.frame_opts.samp_freq = 16000;
      opts.frame_opts.frame_shift_ms = 10.0;
      opts.frame_opts.frame_length_ms = 25.0;
      opts.frame_opts.dither = 1.0;
      opts.frame_opts.preemph_coeff = 0.97;
      opts.frame_opts.remove_dc_offset = true;
      opts.frame_opts.window_type = "povey";
      opts.frame_opts.round_to_power_of_two = true;
      opts.frame_opts.blackman_coeff = 0.42;
      opts.frame_opts.snip_edges = true;

      opts.mel_opts.num_bins = 23;
      opts.mel_opts.low_freq = 20;
      opts.mel_opts.high_freq = 0;
      opts.mel_opts.vtln_low = 100;
      opts.mel_opts.vtln_high = -500;
      opts.mel_opts.debug_mel = false;
      opts.mel_opts.htk_mode = false;

      opts.num_ceps = 13;
      opts.use_energy = true;
      opts.energy_floor = 0;
      opts.raw_energy = true;
      opts.cepstral_lifter = 22.0;
      opts.htk_compat = false;

      // Instantiate the MFCC extractor
      Mfcc mfcc(opts);

      // Read the wave file
      WaveData wave_data;
      {
        Input ki(wave_file);
        wave_data.Read(ki.Stream());
      }

      // Extract the features
      Matrix<BaseFloat> mfcc_feature;
      mfcc.ComputeFeatures(wave_data.Data().Row(0), wave_data.SampFreq(), 1.0, &mfcc_feature);
      std::vector<float> audio_feature;

      // Copy the matrix into a one-dimensional array
      float *audio_mfcc = (float *)malloc(sizeof(float) * mfcc_feature.NumCols() * mfcc_feature.NumRows());
      int index = 0;
      for (int i = 0; i < mfcc_feature.NumRows(); i++) {
        for (int j = 0; j < mfcc_feature.NumCols(); j++) {
          audio_mfcc[index] = mfcc_feature.Index(i, j);
          cout << audio_mfcc[index] << endl;
          index++;
        }
        // cout << endl;
      }
      return audio_mfcc;
    }

mfcc_c++.txt

    int main(int argc, char *argv[]) {
      native_kaldi_mfcc("test_data/test.wav");
      return 0;
    }

csukuangfj commented 8 months ago

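I suggest you make a table listing all the options of both torchaudio and kaldifeat.

If you don't want to compare them yourself, just list them and I will do the comparison.

(I recommend doing it yourself. I will help you once more here.)

"The Python version is actually already quite good. These are the parameters used: Kaldi.mfcc(waveform=samples,use_energy=True, num_mel_bins=23,num_ceps=13, window_type='povey')"

I counted: the mfcc call above passes 5 arguments. Go to https://pytorch.org/audio/main/generated/torchaudio.compliance.kaldi.mfcc.html and count how many parameters it has. Then look at the default values and see how they differ from the ones in kaldifeat.

That is all there is to the comparison.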
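The comparison described above can be partly automated. The sketch below collects the default arguments of a function with `inspect.signature` and reports every option whose value differs from a second set of options. `toy_mfcc` and `other_opts` are hypothetical stand-ins with made-up values; they are not the real torchaudio signature or the real kaldifeat defaults.

```python
import inspect


def defaults_of(func):
    """Collect the keyword defaults declared in a function signature."""
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }


def diff_options(a, b):
    """Return {name: (a_value, b_value)} for options that differ or are missing."""
    missing = "<missing>"
    return {
        key: (a.get(key, missing), b.get(key, missing))
        for key in sorted(set(a) | set(b))
        if a.get(key, missing) != b.get(key, missing)
    }


# Hypothetical stand-in for a feature function; NOT the real torchaudio signature.
def toy_mfcc(waveform, dither=0.0, use_energy=False, num_ceps=13):
    return waveform


# Hypothetical option values written down by hand for the other toolkit.
other_opts = {"dither": 1.0, "use_energy": True, "num_ceps": 13}

if __name__ == "__main__":
    for name, (x, y) in diff_options(defaults_of(toy_mfcc), other_opts).items():
        print(f"{name}: {x!r} vs {y!r}")
```

With torchaudio installed, `defaults_of(torchaudio.compliance.kaldi.mfcc)` should give one side of the table. The other side would be a dict of the kaldifeat (or Kaldi C++) option values written out by hand, with names mapped to the torchaudio spelling where they differ.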

csukuangfj commented 8 months ago

Also, wave data in Kaldi ranges from -32768 to 32767, whereas in Python it usually ranges from -1 to 1. Please check that the inputs are the same.
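To illustrate the range difference: samples read as floats in [-1, 1] can be brought to the 16-bit range by multiplying by 32768, the constant mentioned earlier in this thread. A minimal pure-Python sketch follows; `to_kaldi_range` and `looks_normalized` are hypothetical helper names, not functions from kaldifeat or soundfile.

```python
def to_kaldi_range(samples):
    """Scale float samples from [-1, 1] to the 16-bit integer range."""
    return [s * 32768.0 for s in samples]


def looks_normalized(samples):
    """Heuristic check: True if every sample lies within [-1, 1]."""
    return max(abs(s) for s in samples) <= 1.0


if __name__ == "__main__":
    floats = [0.5, -1.0, 0.0]  # made-up samples in the [-1, 1] convention
    scaled = to_kaldi_range(floats)
    print(scaled, looks_normalized(floats), looks_normalized(scaled))
```

Running such a check on the input of each pipeline before feature extraction makes it easy to see whether both sides are fed samples on the same scale.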

csukuangfj commented 8 months ago

Quoting your C++ code:

    float *native_kaldi_mfcc(const char *wave_file) {
      MfccOptions opts;
      opts.frame_opts.samp_freq = 16000;
      opts.frame_opts.frame_shift_ms = 10.0;
      opts.frame_opts.frame_length_ms = 25.0;
      opts.frame_opts.dither = 1.0;

Why is dither still turned on in your C++ code?

Didn't we agree to use the same parameters?
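For context on why this option matters: in Kaldi, dithering adds Gaussian noise scaled by the dither value to each sample before the features are computed, so a nonzero dither changes the output relative to a run without it, and from run to run. A minimal pure-Python sketch of the idea, where `apply_dither` is a hypothetical helper and not Kaldi's or kaldifeat's API:

```python
import random


def apply_dither(samples, dither, rng):
    """Add zero-mean, unit-variance Gaussian noise scaled by `dither` to each sample."""
    if dither == 0.0:
        return list(samples)  # no dithering: samples pass through unchanged
    return [s + rng.gauss(0.0, 1.0) * dither for s in samples]


if __name__ == "__main__":
    wave = [100.0, -250.0, 30.0]  # made-up samples in the 16-bit range

    # dither = 0 is deterministic
    print(apply_dither(wave, 0.0, random.Random(0)) == wave)

    # dither = 1 gives different samples for different random states
    run_a = apply_dither(wave, 1.0, random.Random(1))
    run_b = apply_dither(wave, 1.0, random.Random(2))
    print(run_a != run_b)
```

When comparing two implementations value by value, setting dither to 0 on both sides removes this source of randomness.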

csukuangfj commented 8 months ago

https://github.com/csukuangfj/kaldifeat/blob/master/kaldifeat/python/tests/test_mfcc.py

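This is the code we use to test MFCC in kaldifeat.

You should trust that the MFCCs computed by kaldifeat are fully compatible with Kaldi.

If your results differ, please look for the cause on your side. As said above: (1) check that the parameters are exactly the same; (2) check that the input is exactly the same.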

Consulting4J commented 7 months ago

y