Elsaam2y / DINet_optimized

An optimized pipeline for DINet, reducing inference latency by up to 60% 🚀. Kudos to the authors of the original repo for this amazing work.

Can Wav2vecDS be made more general? #22

Open tailangjun opened 3 months ago

tailangjun commented 3 months ago

I found that the audio features output by torchaudio.pipelines.HUBERT_ASR_LARGE have dimension (m, 29), and the audio features output by DeepSpeech v0.1 have dimension (n, 29), where m and n are not far apart. The function of Wav2vecDS is to map features of dimension (m, 29) to dimension (n, 29). I'm wondering whether Wav2vecDS could be made more general, supporting the mapping of features of any dimension to (n, 29), similar to AudioNet in ER-NeRF/nerf_triplane/network.py. That way, any model with Chinese support could be chosen to extract speech features, for example chinese-wav2vec2-large or chinese-hubert-large.
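The mapping described above has two independent parts: resampling the time axis (m frames to n frames) and projecting the feature axis to the 29 DeepSpeech-like dimensions. As a rough illustration of that idea (not the repo's actual Wav2vecDS implementation), here is a minimal NumPy sketch; `map_audio_features` and its `proj` matrix are hypothetical, and in a real generalization `proj` would be a learned layer rather than a fixed matrix:

```python
import numpy as np

def map_audio_features(feat: np.ndarray, target_len: int, proj: np.ndarray) -> np.ndarray:
    """Map (m, d) encoder features to (target_len, 29).

    feat: (m, d) features from an arbitrary encoder (e.g. wav2vec2 / HuBERT).
    proj: (d, 29) projection matrix; here a placeholder, in practice learned.
    """
    m, d = feat.shape
    # 1) resample the time axis from m frames to target_len frames
    #    (per-channel linear interpolation over a normalized time grid)
    src_t = np.linspace(0.0, 1.0, m)
    dst_t = np.linspace(0.0, 1.0, target_len)
    resampled = np.stack(
        [np.interp(dst_t, src_t, feat[:, i]) for i in range(d)], axis=1
    )
    # 2) project the feature axis from d dims down to 29 dims
    return resampled @ proj

# toy usage: 100 frames of 1024-dim features -> 96 frames of 29 dims
rng = np.random.default_rng(0)
feat = rng.standard_normal((100, 1024))
proj = rng.standard_normal((1024, 29)) * 0.01
out = map_audio_features(feat, 96, proj)
print(out.shape)  # (96, 29)
```

This only shows the shape bookkeeping; as the maintainer notes below, making the learned mapping itself general enough to preserve lip-sync quality is the hard part.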

Elsaam2y commented 2 months ago

Yes, thanks for the recommendation. I tried doing this, mainly to support Chinese; however, the mapping became more complex and the output features weren't always convincing, as seen in the resulting lip-sync.

Elsaam2y commented 2 months ago

But please feel free to open a PR if you worked on this and managed to get better results.

tailangjun commented 2 months ago

> Yes, thanks for the recommendation. I tried doing so mainly to support Chinese, however the mapping became more complex and the output features weren't always convincing as noticed from the output lip-sync.

Got it, thank you.