wav2vec features - Githubissues

lidachuan211 commented 1 year ago

audio wav2vec features support for Chinese ？

Elsaam2y commented 1 year ago

Will work on it in the next few days. Will keep you posted.

lidachuan211 commented 1 year ago

thanks

skyler14 commented 1 year ago

out of curiousity whats the total variety of languages wav2vec, how would have compare to like wavlm?

Elsaam2y commented 1 year ago

Both models, wav2vec and WavLM, are widely recognized for their capabilities in Automatic Speech Recognition (ASR) tasks and they utilize the LibriSpeech Dataset for training purposes. However, when it comes to audio feature extraction, wav2vec stands out as the preferred choice as it is an end2end approach, beside its flexibly and the architecture which is designed to extract high-level speech features directly from raw audio waveforms. In contrast, WavLM is primarily oriented towards the generation of speech from text inputs.

foxyear-kyumin commented 1 year ago

How was wav2vecDS.pt trained? If I want to train my own dataset for wav2vecDS.pt. I tried using other pre-trained wav2vec.pt models, but they don’t seem to work.

Elsaam2y commented 1 year ago

@qiu8888 To avoid any confusion, wav2vecDS.pt is a torch model which I trained using the class _Wav2vecDS to learn a mapping between wav2vec features to deep speech features. This way I can use wav2vec ASR models (in this case HubertASR )with the trained DINet model without causing any issues, since the model was trained on DeepSpeech v0.1.0 and it is very slow. However, I will update the readme and add instructions anyway for training the mapping model on your own dataset in the next few days if needed. But keep in mind that I am not retraining the wav2vec ASR model itself, only the mapping is needed here. For more Information about the wav2vec ASR models and how these models were trained, please refer to their documentation here.

QuantJia commented 1 year ago

Hi, the work released?

Elsaam2y commented 1 year ago

@lidachuan211 @QuantJia I could find a solution for the Chinese but it doesn't work well for other languages. For the moment trying a different solution by using newer versions of DeepSpeech and converting the model to onnx for inference optimization. I am working only for few time unfortunately on this but will share a solution hopefully soon.

Luckygyana commented 1 year ago

@Elsaam2y did you solved for onnx for inference optimization?

davidmartinrius commented 1 year ago

@Elsaam2y did you solved for onnx for inference optimization?

I tried to export DINet to onnx, but I couldn't. I finally used pytorch script, anyway the performance is the same. The only advantage is that I can load it to nvidia triton server. You can add this code after model.eval() in inference.py

audio_feature_dim = ds_feature.shape[1]
# Adjust the batch size to match the model's expected batch size (2)
dummy_audio = torch.randn(1, 29, audio_feature_dim).cuda().float()
dummy_ref_image = torch.randn(1, 15, resize_h, resize_w).cuda().float()
dummy_frame_image = torch.randn(1, 3, resize_h, resize_w).cuda().float()

# Ensure that input data is of float32 data type
dummy_audio = dummy_audio.to(torch.float32)
dummy_ref_image = dummy_ref_image.to(torch.float32)
dummy_frame_image = dummy_frame_image.to(torch.float32)

# Specify the ONNX file name
traced_model = torch.jit.trace(model, (dummy_frame_image, dummy_ref_image, dummy_audio))

# Specify the name for the TorchScript model file
scripted_model_filename = "DINet_scripted.pt"

# Save the TorchScript model to the specified file
traced_model.save(scripted_model_filename)

for inference you can replace this: model = DINet(opt.source_channel, opt.ref_channel, opt.audio_channel).cuda() by this: model = torch.jit.load("DINet_scripted.pt").eval()

For onnx I tried this, but didn't work:

` onnx_file_name = "DINet.onnx"

# Export the model to ONNX
torch.onnx.export(
    model,
    (dummy_frame_image, dummy_ref_image, dummy_audio),
    onnx_file_name,
    verbose=True,
    input_names=["input_frame", "input_ref", "input_audio"],
    output_names=["output"],
    opset_version=17,  # ONNX opset version (adjust as needed)

)`

Elsaam2y commented 1 year ago

@Luckygyana @davidmartinrius the onnx conversion for this model is a bit tricky since it contains some operations which are not supported by onnx and requires some modifications to make it work. Furthermore, it won't boost the inference speed significantly and will be almost the same since native torch models are already fast, unless you were planning to integrate it with some other models and prefer to use onnx for ease of development.

foxyear-kyumin commented 1 year ago

为了避免任何混淆，wav2vecDS.pt 是一个火炬模型，我使用类_Wav2vecDS对其进行了训练，以学习wav2vec特征与深度语音特征之间的映射。这样，我可以将wav2vec ASR模型（在本例中为HubertASR）与经过训练的DINet模型一起使用而不会引起任何问题，因为该模型是在DeepSpeech v0.1.0上训练的，而且速度非常慢。但是，如果需要，我将在接下来的几天内更新自述文件并添加说明，以便在您自己的数据集上训练映射模型。但请记住，我不是在重新训练 wav2vec ASR 模型本身，这里只需要映射。有关 wav2vec ASR 模型以及如何训练这些模型的更多信息，请参阅此处的文档。

I have retrained SyncNet by mapping the wav2vec features with deep speech features using your wav2vecDS.pt model, and the synchronization performance has improved somewhat. However, I would like to try the latest DeepSpeech model, but it has significant differences in parameters and output structure compared to v0.10. Can you help with that?

Elsaam2y commented 1 year ago

@qiu8888 I am working at the moment with deep speech 0.6.0. If my tests passed fine, I will prepare a new mapping and push it. Which version did you try for deepspeech, 0.9.1? and did you notice significant difference wrt speed compared to 0.1.0?

9bitss commented 1 year ago

any luck with your tests?

ketyi commented 1 year ago

Hi @Elsaam2y,

Have you tried completely replacing the deep speech features with wav2vec2 features and retraining SyncNet and DINet with that?

Elsaam2y commented 1 year ago

@9bitss unfortunately need to replace and retrain DINet with the latest model for deepspeech. The first try didn't work fine and the model didn't perform well. One alternative solution is to learn mapping from the latest deepspeech model to the current one used to avoid retraining. Didn't have time to test this yet but it should work theoritically.

Elsaam2y commented 1 year ago

@ketyi with wav2vec we will need model per language as the one for english won't perform well on other languages. This would add more complexity to the pipeline and hence I tried retraining but with latest version of deepspeech as it supports onnx and gpu.

ketyi commented 1 year ago

@Elsaam2y but you are already using wav2vec in the pipeline, so I don't get your point.

Elsaam2y commented 1 year ago

@ketyi Sorry for my late response. I mean that retraining the syncnet and the model on wav2vec features would still have some issues regarding the generalization. When I used wav2vec I didn't realize at the beginning its problems with some languages, and hence was focusing recently on updating the model to use the latest version of deep speech instead.

Inferencer commented 1 year ago

@ketyi Sorry for my late response. I mean that retraining the syncnet and the model on wav2vec features would still have some issues regarding the generalization. When I used wav2vec I didn't realize at the beginning its problems with some languages, and hence was focusing recently on updating the model to use the latest version of deep speech instead.

really looking forward to the latest deepspeech whats the ETA on training the mapping to work with it? for reference what version of deepspeech was DINet using before?

Elsaam2y commented 1 year ago

At the moment I am quite busy with some other projects so I would give it an estimation of few weeks.

einsqing commented 9 months ago

Big guy, what's the progress, when Shmoa support onnx?

tailangjun commented 7 months ago

很好奇，为什么推理时使用 wav2vec + wav2vecDS生成的音频特征，但是训练时用的却是 deepspeech。两者不应该都采用 wav2vec + wav2vecDS吗？看了上面的记录，好像是训练时使用 wav2vec + wav2vecDS对中文支持好了，但是其他语言又变差了，不知道我理解的对不对。如果只需要支持中文，是不是训练和推理时都使用 wav2vec + wav2vecDS，效果会比较好

I'm curious why the audio features generated by wav2vec + wav2vecDS are used during inference, but deepspeech is used during training. Shouldn't both be using wav2vec + wav2vecDS? After reading the above records, it seems that using wav2vec + wav2vecDS during training has improved support for Chinese, but other languages have become worse. I don’t know if I understand it correctly. If you only need to support Chinese, should you use wav2vec + wav2vecDS for both training and inference? The effect will be better

tailangjun commented 7 months ago

为了任何混乱，wav2vecDS.pt是一个火炬模型，我使用类_Wav2vecDS避免进行了训练，以学习wav2vec特征与深度语音特征之间的映射。这样，我将wav2vec ASR模型（在本例中为）HuberASR）与经过训练的DINet模型一起使用而不会引起任何问题，因为该模型是在DeepSpeech v0.1.0上训练的，而且速度非常慢。但是，如果需要，我将在接下来的几天内更新自文件并添加说明，方便在您自己的数据集上训练映射模型。但请记住，我不是在重新训练 wav2vec ASR 模型本身，这里只需要映射。有关 wav2vec ASR 模型以及如何训练这些模型的更多信息，请参阅此处的文档。

我通过使用您的 wav2vecDS.pt 模型将 wav2vec 特征与深度语音特征映射来重新训练 SyncNet，并且同步性能有所提高。不过，我想尝试一下最新的 DeepSpeech 模型，但它与 v0.10 相比在参数和输出结构上有显着差异。你能帮忙吗？

我尝试过多个版本的 DeepSpeech的pb，发现他们的输出维度和v0.1是不同的，应该是从某个版本开始就发生了改变。训练时报错 ValueError: Cannot feed value of shape (1, 96, 494) for Tensor 'deepspeech/input_node:0', which has shape '(1, 16, 19, 26)'

我尝试的 DeepSpeech版本有这些

deepspeech-0.4.1-output_graph.pb tags里面有，可以直接下载
deepspeech-0.8.2-output_graph.pb tags里面没有，使用工具从 pbmm转换得到
deepspeech-0.9.2-output_graph.pb tags里面没有，使用工具从 pbmm转换得到
deepspeech-0.9.3-zh-CN-output_graph.pb tags里面没有，使用工具从 pbmm转换得到

pbmm转pb的工具

https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/public/mozilla-deepspeech-0.8.2/pbmm_to_pb.py

tailangjun commented 7 months ago

@qiu8888 To avoid any confusion, wav2vecDS.pt is a torch model which I trained using the class _Wav2vecDS to learn a mapping between wav2vec features to deep speech features. This way I can use wav2vec ASR models (in this case HubertASR )with the trained DINet model without causing any issues, since the model was trained on DeepSpeech v0.1.0 and it is very slow. However, I will update the readme and add instructions anyway for training the mapping model on your own dataset in the next few days if needed. But keep in mind that I am not retraining the wav2vec ASR model itself, only the mapping is needed here. For more Information about the wav2vec ASR models and how these models were trained, please refer to their documentation here.

请问你训练 wav2vecDS用的是什么数据集，我如果要用自己的数据集来训练，需要如何制作数据集呢？另外对数据集是否有语言要求

May I ask which dataset you used to train wav2vecDS? If I want to use my own data set for training, how do I make it? In addition, are there any language requirements for the dataset?

Elsaam2y / DINet_optimized

wav2vec features #1