TencentGameMate / chinese_speech_pretrain

chinese speech pretrained models

New pretrained model support? #2

Closed leon2milan closed 2 years ago

leon2milan commented 2 years ago

Will you support new models? For example:

pengchengguo commented 2 years ago

Hi @leon2milan, we are not planning to train CPC-, APC-, or VQ-VAE-based self-supervised speech pre-trained models. Instead, we would like to train a WavLM or Data2Vec model on our Mandarin speech data in the future. As for d-vector or x-vector models, I think they are task-specific models trained with labeled data. We aim to release unsupervised speech pre-trained models (upstream models) that researchers can use for their downstream tasks.

leon2milan commented 2 years ago

@pengchengguo Thank you for your reply. For wav2vec, I want to compare forced-alignment results between wav2vec and MFA. The model needs to be loaded like this:

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

I think the last layer of the model is (lm_head): Linear(in_features=768, out_features=32, bias=True), so I should be able to do forced alignment. Are facebook's vocab.json and tokenizer_config.json files suitable for your models? If not, could you release your own?

LiuShixing commented 2 years ago

The vocab.json of facebook/wav2vec2 is not suitable for our models. Our model does not have a tokenizer because it was pretrained on audio alone. You can refer to https://huggingface.co/blog/fine-tune-wav2vec2-english to create a Wav2Vec2CTCTokenizer and Wav2Vec2Processor for Chinese.
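For anyone following along, here is a minimal sketch of what creating a tokenizer and processor could look like, loosely following the approach in the linked blog post. It assumes a character-level vocabulary built from your own labeled transcripts. The helper names (`build_char_vocab`, `save_vocab`, `make_processor`), the file name `vocab.json`, and the feature extractor settings are illustrative assumptions, not something shipped with this repo.

```python
import json


def build_char_vocab(transcripts):
    """Build a character-level CTC vocabulary from transcript strings."""
    chars = sorted({ch for text in transcripts for ch in text})
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Use "|" as a visible word delimiter in place of a space, if any.
    # (Assumes "|" itself does not occur in the transcripts.)
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Special tokens: unknown character, and padding (used as the CTC blank).
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


def save_vocab(vocab, path="vocab.json"):
    """Write the vocabulary to disk, keeping Chinese characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)


def make_processor(vocab_path="vocab.json"):
    """Wrap the vocabulary in a tokenizer and processor (needs transformers)."""
    # Imported lazily so the vocabulary helpers work without transformers.
    from transformers import (
        Wav2Vec2CTCTokenizer,
        Wav2Vec2FeatureExtractor,
        Wav2Vec2Processor,
    )

    tokenizer = Wav2Vec2CTCTokenizer(
        vocab_path,
        unk_token="[UNK]",
        pad_token="[PAD]",
        word_delimiter_token="|",
    )
    # Assumed settings for 16 kHz mono audio; adjust to match the checkpoint.
    feature_extractor = Wav2Vec2FeatureExtractor(
        feature_size=1,
        sampling_rate=16000,
        padding_value=0.0,
        do_normalize=True,
        return_attention_mask=False,
    )
    return Wav2Vec2Processor(
        feature_extractor=feature_extractor, tokenizer=tokenizer
    )
```

The vocabulary size then determines the output dimension of the CTC head that has to be trained during fine-tuning.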

leon2milan commented 2 years ago

@LiuShixing Thank you very much.

cjgdo commented 2 years ago

Does the above mean that I need to build the vocabulary myself in order to get a Wav2Vec2CTCTokenizer?

pengchengguo commented 2 years ago

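As I understand it, a Wav2Vec2CTC model is a Wav2Vec2 model that has been fine-tuned on supervised data with a CTC loss. We only provide unsupervised pre-trained models, so if you need a Wav2Vec2CTC model, you will have to do the fine-tuning yourself.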
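To illustrate why a fine-tuned CTC head matters for the forced-alignment use case raised earlier in this thread: alignment needs per-frame log-probabilities over a vocabulary, which only a CTC-trained output layer provides. Below is a rough, pure-Python sketch of Viterbi alignment over such outputs. The function name `ctc_forced_align` and the input layout (a list of per-frame log-probability lists) are assumptions for illustration; this is not code from this repo or from any particular library.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Return the most likely label id per frame for a known token sequence.

    log_probs: list of T frames, each a list of log-probabilities per label.
    tokens:    target label ids (without blanks), in order.
    """
    # Interleave blanks: blank, t1, blank, t2, ..., blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    n_frames, n_states = len(log_probs), len(ext)
    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][ext[0]]
    if n_states > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, n_frames):
        for s in range(n_states):
            # Allowed moves: stay, advance by one, or skip a blank
            # between two different tokens.
            cands = [(score[t - 1][s], s)]
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            if best > neg_inf:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev

    # A path must end in the final blank or the final token.
    s = n_states - 1
    if n_states > 1 and score[-1][n_states - 2] > score[-1][s]:
        s = n_states - 2
    if score[-1][s] == neg_inf:
        raise ValueError("not enough frames to align the token sequence")

    # Trace back from the last frame to recover one label per frame.
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    path.reverse()
    return path
```

In practice, `log_probs` would come from applying a log-softmax to the logits of a fine-tuned CTC model, and frame indices can be converted to timestamps using the model's frame stride.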

LiuShixing commented 2 years ago

Yes.
