LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal
1.42k stars 137 forks

Finetune for ASR #121

Open wntg opened 1 year ago

wntg commented 1 year ago

Thanks for your great work! I want to train an audio encoder with CLAP and then fine-tune that encoder on ASR datasets using CTC. Is this approach feasible, or do you have better suggestions? Thank you.

RetroCirce commented 1 year ago

Hi,

I think one problem with ASR is how to align the temporal information in speech (such as words occurring in different time frames). If you extract the output of the second-to-last layer of the CLAP audio encoder / text encoder, you might be able to obtain a temporal embedding of shape (T, D), where D is the embedding dimension and T is the length of the audio (at some reduced time resolution).

I don't think this temporal embedding (T, D) is a good feature for precisely capturing the information in each time frame, since we don't train with that objective, but it might still be helpful if you fine-tune on top of it or try something similar.
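
To make the idea concrete, here is a minimal pure-Python sketch of what a CTC head on top of a (T, D) temporal embedding would compute. It does not call any CLAP API: `frames` is a placeholder for the per-frame embeddings, and `project`, `ctc_nll`, `greedy_decode`, and the choice of index 0 as the blank symbol are all hypothetical names and assumptions made for illustration. In practice you would more likely use a framework implementation such as `torch.nn.CTCLoss` rather than this hand-written forward algorithm.

```python
import math

BLANK = 0  # assumption: vocabulary index 0 is reserved for the CTC blank


def log_softmax(row):
    """Numerically stable log-softmax over one frame's logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def project(frames, weights):
    """Map (T, D) frame embeddings to (T, V) per-frame log-probabilities.

    `weights` is a list of V rows, each of length D (a linear CTC head).
    """
    out = []
    for frame in frames:
        logits = [sum(f * w for f, w in zip(frame, row)) for row in weights]
        out.append(log_softmax(logits))
    return out


def logaddexp(a, b):
    """log(exp(a) + exp(b)), handling -inf without raising."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_nll(log_probs, target):
    """Negative log-likelihood of `target` under CTC (forward algorithm)."""
    # Interleave blanks: [b, y1, b, y2, ..., b]
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]
    size = len(ext)

    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            acc = alpha[s]                      # stay on the same symbol
            if s >= 1:
                acc = logaddexp(acc, alpha[s - 1])  # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                acc = logaddexp(acc, alpha[s - 2])
            if acc != -math.inf:
                new[s] = acc + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    total = alpha[size - 1]
    if size > 1:
        total = logaddexp(total, alpha[size - 2])
    return -total


def greedy_decode(log_probs):
    """Best-path decoding: per-frame argmax, merge repeats, drop blanks."""
    out, prev = [], BLANK
    for row in log_probs:
        k = max(range(len(row)), key=row.__getitem__)
        if k != BLANK and k != prev:
            out.append(k)
        prev = k
    return out


# Toy usage: T=2 frames, D=2 dims, V=3 symbols (blank plus two labels).
frames = [[0.0, 0.0], [0.0, 0.0]]
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
loss = ctc_nll(project(frames, weights), [1])
```

Note that the loss is finite only when T is large enough to fit the target (including a blank between repeated labels), so the time resolution of the (T, D) embedding limits how long a transcript each clip can carry.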

wntg commented 1 year ago

Thanks for your reply and help. I understand now.