Open wntg opened 1 year ago
Hi,
I think one problem with ASR is how to align the temporal information of speech (i.e., which words fall in which time frames). If you extract the second-to-last layer of the CLAP audio encoder / text encoder, you might be able to obtain a temporal embedding (T, D), where D is the embedding dimension and T is the length of the audio (at some resolution).
I don't think the temporal embedding (T, D) is a good feature for exactly capturing per-frame information, since we don't train with that objective, but it might be helpful if you fine-tune on it.
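The extraction step above can be sketched with a forward hook in PyTorch. Note this is a minimal illustration on a toy stand-in encoder, not the real CLAP architecture — the layer names and shapes of an actual CLAP checkpoint (e.g. from `laion_clap`) will differ, but the hook pattern for grabbing a (T, D) embedding just before the temporal pooling is the same:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a CLAP-style audio encoder: conv layers over
# mel frames, then a temporal pooling that collapses T into one clip vector.
encoder = nn.Sequential(
    nn.Conv1d(64, 256, kernel_size=3, stride=2, padding=1),   # mel bins -> hidden
    nn.ReLU(),
    nn.Conv1d(256, 512, kernel_size=3, stride=2, padding=1),  # second-to-last layer
    nn.AdaptiveAvgPool1d(1),                                  # temporal pooling
)

captured = {}

def grab_temporal(module, inputs, output):
    # output: (batch, D, T); store as (batch, T, D)
    captured["temporal"] = output.transpose(1, 2)

# Hook the layer just before the pooling to intercept the frame-level features.
encoder[2].register_forward_hook(grab_temporal)

mel = torch.randn(1, 64, 100)        # (batch, mel_bins, frames), dummy input
clip_emb = encoder(mel)              # pooled clip-level embedding, as CLAP returns
temporal = captured["temporal"][0]   # (T, D) frame-level embedding
print(temporal.shape)                # here T = 25 after two stride-2 convs, D = 512
```

The key design point: the clip-level CLAP embedding has already pooled away time, so any alignment work has to start from the pre-pooling activations captured this way.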
Thanks for your reply and help. I understand it now.
Thanks for your great work! I want to try training an audio encoder with CLAP and then using this encoder with CTC to fine-tune on ASR datasets. May I ask if this method is feasible, or if there are better suggestions? Thank you.
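The CTC fine-tuning idea in the question can be sketched as a small head on top of the frame-level features. This is an illustrative sketch, not a confirmed recipe from the CLAP authors: the dimensions and the random `features` tensor are placeholders for the (T, D) embeddings you would extract from the CLAP encoder.

```python
import torch
import torch.nn as nn

# CTC head on top of frame-level encoder features (assumed shape (B, T, D)).
D, vocab_size = 512, 32               # vocab includes the CTC blank at index 0
ctc_head = nn.Linear(D, vocab_size)
ctc_loss = nn.CTCLoss(blank=0)

B, T, S = 4, 25, 10
features = torch.randn(B, T, D)       # placeholder for CLAP frame embeddings
# CTCLoss expects log-probs shaped (T, B, V)
log_probs = ctc_head(features).log_softmax(-1).transpose(0, 1)

targets = torch.randint(1, vocab_size, (B, S))        # dummy transcripts
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients reach the head (and the encoder, if it is unfrozen)
print(loss.item())
```

Whether to freeze the CLAP encoder or fine-tune it end-to-end is an open choice; since CLAP was not trained with a frame-alignment objective, unfreezing at least the upper encoder layers is likely necessary for competitive ASR.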