Closed lucasjinreal closed 2 weeks ago
EAT is a self-supervised audio model pre-trained on AudioSet. Since it has not been pre-trained on speech-specific data, it cannot be fine-tuned directly for ASR tasks. Nevertheless, you could experiment with pre-training the model on speech datasets and then fine-tuning it for ASR, to evaluate how well it transfers to the speech modality.
How to use this for ASR?