Fine-tuning QuartzNet 15x5 with a multi-speaker audio dataset

NVIDIA / NeMo

Fine-tuning QuartzNet 15x5 with a multi-speaker audio dataset #2375

Closed by MelikaBahmanabadi 3 years ago

MelikaBahmanabadi commented 3 years ago

Hi, can I use a Persian multi-speaker audio dataset to fine-tune QuartzNet 15x5? Will the WER be as low as when I fine-tune it with a single-speaker dataset? Thanks

Environment overview

Environment location: Google Colab
Method of NeMo install: !pip install nemo_toolkit[asr]
NeMo version: 1.0.0
Learning Rate: 1e-3

Environment details

OS version: "Ubuntu 20.04.3 LTS"
PyTorch version: "1.7.1"

titu1994 commented 3 years ago

The model might train fine, but I think evaluation on other speakers will have poor WER. It can be tried, but for proper generalization, more data from multiple speakers would be useful.
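
One way to check how well a fine-tuned model generalizes across speakers is to compute WER separately for each speaker on a held-out set. The following is a minimal sketch in plain Python, not NeMo's own metric code. The function names `edit_distance` and `wer_by_speaker`, the speaker IDs, and the sample transcripts are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_speaker(samples):
    """Compute WER per speaker.

    samples: iterable of (speaker_id, reference_text, hypothesis_text).
    Returns {speaker_id: total word errors / total reference words}.
    """
    errors = defaultdict(int)
    ref_lengths = defaultdict(int)
    for speaker, reference, hypothesis in samples:
        ref_words = reference.split()
        hyp_words = hypothesis.split()
        errors[speaker] += edit_distance(ref_words, hyp_words)
        ref_lengths[speaker] += len(ref_words)
    # Skip speakers whose references are all empty to avoid division by zero.
    return {s: errors[s] / ref_lengths[s] for s in errors if ref_lengths[s] > 0}


# Made-up transcripts for two hypothetical speakers.
samples = [
    ("spk_a", "the cat sat", "the cat sat"),
    ("spk_b", "the cat sat down", "the bat sat"),
]
print(wer_by_speaker(samples))  # {'spk_a': 0.0, 'spk_b': 0.5}
```

If the per-speaker WER is low only for speakers seen during fine-tuning and much higher for unseen ones, that points to the generalization gap described above.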