Closed zhoujxwilliam closed 1 month ago

Can this project be used to train dialects spoken in China, such as Min Nan and Tibetan? Whisper does not include tokenization for these dialects. Would I need to write the text-tokenization front end myself? Also, roughly how much data (in hours of audio) is needed to fine-tune a single dialect? Thank you!

Absolutely. For the purposes of this project, each dialect can be treated as a distinct language. As for text tokenization, there is no need to build this component from scratch; see the guide at https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages/blob/3f8a0c5f0efa3eb10c12074d2433f1e754087c60/Readme.md#4-vocabulary-extension-and-configuration-adjustment

In my experience, a minimum of 50 hours of audio is recommended to achieve satisfactory results when fine-tuning a single dialect.
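For reference, the "vocabulary extension" step in the linked guide amounts to adding a new language tag and dialect-specific tokens to the existing tokenizer vocabulary without disturbing the ids of tokens already present. The sketch below is illustrative only: the real XTTSv2 tokenizer is a BPE tokenizer serialized in a vocab.json file, while here the vocabulary is modeled as a plain token-to-id mapping; the helper name `extend_vocab` and the tokens (e.g. `[min-nan]`) are hypothetical, not the project's actual identifiers.

```python
# Conceptual sketch of vocabulary extension for a new dialect.
# The vocabulary is modeled as a simple token -> id mapping; the real
# tokenizer stores a BPE vocabulary in vocab.json, but the extension
# principle is the same: append new entries with fresh, non-conflicting ids.

def extend_vocab(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Append unseen tokens to the vocabulary, keeping existing ids stable."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values()) + 1 if vocab else 0
    for tok in new_tokens:
        if tok not in vocab:  # never reassign an existing token's id
            vocab[tok] = next_id
            next_id += 1
    return vocab

# Hypothetical base vocabulary, extended with a dialect language tag and
# dialect-specific subword units mined from a dialect text corpus.
base = {"[zh]": 0, "ni": 1, "hao": 2}
extended = extend_vocab(base, ["[min-nan]", "gua", "li"])
```

In practice the same idea is applied to the tokenizer file itself (and the model's embedding table must be resized to match the enlarged vocabulary), as described in the linked guide.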