Guidance on Training a New HuBERT Model on New Languages

🚀 Feature Request

Add official guidance on training a new HuBERT model on new languages.

Motivation

The README says that guidance on training the HuBERT model on new languages will be added soon, but it has not been added yet. Without this documentation, it is difficult to extend the model to languages it does not currently support. I am working on a project that needs to apply HuBERT to a new language, and the missing training guidance is a significant roadblock.

Pitch

Provide comprehensive guidance on training the HuBERT model on new languages. This should include:

Data Preparation: Detailed instructions on how to gather and preprocess training data for a new language, including required formats and recommended preprocessing steps.
Model Configuration: How to configure the HuBERT model for a new language, including any necessary changes to the model architecture or parameters.
Training Procedure: Step-by-step instructions for the training process, including commands, scripts, and important considerations or best practices.
Evaluation: Guidelines for evaluating the model's performance on the new language, including recommended metrics and evaluation protocols.
Examples: Example scripts or notebooks that demonstrate the entire process of training and evaluating the HuBERT model on a new language.

Alternatives

Alternatively, community members could be encouraged to share their experiences and tips on training the HuBERT model on new languages through a dedicated discussion forum or a shared repository of user-contributed guides. However, an official, comprehensive guide would be more reliable and more consistent.

Additional Context

This guidance would significantly benefit researchers and developers working on multilingual speech recognition projects. It would also help extend the HuBERT model to a more diverse set of languages, promoting inclusivity and accessibility in speech technologies.

Thank you for considering this feature request.

facebookresearch / fairseq