Dolyfin opened 1 month ago
Hi @Dolyfin
You can manually jump to step 2 and populate your own CSV files into the relevant boxes, though I appreciate you are talking about something slightly different here. Also, I do intend to document that process a little better in the wiki (finetuning is still on my wiki list to write at some point).
I can't find any details on the ".lab" file format (other than for label printers, which clearly isn't right). Do you have any links to something about the file format, or can you tell me some software that works with it, so I can get a better understanding of what you are suggesting?
Thanks
.lab is just raw text here. It's what Fishspeech uses in its fine-tuning process. You can treat it exactly like a .txt file containing the transcription, just with a different extension.
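For example, the dataset folder just looks something like this (file names are illustrative):

```
dataset/
├── clip_0001.wav
├── clip_0001.lab    <- plain text: "The quick brown fox..."
├── clip_0002.wav
└── clip_0002.lab
```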
Hi @Dolyfin
Sorry for the late reply, but I've been dealing with other things in life for a while; see here: https://github.com/erew123/alltalk_tts/issues/377
So, the underlying requirement for the formatting layout for Coqui XTTS training is set by Coqui's scripts. Please see the reference in their documentation: https://docs.coqui.ai/en/latest/formatting_your_dataset.html#formatting-your-dataset (see the part that says "We recommend the following format delimited by |. In the following example, audio1, audio2 refer to files audio1.wav, audio2.wav etc.").
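To make that concrete, the metadata file they describe ends up looking roughly like this (lines are illustrative, following the shape of their example):

```
audio1|This is my sentence.
audio2|This is maybe a longer sentence.
```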
The only way I can see to handle what you are describing would be to write a bit of script that rips through the .lab files and generates the required Coqui CSV files, which in principle shouldn't be too hard. The only real decision the user would need to make would be the % to use for the evaluation CSV and the % used for the training CSV. It wouldn't be too hard to knock something together along the lines of the sketch below.
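Something like this is roughly what I have in mind (just a sketch: the `metadata_train.csv`/`metadata_eval.csv` names and the `audio_file|text|speaker_name` header are my assumptions from the XTTS recipes, so they would need checking against what the finetuning step actually expects):

```python
# Rough sketch only. Assumptions (not confirmed anywhere in this thread):
# the pipe-delimited "audio_file|text|speaker_name" layout from the XTTS
# fine-tuning recipes, and a single speaker name supplied by the user.
import csv
import random
from pathlib import Path

def lab_folder_to_coqui_csv(dataset_dir, speaker="speaker_1", eval_pct=15, seed=42):
    dataset_dir = Path(dataset_dir)
    rows = []
    for wav in sorted(dataset_dir.glob("*.wav")):
        lab = wav.with_suffix(".lab")
        if not lab.exists():
            print(f"Skipping {wav.name}: no matching .lab file")
            continue
        text = lab.read_text(encoding="utf-8").strip()
        if text:
            # Prefix with "wavs/" here if the trainer expects that layout.
            rows.append([wav.name, text, speaker])

    # Shuffle once, then carve off the user's chosen eval percentage.
    random.Random(seed).shuffle(rows)
    split = max(1, round(len(rows) * eval_pct / 100))
    for name, subset in (("metadata_eval.csv", rows[:split]),
                         ("metadata_train.csv", rows[split:])):
        with open(dataset_dir / name, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="|")
            writer.writerow(["audio_file", "text", "speaker_name"])
            writer.writerows(subset)

if __name__ == "__main__":
    lab_folder_to_coqui_csv("my_dataset", speaker="my_voice", eval_pct=15)
```

That way the only inputs the user would have to provide are the eval percentage and a speaker name.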
Q. I assume you would have one folder containing your whole dataset, populated with your audio and .lab files? And of course, that would be the dataset to convert to the Coqui format.
Thanks
Is your feature request related to a problem? Please describe. I have a dataset with uncommon words that I cannot expect Whisper or any ASR model to transcribe accurately. The dataset is already perfectly transcribed: each audio file has an accompanying .lab (label) file with the manual transcription in raw text.
The dataset-generation step only allows modifying the transcript CSV after the wavs have been automatically split. This makes the existing, perfect transcriptions essentially useless, and merging them in manually is impossible on a large dataset (mine runs to hours of audio).
Describe the solution you'd like An option to skip ASR transcription and use the existing text files instead. If files need to be a certain length for training, make the user responsible for keeping audio files within the set maximum length.
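Checking lengths up front is easy enough on the user's side; a rough sketch with Python's standard wave module (the 11-second cap is just an illustrative placeholder, not a documented limit):

```python
# Rough sketch: flag any .wav that exceeds a chosen maximum length before
# training. MAX_SECONDS is an illustrative placeholder, not a documented limit.
import wave
from pathlib import Path

MAX_SECONDS = 11.0

for wav_path in sorted(Path("my_dataset").glob("*.wav")):
    with wave.open(str(wav_path), "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration > MAX_SECONDS:
        print(f"{wav_path.name}: {duration:.1f}s is over the {MAX_SECONDS}s cap")
```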
Describe alternatives you've considered Modifying the CSV manually afterwards is not practical when the dataset is already fully transcribed.