erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd-party software via JSON calls.
GNU Affero General Public License v3.0

Finetuning with existing transcripts in large datasets. #363

Open Dolyfin opened 1 month ago

Dolyfin commented 1 month ago

Is your feature request related to a problem? Please describe. I have a dataset with uncommon words that I cannot expect Whisper or any ASR model to be able to transcribe accurately. The dataset is already perfectly transcribed: each audio file has an accompanying .lab (label) file containing the manual transcription as raw text.

The dataset-generation step only allows editing the transcript CSV after the wavs are automatically split. This makes the perfect existing transcriptions essentially useless, and combining them in manually is impossible on large datasets (hours of audio).

Describe the solution you'd like An option to skip ASR transcription and use existing text files instead. If files need to be a certain length for training, make the user responsible for keeping audio files within the set maximum audio length.

Describe alternatives you've considered Modifying the CSV manually afterwards is not practical when the dataset is already fully transcribed.

erew123 commented 1 month ago

Hi @Dolyfin

You can manually jump to step 2 and populate your own CSV files into the relevant boxes; however, I appreciate you are talking about something slightly different here. I also intend to document that process a little better in the wiki (finetuning is still on my wiki list to write at some point).

I can't find any details on the "lab" file format (other than for label printers, which clearly isn't right). Have you any links to something about the file format, or can you tell me some software that works with it, so I can get a better understanding of what you are suggesting?

Thanks

Dolyfin commented 1 month ago

.lab here is just raw text. It's what Fishspeech uses in its finetuning process. I would just assume .txt for text instead.

erew123 commented 1 month ago

Hi @Dolyfin

Sorry for the late reply, but I've been dealing with other things in life for a while, see here: https://github.com/erew123/alltalk_tts/issues/377

So, the underlying formatting requirement for Coqui XTTS training is set by Coqui's scripts. Please see the reference in their documentation: https://docs.coqui.ai/en/latest/formatting_your_dataset.html#formatting-your-dataset (see the part that says "We recommend the following format delimited by |. In the following example, audio1, audio2 refer to files audio1.wav, audio2.wav etc.")
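For anyone reading along, the recommended layout from that Coqui page is a single pipe-delimited metadata file next to the wav files, roughly like this (the filenames and sentences below are illustrative, adapted from their docs):

```
# metadata.csv (one line per clip, delimited by |)
audio1|This is my sentence.
audio2|This is maybe my sentence.
audio3|This is a sentence with a number: 1.
```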

The only way I can see to handle what you are describing would be to write a bit of script that runs through the .lab files and generates the required Coqui CSV files, which in principle shouldn't be too hard. The only real decision the user would need to make is the percentage of data to use for the Evaluation CSV versus the Training CSV. It wouldn't be too hard to knock something together to do this.
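A minimal sketch of such a script might look like the following. This is not code from alltalk_tts; the `audio_file|text|speaker_name` header, the `wavs/` path prefix, and the output filenames are assumptions about what the finetuning step expects, so adjust them to match your Coqui recipe:

```python
import csv
import random
from pathlib import Path

def lab_to_coqui_csv(dataset_dir, out_dir, eval_pct=0.15,
                     speaker="speaker1", seed=0):
    """Pair each .wav with its same-named .lab transcript and write
    pipe-delimited train/eval metadata CSVs in a Coqui-style layout.

    Assumed (not confirmed by the thread): the header row, the
    "wavs/" prefix, and the output filenames below.
    """
    dataset_dir, out_dir = Path(dataset_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for wav in sorted(dataset_dir.glob("*.wav")):
        lab = wav.with_suffix(".lab")
        if not lab.exists():
            continue  # skip audio with no matching transcript
        text = lab.read_text(encoding="utf-8").strip()
        rows.append((f"wavs/{wav.name}", text, speaker))

    # Deterministic shuffle, then split by the user's chosen eval %.
    random.Random(seed).shuffle(rows)
    n_eval = max(1, int(len(rows) * eval_pct)) if rows else 0
    splits = {"metadata_eval.csv": rows[:n_eval],
              "metadata_train.csv": rows[n_eval:]}

    for name, split in splits.items():
        with open(out_dir / name, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="|")
            writer.writerow(["audio_file", "text", "speaker_name"])
            writer.writerows(split)
    return {name: len(split) for name, split in splits.items()}
```

The user would point it at the folder holding the wav/lab pairs, pick an eval percentage, and get the two CSVs back.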

Q. I assume you have one folder containing your whole dataset, populated with your audio and .lab files? That, of course, would be the dataset to convert to the Coqui format.

Thanks