NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

TTS - Mixing datasets for FastPitch + HiFiGAN #3688

Closed gedefet closed 2 years ago

gedefet commented 2 years ago

Discussed in https://github.com/NVIDIA/NeMo/discussions/3678

Originally posted by **gedefet** February 15, 2022

Hi All.
At the end of the FastPitch_Finetuning.ipynb tutorial: https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb
it suggests two different ways to improve performance, one of which is:
_Mix new speaker data with old speaker data: We recommend to train fastpitch using both old speaker data (LJSpeech in this notebook) and the new speaker data. In this case, please modify the .json when finetuning fastpitch to include speaker information:_
but it isn't 100% clear how to do that.
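As I understand it, each line of a NeMo manifest is one JSON object, so "include speaker information" should just mean adding a speaker field to every entry, something like this (paths, durations, and texts below are only illustrative):

{"audio_filepath": "LJSpeech-1.1/wavs/LJ001-0001.wav", "duration": 9.65, "text": "printing, in the only sense with which we are at present concerned...", "speaker": 0}
{"audio_filepath": "new_speaker/wavs/0001.wav", "duration": 3.20, "text": "some transcript here", "speaker": 1}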
I found another FastPitch_Finetuning.ipynb here: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_Finetuning.ipynb
which is different from the first one.
It seems to define two functions for doing the mixing, but the colab appears to be abandoned. Nevertheless, I think it has very useful code and that it tackles this dataset-mixing part.
Is this true? Or do I need something more that is not there? Why is it abandoned?
Thanks!
godspirit00 commented 2 years ago

This is what I created to mix the datasets based on my understanding of the tutorial:

import json
import os

old_json = input("Path to the manifest JSON of the old speaker: ").strip()
if old_json == "":
    exit(0)
new_json = input("Path to the manifest JSON of the new speaker: ").strip()
if new_json == "":
    exit(0)
num_hours = input("How many hours? (default 5) ").strip()
num_hours = 5.0 if num_hours == "" else float(num_hours)

print(f"Gathering {num_hours} hours of recording from the old speaker...")
with open(old_json, encoding="utf8") as oif:
    old_lines = oif.readlines()

# Collect old-speaker entries until the requested number of hours is reached.
total_seconds = 0.0
out_old = []
for line in old_lines:
    if line.strip() == "":
        continue
    if total_seconds > num_hours * 60 * 60:
        break
    entry = json.loads(line)
    entry.pop("pred_text", None)  # drop leftover ASR predictions if present
    entry["speaker"] = 0          # old speaker gets speaker ID 0
    out_old.append(json.dumps(entry))
    total_seconds += entry["duration"]

print(f"Gathering {len(out_old)} entries of recording from the new speaker...")
with open(new_json, encoding="utf8") as nif:
    new_lines = [l for l in nif if l.strip() != ""]
if not new_lines:
    print("The new-speaker manifest is empty.")
    exit(1)

# Oversample the new-speaker entries (repeating the list as needed)
# until they match the number of old-speaker entries.
out_new = []
while len(out_new) <= len(out_old):
    for line in new_lines:
        if len(out_new) > len(out_old):
            break
        entry = json.loads(line)
        entry.pop("pred_text", None)
        entry["speaker"] = 1      # new speaker gets speaker ID 1
        out_new.append(json.dumps(entry))

print("Outputting...")
out_dir = os.path.split(new_json)[0]
fn = (
    os.path.split(new_json)[1].replace(".json", "")
    + "--with--"
    + os.path.split(old_json)[1].replace(".json", "")
    + ".json"
)
with open(os.path.join(out_dir, fn), "w", encoding="utf8") as out:
    out.write("\n".join(out_old))
    out.write("\n")
    out.write("\n".join(out_new))

print("Done!")

I'm not sure if it is the best way to do the mixing, but I have tried finetuning fastpitch with it and the result sounds good. Hope this helps.

gedefet commented 2 years ago

@godspirit00 thank you very much.

I got the functions make_sub_file_list() and mix_file_list() from here working: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_Finetuning.ipynb

I think they essentially do the same thing. The first one lets you extract a sub-dataset of n minutes or m audio samples from the complete HiFiTTS.
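For anyone who can't open the colab, the idea of that first function is roughly the following. This is my own re-creation; the name, signature, and details are not the notebook's:

import json

def make_sub_manifest(manifest_path, out_path, max_minutes=None, max_samples=None):
    # Write a subset of a JSON-lines manifest, capped by total minutes
    # and/or by number of entries (hypothetical stand-in for make_sub_file_list()).
    total_sec = 0.0
    kept = []
    with open(manifest_path, encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            if max_minutes is not None and total_sec + entry["duration"] > max_minutes * 60:
                break
            if max_samples is not None and len(kept) >= max_samples:
                break
            kept.append(line.strip())
            total_sec += entry["duration"]
    with open(out_path, "w", encoding="utf8") as f:
        f.write("\n".join(kept) + "\n")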

But it mixes different speakers from within the HiFiTTS dataset, i.e., the new dataset.

I think the idea is to mix it with the LJSpeech dataset that was used to train the checkpoint you download for finetuning, is that correct? And then finetune not on 5 minutes of your audio, but on 5 h of LJSpeech + 30 min of your audio? That would be around 330k training steps for FastPitch.

I'm a bit lost there. Maybe @Oktai15 can throw some light here?

BTW, how did you mix it with the LJSpeech dataset? It consists only of a list of audio files and a .csv.

Thanks!

godspirit00 commented 2 years ago

@gedefet

> BTW, how did you mix it with the LJSpeech dataset? It consists only of a list of audio files and a .csv.

See __process_data() in scripts/dataset_processing/tts/ljspeech/get_data.py.
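
In short: LJSpeech's metadata.csv is pipe-separated (file ID | raw transcript | normalized transcript) and the audio lives under wavs/, so converting it into a manifest is roughly like this. This is a sketch of the idea, not the exact __process_data(); I'm using soundfile here just to read durations:

import csv
import json
import os

import soundfile as sf  # only used to read audio durations

def ljspeech_to_manifest(ljspeech_root, out_path, speaker_id=0):
    # metadata.csv rows look like: LJ001-0001|raw text|normalized text
    meta = os.path.join(ljspeech_root, "metadata.csv")
    with open(meta, encoding="utf8") as f, open(out_path, "w", encoding="utf8") as out:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            file_id, normalized_text = row[0], row[2]
            wav = os.path.join(ljspeech_root, "wavs", file_id + ".wav")
            entry = {
                "audio_filepath": wav,
                "duration": round(sf.info(wav).duration, 2),
                "text": normalized_text,
                "speaker": speaker_id,  # 0 for LJSpeech, so a new speaker can be 1
            }
            out.write(json.dumps(entry) + "\n")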