gedefet closed this issue 2 years ago
This is what I created to mix the datasets based on my understanding of the tutorial:
import json
import os

oldjson = input("Path to the manifest JSON of the old speaker: ").strip()
if oldjson == "":
    exit(0)
newjson = input("Path to the manifest JSON of the new speaker: ").strip()
if newjson == "":
    exit(0)
num_hours = input("How many hours? (default 5) ").strip()
num_hours = 5.0 if num_hours == "" else float(num_hours)

print(f"Gathering {num_hours} hours of recording from old speaker...")
with open(oldjson, encoding="utf8") as oif:
    oj = oif.readlines()
counter = 0.0  # accumulated audio duration in seconds
outold = []
for o in oj:
    if o.strip() == "":
        continue
    if counter > num_hours * 60 * 60:
        break
    j = json.loads(o)
    j.pop('pred_text', None)  # drop ASR predictions if present
    j['speaker'] = 0          # the old speaker gets id 0
    outold.append(json.dumps(j))
    counter += j['duration']

# Cycle through the new speaker's (smaller) manifest until it has at least
# as many entries as the old speaker's subset.
print(f"Gathering {len(outold)} entries of recording from new speaker...")
with open(newjson, encoding="utf8") as nif:
    nj = nif.readlines()
outnew = []
while len(outnew) <= len(outold):
    for n in nj:
        if n.strip() == "":
            continue
        if len(outnew) > len(outold):
            break
        j = json.loads(n)
        j['speaker'] = 1  # the new speaker gets id 1
        j.pop('pred_text', None)
        outnew.append(json.dumps(j))

print("Outputting...")
outdir = os.path.split(newjson)[0]
fn = (os.path.split(newjson)[1]).replace(".json", "") + "--with--" \
     + (os.path.split(oldjson)[1]).replace(".json", "") + ".json"
with open(os.path.join(outdir, fn), "w", encoding="utf8") as out:
    out.write("\n".join(outold))
    out.write("\n")
    out.write("\n".join(outnew))
print("Done!")
I'm not sure if it is the best way to do the mixing, but I have tried finetuning fastpitch with it and the result sounds good. Hope this helps.
@godspirit00 thank you very much.
I got the functions make_sub_file_list() and mix_file_list() working, from here: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_Finetuning.ipynb. I think they essentially do the same. The first one lets you extract a sub-dataset of n minutes or m audio samples from the complete HiFiTTS.
But it mixes different speakers from the HiFiTTS dataset, that is, the new dataset.
I think the idea is to mix it with the LJSpeech dataset used in the checkpoint you downloaded, the one you finetune from, is that correct? And then do the finetuning not on 5 minutes of your audio but on 5 h of LJSpeech + 30 min of your audio? That would be around 330k training steps for FastPitch.
I'm a bit lost there. Maybe @Oktai15 can throw some light here?
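For what it's worth, my reading of what make_sub_file_list() does is roughly this (my own sketch of the idea, not the notebook's actual code):

import json

def sub_file_list(manifest_path, max_minutes=None, max_samples=None):
    # Keep manifest entries until a duration budget (in minutes) or an
    # entry-count budget is exhausted, whichever comes first.
    entries, total = [], 0.0
    with open(manifest_path, encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            if max_minutes is not None and total + entry["duration"] > max_minutes * 60:
                break
            if max_samples is not None and len(entries) >= max_samples:
                break
            entries.append(entry)
            total += entry["duration"]
    return entries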
BTW, how did you mix it with the LJSpeech dataset? Because it consists only of a list of audio files and a .csv.
Thanks!
@gedefet
> BTW, how did you mix it with the LJSpeech dataset? Because it consists only of a list of audio files and a .csv.

See __process_data() in scripts/dataset_processing/tts/ljspeech/get_data.py.
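The gist of it: metadata.csv is pipe-delimited (file id | raw transcript | normalized transcript), and each row becomes one JSON line with audio_filepath, text, and duration. A rough sketch of the idea (not the actual __process_data() code):

import csv
import json
import os
import wave

def ljspeech_to_manifest(ljspeech_root, out_path, speaker_id=0):
    # Convert LJSpeech's metadata.csv into a line-per-entry JSON manifest.
    meta = os.path.join(ljspeech_root, "metadata.csv")
    with open(meta, encoding="utf8") as f, open(out_path, "w", encoding="utf8") as out:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            wav_path = os.path.join(ljspeech_root, "wavs", row[0] + ".wav")
            # Prefer the normalized transcript when it is present.
            text = row[2] if len(row) > 2 and row[2] else row[1]
            with wave.open(wav_path) as w:
                duration = w.getnframes() / w.getframerate()
            out.write(json.dumps({
                "audio_filepath": wav_path,
                "text": text,
                "duration": duration,
                "speaker": speaker_id,
            }) + "\n")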
Discussed in https://github.com/NVIDIA/NeMo/discussions/3678
At the end of the FastPitch_Finetuning.ipynb tutorial (https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb),
it suggests two different ways to improve performance, one of which is:
_Mix new speaker data with old speaker data: We recommend to train fastpitch using both old speaker data (LJSpeech in this notebook) and the new speaker data. In this case, please modify the .json when finetuning fastpitch to include speaker information:_
but it isn't 100% clear how to do that.
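If I understand the note correctly, it just means giving every manifest entry a speaker id, something like this (my guess, not code from the tutorial):

import json

def tag_speaker(in_path, out_path, speaker_id):
    # Add a "speaker" key to every entry of an existing manifest.
    with open(in_path, encoding="utf8") as fin, open(out_path, "w", encoding="utf8") as fout:
        for line in fin:
            if line.strip():
                entry = json.loads(line)
                entry["speaker"] = speaker_id
                fout.write(json.dumps(entry) + "\n")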
I found another FastPitch_Finetuning.ipynb here: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_Finetuning.ipynb,
which is different from the other one.
It has two functions defined to do the mixing, but this colab seems to be abandoned. Nevertheless, I think it has very useful code and tackles this dataset-mixing part.
Is that true? Or do I need something more that is not there? And why is it abandoned?
Thanks!