NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Speaker id argument #7

Open ilnmtlbnm opened 4 years ago

ilnmtlbnm commented 4 years ago

There is a Speaker id argument in inference.py : parser.add_argument('-i', '--id', help='Speaker id', type=int).

Whenever I try to change it to something other than 0, I get the following error :


Traceback (most recent call last):
  File "inference.py", line 122, in <module>
    args.n_frames, args.sigma, args.seed)
  File "inference.py", line 63, in infer
    speaker_vecs = trainset.get_speaker_id(speaker_id).cuda()
  File "/data/code/flowtron/data.py", line 83, in get_speaker_id
    return torch.LongTensor([self.speaker_ids[int(speaker_id)]])
KeyError: 2

karkirowle commented 4 years ago

If you are using the LJS model, that is expected, since it is a single-speaker model. You could try the LibriTTS model instead.
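To illustrate why a single-speaker model produces this KeyError, here is a minimal sketch of the lookup that the traceback points at. It assumes the dataset builds a dict from the speaker IDs found in the training filelist, as the `self.speaker_ids[int(speaker_id)]` line suggests. The names `build_speaker_lookup` and `get_speaker_row` are hypothetical stand-ins, not Flowtron's API, and the row ordering is illustrative only.

```python
def build_speaker_lookup(raw_ids):
    # Hypothetical helper: map each distinct speaker ID to a row index.
    unique_ids = sorted({int(i) for i in raw_ids})
    return {sid: row for row, sid in enumerate(unique_ids)}


def get_speaker_row(lookup, speaker_id):
    # Same dict access as in the traceback, with a more helpful error message.
    try:
        return lookup[int(speaker_id)]
    except KeyError:
        raise KeyError(
            f"speaker id {speaker_id} is not in the filelist; "
            f"valid ids: {sorted(lookup)}"
        ) from None


# Assumption: a single-speaker filelist tags every utterance with speaker 0.
ljs_lookup = build_speaker_lookup(["0", "0", "0"])
print(get_speaker_row(ljs_lookup, 0))
```

With this lookup, any ID other than 0 (for example `-i 2`) has no entry, which matches the `KeyError: 2` in the report.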

Quasimondo commented 4 years ago

Just a note: when using LibriTTS you will also have to change the n_speakers parameter in config.json to 123:

"model_config": {
    "n_speakers": 123,
    "n_speaker_dim": 128,
    "n_text": 185,
    "n_text_dim": 512,
    "n_flows": 2,
    "n_mel_channels": 80,
    "n_attn_channels": 640,
    "n_hidden": 1024,
    "n_lstm_layers": 2,
    "mel_encoder_n_hidden": 512,
    "n_components": 0,
    "mean_scale": 0.0,
    "fixed_gaussian": true,
    "dummy_speaker_embedding": false,
    "use_gate_layer": true
}
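If you prefer not to edit the file by hand, the change above could be applied with a few lines of Python. This is only a sketch: `set_n_speakers` is a hypothetical helper, and the starting value of 1 in the example dict is a placeholder rather than a value taken from the repository.

```python
import copy


def set_n_speakers(config, n_speakers):
    # Return a copy of the config with model_config.n_speakers replaced.
    updated = copy.deepcopy(config)
    updated["model_config"]["n_speakers"] = n_speakers
    return updated


# Placeholder config fragment; a real config.json has more sections and keys.
config = {"model_config": {"n_speakers": 1, "n_speaker_dim": 128}}
patched = set_n_speakers(config, 123)
print(patched["model_config"]["n_speakers"])
```

For a real file, load it with `json.load`, pass the resulting dict through the helper, and write it back with `json.dump`.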

ilnmtlbnm commented 4 years ago

Of course, thanks @karkirowle! And thanks @Quasimondo for specifying n_speakers for LibriTTS.

ilnmtlbnm commented 4 years ago

D'oh! I closed this too fast: it still doesn't work with LibriTTS.

python inference.py -c config.json -f models/flowtron_libritts.pt -w models/waveglow_256channels_v4.pt -t "But the machine only creates what humans have taught it to " -i 15 -n 777 -s 0.5

Quasimondo commented 4 years ago

Yeah - I realized that you will also have to adjust the "data_config" section: "training_files": "filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt"

And lastly, you will have to pick a speaker ID that actually exists. They are not numbered consecutively, so you have to look them up in that filelist (the ID is the number at the end of each line).
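Looking up the IDs by eye is tedious, so here is a small sketch that collects them. It assumes each filelist line has the form `audiopath|text|speaker_id` with `|` as the separator, which the filelist name suggests but which you should verify against your copy. The function name and the sample lines are hypothetical.

```python
def valid_speaker_ids(lines, sep="|"):
    # Collect the distinct integer IDs found in the last field of each line.
    ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ids.add(int(line.rsplit(sep, 1)[-1]))
    return sorted(ids)


# Made-up sample lines in the assumed format, not real filelist entries.
sample = [
    "audio/a.wav|first sentence|40",
    "audio/b.wav|second sentence|1088",
    "audio/c.wav|third sentence|40",
]
print(valid_speaker_ids(sample))  # → [40, 1088]
```

To run it on the real filelist, open the file and pass the file object in place of `sample`; any value given to `-i` should then be one of the returned IDs.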

ilnmtlbnm commented 4 years ago

Thanks again @Quasimondo

For reference, here are the valid ids for LibriTTS :


40 78 83 87 118 125 196 200 250 254 374 405 446 460 587 669 696 730 831 887 1069 1088 1116 1246 1263
 1502 1578 1841 1867 1963 1970 2092 2136 2182 2196 2289 2416 2436 2836 2843 2911 2952 3240 3242 3259
 3436 3486 3526 3664 3857 3879 3982 3983 4018 4051 4088 4160 4195 4267 4297 4362 4397 4406 4640 4680
 4788 5022 5104 5322 5339 5393 5652 5678 5703 5750 5808 6019 6064 6078 6081 6147 6181 6209 6272 6367
 6385 6415 6437 6454 6476 6529 6818 6836 6848 7059 7067 7078 7178 7190 7226 7278 7302 7367 7402 7447
 7505 7511 7794 7800 8051 8088 8098 8108 8123 8238 8312 8324 8419 8468 8609 8629 8770 8838

rafaelvalle commented 4 years ago

Thank you for compiling this list!

yhgon commented 4 years ago

I added a script that extracts the available speaker IDs (sid). See below:

https://github.com/yhgon/flowtron/blob/master/inference_colab.ipynb

# Notebook cell (Colab). The line starting with "!" is IPython shell syntax
# and will not run in a plain Python script.
import numpy as np
import pandas as pd
from data import load_filepaths_and_text

# Preview a few rows of the speaker metadata, skipping the leading comment lines.
!cat /content/flowtron/filelists/libritts_speakerinfo.txt | tail -n +12 | head -n 10

filelist_path = "/content/flowtron/filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt"

def create_speaker_lookup_table(audiopaths_and_text):
    # Map each speaker ID in the filelist (third field) to a consecutive index.
    speaker_ids = np.sort(np.unique([x[2] for x in audiopaths_and_text]))
    d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))}
    print("Number of speakers :", len(d))
    return d

audiopaths_and_text = load_filepaths_and_text(filelist_path)
# Keep only the keys: these are the speaker IDs present in the training filelist.
speaker_ids = set(create_speaker_lookup_table(audiopaths_and_text))
print(sorted(speaker_ids))

# Speaker metadata: '|'-separated columns; lines starting with ';' are comments.
speakers = pd.read_csv('/content/flowtron/filelists/libritts_speakerinfo.txt', engine='python', header=None, comment=';', sep=r' *\| *', names=['ID', 'SEX', 'SUBSET', 'MINUTES', 'NAME'])
# Mark speakers that are missing from the training filelist with -1.
speakers['FLOWTRON_ID'] = speakers['ID'].apply(lambda x: x if x in speaker_ids else -1)

# Keep usable speakers with more than 20 minutes of audio; sample(frac=1) shuffles them.
female_speakers = speakers.query("SEX == 'F' and MINUTES > 20 and FLOWTRON_ID >= 0")['FLOWTRON_ID'].sample(frac=1).tolist()
male_speakers = speakers.query("SEX == 'M' and MINUTES > 20 and FLOWTRON_ID >= 0")['FLOWTRON_ID'].sample(frac=1).tolist()

print("female speakers : ", len(female_speakers), female_speakers)
print("male speakers   : ", len(male_speakers), male_speakers)