TensorSpeech / TensorFlowTTS

😝 TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0
3.85k stars, 815 forks

FastSpeech2 Speaker IDs do not correlate with the voice at inference #366

OscarVanL closed this issue 4 years ago

OscarVanL commented 4 years ago

Hi,

I'm having some difficulties with inference on my FastSpeech2 model. I do not think the speaker IDs at inference correlate with those in libritts_mapper.json.

For example, my speakers_map says "2999": 65. Speaker 2999 is a British male, but when doing inference with speaker_id 65, the voice is that of an American female speaker instead.

# Assumes `import tensorflow as tf`, a loaded `fastspeech2` model, and `input_ids3` from the text processor.
mel_before, mel_outputs, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids3, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([65], dtype=tf.int32),  # "2999" per libritts_mapper.json
    speed_ratios=tf.convert_to_tensor([1.5], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

My speakers_map also says "8382": 49. Speaker 8382 is a British female, but when doing inference with speaker_id 49 I get an American male.

If I use my own speaker, "1": 50, I get a British-sounding female. That speaker is made from recordings of my own voice (male), so this is also incorrect.

I think somehow the speaker IDs are getting mixed up. Is this something you've seen before?

machineko commented 4 years ago

https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2_libritts/fastspeech2_dataset.py#L119 => here you should load the mapper from your processor's mapper (processors were added later, so this isn't done in the master branch; feel free to fix it :P). A temporary fix is to just use this subfunction as the speaker mapper.
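
A minimal sketch of that temporary workaround, for illustration only. It assumes the training dataset numbers speakers in order of first appearance among the utterance IDs, and that the speaker name is the part of the ID before the first underscore. Both are assumptions about the example dataset code, not something confirmed in this thread. `rebuild_training_speaker_map` and the sample utterance IDs are hypothetical.

```python
def rebuild_training_speaker_map(utt_ids):
    """Recreate a speaker -> index mapping by order of first appearance.

    Assumption: utterance IDs look like "<speaker>_<chapter>_<n>", and the
    training dataset assigned indices in this same order.
    """
    speakers_map = {}
    for utt_id in utt_ids:
        speaker = utt_id.split("_")[0]
        if speaker not in speakers_map:
            # The next free index goes to each newly seen speaker.
            speakers_map[speaker] = len(speakers_map)
    return speakers_map


# Hypothetical utterance IDs, in the order the dataset would list them.
utt_ids = ["8382_1_0001", "2999_4_0007", "8382_1_0002", "1_0_0003"]
training_map = rebuild_training_speaker_map(utt_ids)

# Index to pass as speaker_ids at inference, instead of the value in the JSON mapper.
speaker_id = training_map["2999"]
```

The indices produced this way depend entirely on the order of the utterance list, which is why they need not agree with the values stored in libritts_mapper.json.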

OscarVanL commented 4 years ago

OK, so if I'm understanding correctly, when you process the dataset it creates libritts_mapper.json, but then when training the model this is not used and speaker IDs are re-generated, therefore not matching those in libritts_mapper.json?

machineko commented 4 years ago

> OK, so if I'm understanding correctly, when you process the dataset it creates libritts_mapper.json, but then when training the model this is not used and speaker IDs are re-generated, therefore not matching those in libritts_mapper.json?

Yes, when you train from the examples folder you are not loading the mapper from the processor (as you can see from the # TODO comments all over the place :P)
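
A rough sketch of what loading the mapper from the processor's output could look like, so that training and inference share one table. It assumes the JSON file has a top-level "speakers_map" object mapping speaker names to integer indices, as the entries quoted earlier in this thread suggest. `load_speakers_map` and `speaker_index` are hypothetical helpers, not part of TensorFlowTTS, and the demo writes a throwaway file rather than reading a real libritts_mapper.json.

```python
import json
import os
import tempfile


def load_speakers_map(mapper_path):
    """Read the speaker -> index table written during preprocessing.

    Assumption: the file has a top-level "speakers_map" object whose keys are
    speaker names and whose values are integer indices.
    """
    with open(mapper_path, encoding="utf-8") as f:
        mapper = json.load(f)
    return {str(name): int(idx) for name, idx in mapper["speakers_map"].items()}


def speaker_index(speakers_map, utt_id):
    """Look up the index for an utterance ID of the assumed form "<speaker>_..."."""
    return speakers_map[utt_id.split("_")[0]]


# Demo with a temporary file standing in for libritts_mapper.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "libritts_mapper.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"speakers_map": {"2999": 65, "8382": 49, "1": 50}}, f)
    speakers_map = load_speakers_map(path)

british_male = speaker_index(speakers_map, "2999_4_0007")
```

If the dataset class used a lookup like this instead of regenerating indices, the speaker_id passed at inference would be the same number recorded in the mapper file.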

OscarVanL commented 4 years ago

@GavinStein1 Hi, just letting you know that what I told you in https://github.com/TensorSpeech/TensorFlowTTS/issues/296#issuecomment-723727913 about the speaker_id was not correct, due to this bug. I've made a PR which seems to fix it though :)