NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[TTS] Global Style Tokens implementation in FastPitch doesn't follow the original paper #7420

Closed. anferico closed this issue 11 months ago

anferico commented 1 year ago

Describe the bug

The implementation of Global Style Tokens (GSTs) in FastPitch introduced in #6417 does not follow the prescription of the original paper. In particular, the difference lies in the choice of the reference audio for a given training/validation sample <text, ground_truth_audio>: the paper prescribes using the sample's own ground-truth audio as the reference, whereas the NeMo implementation selects a different utterance (from the same speaker) as the reference.

I discovered this after training a FastPitch model with GSTs and observing that the choice of the reference audio at inference time had virtually no impact. More precisely, given a text T and n different reference audios R_1, ..., R_n, passing <T, R_i> for any i in [1, n] would result in almost exactly the same audio A. I say "almost" because the files appeared different when compared with diff, but they had exactly the same length and sounded identical to the ear.

So what I did was modify the code responsible for selecting the reference audio in nemo.collections.tts.data.dataset.TTSDataset.__getitem__, effectively changing this:

reference_audio = self.featurizer.process(self.data[reference_index]["audio_filepath"], ...)

to this:

reference_audio = self.featurizer.process(sample["audio_filepath"], ...)

which is exactly what the GST paper prescribes. In this case, I observed that the choice of the reference audio at inference time did make a difference, however small (although this is probably due to the fact that the training set I used was not very varied in terms of speaking styles).
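For illustration, here is a minimal sketch of where that change sits, assuming a heavily simplified __getitem__; the class name, the same-speaker selection logic, and the field names other than the ones quoted above are illustrative, not NeMo's exact code:

```python
import random

class TTSDatasetSketch:
    """Hypothetical, stripped-down stand-in for TTSDataset; only the
    reference-audio selection is shown, everything else is omitted."""

    def __init__(self, data, featurizer):
        self.data = data              # list of dicts with "audio_filepath", "speaker", ...
        self.featurizer = featurizer  # turns an audio file path into a tensor

    def __getitem__(self, index):
        sample = self.data[index]

        # Current NeMo behavior (as described in this issue): the reference
        # audio is another utterance from the same speaker, not the sample itself.
        same_speaker = [i for i, s in enumerate(self.data)
                        if s.get("speaker") == sample.get("speaker")]
        reference_index = random.choice(same_speaker)
        reference_audio = self.featurizer.process(self.data[reference_index]["audio_filepath"])

        # What the GST paper prescribes (the change applied above): use the
        # sample's own ground-truth audio as the reference.
        # reference_audio = self.featurizer.process(sample["audio_filepath"])

        return sample, reference_audio
```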

My intuition is that if an audio A1 is used to compute the style embedding that conditions the generation of the Mel spectrogram for another audio A2, the model will learn to ignore that style embedding, because the information that can be extracted from A1 is of little use for generating the spectrogram of A2. You could argue that this isn't quite true, since A1 and A2 come from the same speaker and therefore speaker information extracted from A1 could help generate the spectrogram for A2, but FastPitch already contains a SpeakerLookup module that takes care of encoding speaker information.
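To make that conditioning path concrete, here is a rough sketch of how a GST-style module typically derives a style embedding from a reference spectrogram and combines it with the text encoding and a separate speaker embedding; class and parameter names are illustrative and do not correspond to NeMo's actual modules:

```python
import torch
import torch.nn as nn

class GSTSketch(nn.Module):
    """Illustrative GST-style conditioning (not NeMo's implementation): a
    reference encoder summarizes the reference spectrogram, attention over a
    bank of learned style tokens yields a style embedding, and that embedding
    is added to the text encoding together with a separate speaker embedding
    (the role played by FastPitch's speaker lookup)."""

    def __init__(self, n_mel=80, d_model=384, n_tokens=10, n_speakers=1):
        super().__init__()
        self.ref_encoder = nn.GRU(n_mel, d_model, batch_first=True)
        self.style_tokens = nn.Parameter(torch.randn(n_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)

    def forward(self, text_enc, ref_mel, speaker_id):
        # text_enc: (B, T_text, d_model), ref_mel: (B, T_ref, n_mel), speaker_id: (B,)
        _, ref_summary = self.ref_encoder(ref_mel)            # (1, B, d_model)
        query = ref_summary.transpose(0, 1)                   # (B, 1, d_model)
        tokens = self.style_tokens.unsqueeze(0).expand(ref_mel.size(0), -1, -1)
        style_emb, _ = self.attn(query, tokens, tokens)       # (B, 1, d_model)
        spk_emb = self.speaker_emb(speaker_id).unsqueeze(1)   # (B, 1, d_model)
        # If ref_mel is unrelated to the target audio, the model can learn to
        # rely on spk_emb alone and effectively ignore style_emb.
        return text_enc + style_emb + spk_emb
```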

Steps/Code to reproduce bug

  1. Train a FastPitch model using GSTs, for example using the following configuration file: examples/tts/conf/fastpitch_align_44100_adapter.yaml
  2. Perform multiple inference steps where the input text is always the same, but the reference audio is different
  3. Generate audios using a vocoder such as HiFi-GAN

Expected behavior

Choosing different reference audios at inference time results in different audios produced as output.
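As a quick sanity check, the generated waveforms can be compared directly; this is a sketch, and the file names are hypothetical placeholders for the audios produced in step 3 above:

```python
# Sketch: verify whether audios generated from different reference audios
# actually differ. File names are hypothetical placeholders.
import numpy as np
import soundfile as sf

paths = ["output_ref1.wav", "output_ref2.wav", "output_ref3.wav"]
waves = [sf.read(p)[0] for p in paths]

for i in range(len(waves)):
    for j in range(i + 1, len(waves)):
        a, b = waves[i], waves[j]
        n = min(len(a), len(b))
        max_diff = float(np.max(np.abs(a[:n] - b[:n])))
        print(f"{paths[i]} vs {paths[j]}: lengths {len(a)}/{len(b)}, max abs diff {max_diff:.6f}")
```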

Environment overview

Environment details

Additional context

GPU model: RTX 3090 Ti

github-actions[bot] commented 11 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

anferico commented 11 months ago

Any updates? At this stage, I'm just trying to understand whether the difference in the implementation is intentional or whether it's just a bug.

hsiehjackson commented 11 months ago

Solved by PR https://github.com/NVIDIA/NeMo/pull/7788