NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[TTS] Global Style Tokens implementation in FastPitch doesn't follow the original paper #7420

Closed. anferico closed this issue 11 months ago

anferico commented 1 year ago

Describe the bug

The implementation of Global Style Tokens (GSTs) in FastPitch introduced in #6417 does not follow the prescription of the original paper. In particular, the difference lies in the choice of the reference audio for a given training/validation sample <text, ground_truth_audio>: the paper prescribes using the sample's own ground-truth audio as the reference, whereas the NeMo implementation selects a different utterance (from the same speaker) as the reference.

I discovered this after training a FastPitch model with GSTs and observing that the choice of the reference audio at inference time had virtually no impact. More precisely, given a text T and n different reference audios R_1, ..., R_n, passing <T, R_i> for any i in [1, n] would result in almost exactly the same audio A. I say "almost" because the files appeared different when compared with diff, but they had exactly the same length and sounded identical to the ear.

So what I did was modify the code responsible for selecting the reference audio in nemo.collections.tts.data.dataset.TTSDataset.__getitem__, effectively changing this:

reference_audio = self.featurizer.process(self.data[reference_index]["audio_filepath"], ...)

to this:

reference_audio = self.featurizer.process(sample["audio_filepath"], ...)

which is exactly what the GST paper prescribes. In this case, I observed that the choice of the reference audio at inference time did make a difference, however small (although this is probably due to the fact that the training set I used was not very varied in terms of speaking styles).
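For illustration, here is a minimal sketch of where that change sits, assuming a heavily simplified __getitem__; the class name, the same-speaker selection logic, and the field names other than the ones quoted above are illustrative, not NeMo's exact code:

```python
import random

class TTSDatasetSketch:
    """Hypothetical, stripped-down stand-in for TTSDataset; only the
    reference-audio selection is shown, everything else is omitted."""

    def __init__(self, data, featurizer):
        self.data = data              # list of dicts with "audio_filepath", "speaker", ...
        self.featurizer = featurizer  # turns an audio file path into a tensor

    def __getitem__(self, index):
        sample = self.data[index]

        # Current NeMo behavior (as described in this issue): the reference
        # audio is another utterance from the same speaker, not the sample itself.
        same_speaker = [i for i, s in enumerate(self.data)
                        if s.get("speaker") == sample.get("speaker")]
        reference_index = random.choice(same_speaker)
        reference_audio = self.featurizer.process(self.data[reference_index]["audio_filepath"])

        # What the GST paper prescribes (the change applied above): use the
        # sample's own ground-truth audio as the reference.
        # reference_audio = self.featurizer.process(sample["audio_filepath"])

        return sample, reference_audio
```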

My intuition is that if an audio A1 is used to compute the style embedding that conditions the generation of the Mel spectrogram for another audio A2, the model will learn to ignore that style embedding, because the information that can be extracted from A1 is of little use for generating the spectrogram of A2. You could argue that this isn't quite true, since A1 and A2 come from the same speaker and therefore speaker information extracted from A1 could help generate the spectrogram for A2, but FastPitch already contains a SpeakerLookup module that takes care of encoding speaker information.
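To make that conditioning path concrete, here is a rough sketch of how a GST-style module typically derives a style embedding from a reference spectrogram and combines it with the text encoding and a separate speaker embedding; class and parameter names are illustrative and do not correspond to NeMo's actual modules:

```python
import torch
import torch.nn as nn

class GSTSketch(nn.Module):
    """Illustrative GST-style conditioning (not NeMo's implementation): a
    reference encoder summarizes the reference spectrogram, attention over a
    bank of learned style tokens yields a style embedding, and that embedding
    is added to the text encoding together with a separate speaker embedding
    (the role played by FastPitch's speaker lookup)."""

    def __init__(self, n_mel=80, d_model=384, n_tokens=10, n_speakers=1):
        super().__init__()
        self.ref_encoder = nn.GRU(n_mel, d_model, batch_first=True)
        self.style_tokens = nn.Parameter(torch.randn(n_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)

    def forward(self, text_enc, ref_mel, speaker_id):
        # text_enc: (B, T_text, d_model), ref_mel: (B, T_ref, n_mel), speaker_id: (B,)
        _, ref_summary = self.ref_encoder(ref_mel)            # (1, B, d_model)
        query = ref_summary.transpose(0, 1)                   # (B, 1, d_model)
        tokens = self.style_tokens.unsqueeze(0).expand(ref_mel.size(0), -1, -1)
        style_emb, _ = self.attn(query, tokens, tokens)       # (B, 1, d_model)
        spk_emb = self.speaker_emb(speaker_id).unsqueeze(1)   # (B, 1, d_model)
        # If ref_mel is unrelated to the target audio, the model can learn to
        # rely on spk_emb alone and effectively ignore style_emb.
        return text_enc + style_emb + spk_emb
```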

Steps/Code to reproduce bug

  1. Train a FastPitch model using GSTs, for example using the following configuration file: examples/tts/conf/fastpitch_align_44100_adapter.yaml
  2. Perform multiple inference steps where the input text is always the same, but the reference audio is different
  3. Generate audios using a vocoder such as HiFi-GAN

Expected behavior

Choosing different reference audios at inference time results in different audios produced as output.
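As a quick sanity check, the generated waveforms can be compared directly; this is a sketch, and the file names are hypothetical placeholders for the audios produced in step 3 above:

```python
# Sketch: verify whether audios generated from different reference audios
# actually differ. File names are hypothetical placeholders.
import numpy as np
import soundfile as sf

paths = ["output_ref1.wav", "output_ref2.wav", "output_ref3.wav"]
waves = [sf.read(p)[0] for p in paths]

for i in range(len(waves)):
    for j in range(i + 1, len(waves)):
        a, b = waves[i], waves[j]
        n = min(len(a), len(b))
        max_diff = float(np.max(np.abs(a[:n] - b[:n])))
        print(f"{paths[i]} vs {paths[j]}: lengths {len(a)}/{len(b)}, max abs diff {max_diff:.6f}")
```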

Environment overview

Environment details

Additional context

GPU model: RTX 3090 Ti

github-actions[bot] commented 11 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

anferico commented 11 months ago

Any updates? At this stage, I'm just trying to understand whether the difference in the implementation is intentional or whether it's just a bug.

hsiehjackson commented 11 months ago

Solved by PR https://github.com/NVIDIA/NeMo/pull/7788