Any updates? At this stage, I'm just trying to understand whether the difference in the implementation is intentional or whether it's a bug.
Solved by PR https://github.com/NVIDIA/NeMo/pull/7788
Describe the bug
The implementation of Global Style Tokens (GSTs) in FastPitch introduced in #6417 does not follow the prescription of the original paper. In particular, the difference lies in the choice of the reference audio for a given training/validation sample <text, ground_truth_audio>.

I discovered this after training a FastPitch model with GSTs and observing that the choice of the reference audio at inference time had virtually no impact at all. More precisely, given a text T and n different reference audios R_1, ..., R_n, passing <T, R_i> for any i in [1, n] would produce almost exactly the same audio A. I say "almost" because the files appeared different when compared with diff, but they had the exact same length and sounded identical to the ear.

So I modified the code responsible for selecting the reference audio in nemo.collections.tts.data.dataset.TTSDataset.__getitem__, effectively changing this:

reference_audio = self.featurizer.process(self.data[reference_index]["audio_filepath"], ...)

to this:

reference_audio = self.featurizer.process(sample["audio_filepath"], ...)

which is exactly what the GST paper prescribes. With this change, the choice of the reference audio at inference time did make a difference, however small (probably because the training set I used was not very varied in terms of speaking styles).
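To make the difference concrete, here is a minimal sketch contrasting the two selection strategies inside a simplified __getitem__. This is not the actual NeMo code: the class, the featurizer interface, and the random choice of reference_index are assumptions made purely for illustration.

```python
import random


class SketchTTSDataset:
    """Illustrative stand-in for TTSDataset; not the real NeMo class."""

    def __init__(self, data, featurizer):
        self.data = data              # list of dicts with an "audio_filepath" key
        self.featurizer = featurizer  # exposes .process(filepath) -> waveform tensor

    def __getitem__(self, index):
        sample = self.data[index]

        # Current behaviour (as described above): the reference audio comes from a
        # *different* sample; how reference_index is chosen is simplified here to a
        # random draw, purely for illustration.
        reference_index = random.randrange(len(self.data))
        reference_audio_current = self.featurizer.process(
            self.data[reference_index]["audio_filepath"]
        )

        # Behaviour prescribed by the GST paper: the reference audio is the
        # ground-truth audio of the sample itself.
        reference_audio_gst = self.featurizer.process(sample["audio_filepath"])

        return sample, reference_audio_current, reference_audio_gst
```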
My intuition is that if an audio A1 is used to compute the style embedding that conditions the generation of the Mel spectrogram for a different audio A2, the model will learn to ignore that style embedding, because the information that can be extracted from A1 is useless for generating the spectrogram of A2. One could argue that this isn't quite true, since A1 and A2 come from the same speaker, so speaker information extracted from A1 could still help generate the spectrogram for A2; however, FastPitch already contains a SpeakerLookup module that takes care of encoding speaker information.
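To illustrate that intuition, the sketch below shows, in heavily simplified and generic form (invented module names and dimensions, not NeMo's implementation), how a GST-style style embedding typically conditions the text encoding: a reference Mel spectrogram is compressed into a single style vector that is broadcast-added to the text encoder outputs. If that vector is computed from an unrelated utterance, it carries little information about the target spectrogram, and the model can learn to ignore the added term.

```python
import torch
import torch.nn as nn


class ToyStyleEncoder(nn.Module):
    """Toy reference encoder: collapses a reference mel into one style vector."""

    def __init__(self, n_mels=80, d_model=384):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)

    def forward(self, ref_mel):              # ref_mel: (batch, frames, n_mels)
        frame_feats = self.proj(ref_mel)     # (batch, frames, d_model)
        return frame_feats.mean(dim=1)       # one style vector per utterance


def condition_text_encoding(text_enc, style_emb):
    # text_enc: (batch, tokens, d_model), style_emb: (batch, d_model).
    # If the style vector tells the decoder nothing about the target utterance,
    # training can simply learn to make this added term irrelevant.
    return text_enc + style_emb.unsqueeze(1)


style_encoder = ToyStyleEncoder()
ref_mel = torch.randn(2, 200, 80)            # fake reference spectrograms
text_enc = torch.randn(2, 50, 384)           # fake text encoder outputs
conditioned = condition_text_encoding(text_enc, style_encoder(ref_mel))
print(conditioned.shape)                     # torch.Size([2, 50, 384])
```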
Steps/Code to reproduce bug
Train a FastPitch model with GSTs using examples/tts/conf/fastpitch_align_44100_adapter.yaml, then synthesize the same text with several different reference audios.
Expected behavior
Choosing different reference audios at inference time should result in noticeably different output audio.
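As a rough illustration of how one might check this, the snippet below synthesizes the same text with two different references and verifies that the resulting waveforms actually differ; synthesize is a hypothetical placeholder for the model's inference call, not a real NeMo API.

```python
import numpy as np


def outputs_differ(synthesize, text, ref_a, ref_b, tol=1e-6):
    # `synthesize(text, reference_audio=...)` is a hypothetical callable that
    # returns a waveform; substitute whatever inference entry point the model exposes.
    audio_a = np.asarray(synthesize(text, reference_audio=ref_a))
    audio_b = np.asarray(synthesize(text, reference_audio=ref_b))
    if audio_a.shape != audio_b.shape:
        return True
    return not np.allclose(audio_a, audio_b, atol=tol)
```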
Environment overview
Environment details
Additional context
GPU model: RTX 3090 Ti