collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

Investigate prompting as a tool to zero-shot condition both the S2A and T2S models #52

Open jpc opened 5 months ago

jpc commented 5 months ago

This could also allow us to:

  1. zero-shot clone the voice (and prosody) of an existing recording
  2. generate some random samples and then freeze the one style we like most for subsequent generations.

fakerybakery commented 5 months ago

Hi, for pt. 2 (freezing one style), have you considered StyleTTS 2's approach (see section B.3)?

> Our findings indicate that style diffusion creates significant variation in samples, a characteristic that poses challenges for long-form synthesis. In this scenario, a long paragraph is usually divided into smaller sentences for generation, sentence by sentence, in the same way as real-time applications. Using an independent style for each sentence may generate speech that appears inconsistent due to differences in speaking styles. Conversely, maintaining the same style from the first sentence throughout the entire paragraph results in monotonic, unnatural, and robotic-sounding speech.
>
> We empirically observe that the latent space underlying the style vectors generally forms a convex space. Consequently, a convex combination of two style vectors yields another style vector, with the speaking style somewhere between the original two. This allows us to condition the style of the current sentence on the previous sentence through a simple convex combination. The pseudocode of this algorithm, which uses interpolated style vectors, is provided in Algorithm 1.
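A minimal NumPy sketch of that running interpolation (the function names and the `alpha = 0.7` weight are placeholders of mine, not from the paper or this repo):

```python
import numpy as np

def interpolate_style(prev_style, curr_style, alpha=0.7):
    # Convex combination: for alpha in [0, 1] the result stays inside
    # the (assumed convex) style latent space, between the two inputs.
    return alpha * prev_style + (1.0 - alpha) * curr_style

def styles_for_paragraph(sentence_styles, alpha=0.7):
    # Condition each sentence's style on the previous (smoothed) one,
    # so long-form speech stays consistent without going monotone.
    smoothed = [sentence_styles[0]]
    for style in sentence_styles[1:]:
        smoothed.append(interpolate_style(smoothed[-1], style, alpha))
    return smoothed
```

A higher `alpha` weights the previous sentence more heavily (more consistency, less variation); `alpha = 0` reduces to independent per-sentence styles and `alpha = 1` to a single frozen style.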

jpc commented 5 months ago

Hey, thanks for the tip. I skimmed the StyleTTS 2 paper before but maybe I'll read it again more carefully. :)