jpc opened 5 months ago
Hi, for pt. 2 (freezing one style), have you considered StyleTTS 2's approach (see section B.3)?
> Our findings indicate that style diffusion creates significant variation in samples, a characteristic that poses challenges for long-form synthesis. In this scenario, a long paragraph is usually divided into smaller sentences for generation, sentence by sentence, in the same way as real-time applications. Using an independent style for each sentence may generate speech that appears inconsistent due to differences in speaking styles. Conversely, maintaining the same style from the first sentence throughout the entire paragraph results in monotonic, unnatural, and robotic-sounding speech.
>
> We empirically observe that the latent space underlying the style vectors generally forms a convex space. Consequently, a convex combination of two style vectors yields another style vector, with the speaking style somewhere between the original two. This allows us to condition the style of the current sentence on the previous sentence through a simple convex combination. The pseudocode of this algorithm, which uses interpolated style vectors, is provided in Algorithm 1.
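The interpolation step they describe is easy to sketch. Here's a minimal illustration (not the paper's actual Algorithm 1 — the mixing weight `alpha`, the helper names, and the NumPy representation of style vectors are all my assumptions):

```python
import numpy as np

def interpolate_style(prev_style: np.ndarray, new_style: np.ndarray,
                      alpha: float = 0.7) -> np.ndarray:
    """Convex combination of the previous sentence's style vector and a
    freshly sampled one. alpha=1 would freeze the first style (monotone);
    alpha=0 would use fully independent styles (inconsistent)."""
    return alpha * prev_style + (1.0 - alpha) * new_style

def longform_styles(sampled_styles: list, alpha: float = 0.7) -> list:
    """Hypothetical long-form loop: carry the interpolated style forward,
    sentence by sentence, so each sentence's style is conditioned on the
    previous one."""
    styles = [sampled_styles[0]]
    for fresh in sampled_styles[1:]:
        styles.append(interpolate_style(styles[-1], fresh, alpha))
    return styles
```

So each sentence still gets some of the diffusion sampler's variation, but it's pulled toward the previous sentence's style, which should smooth out the sentence-to-sentence jumps.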
Hey, thanks for the tip. I skimmed the StyleTTS 2 paper before but maybe I'll read it again more carefully. :)
This could also allow us to: