I have a question regarding a possible extension of FreeVC. Would it be possible to train the system so that we have control over the length of the output in the inference step?
I'm envisioning a method similar to the one used for the vertical SR-based augmentation and the speaker embedding.
We could augment the input horizontally (stretching or compressing along the time axis), add a length label (a length embedding, analogous to the speaker embedding), and introduce another projection layer before or after the bottleneck that learns to transform the embeddings between different lengths, conditioned on the length label. The posterior encoder would remain unchanged.
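For concreteness, the horizontal augmentation I have in mind could be sketched roughly like this (a minimal illustration using linear interpolation along the time axis; `stretch_time` and the interpolation choice are my own, not FreeVC code):

```python
import numpy as np

def stretch_time(mel: np.ndarray, rate: float) -> np.ndarray:
    """Resize a mel spectrogram along the time (horizontal) axis.

    mel:  array of shape (n_mels, n_frames).
    rate: > 1.0 stretches (slower speech), < 1.0 compresses (faster speech).
    """
    n_mels, n_frames = mel.shape
    new_frames = max(1, int(round(n_frames * rate)))
    # Positions in the original frame-index space to sample from.
    src = np.linspace(0, n_frames - 1, new_frames)
    out = np.empty((n_mels, new_frames), dtype=mel.dtype)
    for i in range(n_mels):
        out[i] = np.interp(src, np.arange(n_frames), mel[i])
    return out

# During training, each stretched input would be paired with a length
# label (e.g. the rate itself) feeding the hypothetical length embedding.
```

The idea would be that, just as the vertical SR augmentation teaches the model to disentangle speaker characteristics, horizontal stretching plus a length label could teach it to disentangle speech rate.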
Would such a system be possible in theory, so that the speech rate could also be controlled at inference time?