JasonBlain closed this issue 7 months ago
I'm not sure Piper ONNX models can actually do that. In the Piper pipeline you first convert text to phonemes using eSpeak NG, then map those phonemes to token ids and run the network on them.
The final output tensor from the network doesn't seem to carry timestamps; it returns only raw PCM audio.
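For context, here's roughly what that call looks like. This is a minimal sketch assuming the Microsoft.ML.OnnxRuntime package and the input names Piper's own exporter uses (`input`, `input_lengths`, `scales`); the scripts in this repo may wrap it differently:

```csharp
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public static class PiperSketch
{
    // phonemeIds: token ids produced from the eSpeak NG phoneme string.
    public static float[] Synthesize(InferenceSession session, long[] phonemeIds)
    {
        var input = new DenseTensor<long>(phonemeIds, new[] { 1, phonemeIds.Length });
        var lengths = new DenseTensor<long>(new long[] { phonemeIds.Length }, new[] { 1 });
        // noise_scale, length_scale, noise_w: Piper's usual inference scales.
        var scales = new DenseTensor<float>(new[] { 0.667f, 1.0f, 0.8f }, new[] { 3 });

        var inputs = new[]
        {
            NamedOnnxValue.CreateFromTensor("input", input),
            NamedOnnxValue.CreateFromTensor("input_lengths", lengths),
            NamedOnnxValue.CreateFromTensor("scales", scales),
        };

        using var results = session.Run(inputs);
        // The graph's only output is raw PCM float audio; no duration tensor.
        return results.First().AsEnumerable<float>().ToArray();
    }
}
```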
I can add a function to get the phonemes from text, but I don't know how to predict their timestamps in the final audio. Please share if you have any ideas on how to do that.
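One hedged idea: VITS-style models like Piper do predict a per-phoneme duration internally (in spectrogram frames) before decoding audio. If the model were re-exported with that duration tensor as a second output, turning frame counts into timestamps would just be arithmetic. A sketch, assuming per-phoneme frame counts plus the hop length (256) and sample rate (22050) typical of Piper's medium voices:

```csharp
public static class PhonemeTiming
{
    // framesPerPhoneme would come from a re-exported model's duration
    // output (VITS's internal per-phoneme frame counts). Hop length 256
    // and sample rate 22050 are typical for Piper's medium voices.
    public static float[] StartTimesSeconds(long[] framesPerPhoneme,
                                            int hopLength = 256,
                                            int sampleRate = 22050)
    {
        var starts = new float[framesPerPhoneme.Length];
        long frameCursor = 0;
        for (int i = 0; i < framesPerPhoneme.Length; i++)
        {
            // Each phoneme starts where the previous phonemes' frames end.
            starts[i] = frameCursor * hopLength / (float)sampleRate;
            frameCursor += framesPerPhoneme[i];
        }
        return starts;
    }
}
```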
I'm working on reshaping the model and will advise.
Thanks so much for your integration of Unity/ONNX and Piper!
I'm trying to see how to get the actual phoneme timings out. I'm reading through how your scripts talk to the ONNX model, and I'm not sure where to hook in, but it would be nice to spit out one extra float array of the individual appended phoneme lengths so the voice output can be visually synced to viseme output on an avatar.
Any clues how to attack that? It's literally just adding one final output float array in the right spot to grab - no need to edit the model or tensors or anything like that.
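For what it's worth, a quick way to check whether there is an extra array to grab is to list the outputs the exported graph actually exposes (again a sketch against Microsoft.ML.OnnxRuntime). Stock Piper voices expose only the audio tensor, which suggests durations would need a model re-export rather than just another read in the script:

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

// Prints each output name and shape; stock Piper voices list only "output".
using var session = new InferenceSession("voice.onnx"); // path is illustrative
foreach (var kv in session.OutputMetadata)
    Console.WriteLine($"{kv.Key}: [{string.Join(", ", kv.Value.Dimensions)}]");
```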