I was thinking that maybe I could download a bunch of videos like this one, along with their transcriptions, and use them to feed Tacotron. Those videos are targeted at pre-K audiences, so the voice actors use particular inflections in some cases, and silly voices. Since those voice actors do this for a living, they might use the same patterns over and over in a sort of automated way, so when they see capitalized words or quote marks they might apply the same inflections or expressiveness.
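In case it helps make the idea concrete, here is a rough sketch (not part of this repo; file names and paths are hypothetical, and it assumes pydub plus ffmpeg are available) of how one might cut a downloaded video's audio into clips using its .srt subtitles and write an LJSpeech-style metadata.csv, which is the kind of (wav, text) pairing many Tacotron implementations expect:

```python
# Sketch only: split one video's audio into clips using its .srt subtitles
# and write an LJSpeech-style metadata.csv (filename|transcript).
# Paths are hypothetical examples; requires pydub and ffmpeg.
import csv
import re
from pathlib import Path

from pydub import AudioSegment

SRT_TIME = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)

def srt_entries(srt_path):
    """Yield (start_ms, end_ms, text) for each subtitle block in the file."""
    blocks = Path(srt_path).read_text(encoding="utf-8").split("\n\n")
    for block in blocks:
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        if len(lines) < 2:
            continue
        # The timestamp line is either the first line or the one after the index.
        m = SRT_TIME.search(lines[1] if lines[0].isdigit() else lines[0])
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
        end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
        text = " ".join(lines[2:] if lines[0].isdigit() else lines[1:])
        yield start, end, text

def build_dataset(video_audio, srt_path, out_dir):
    """Cut the audio at subtitle boundaries and write metadata.csv."""
    out_dir = Path(out_dir)
    (out_dir / "wavs").mkdir(parents=True, exist_ok=True)
    audio = AudioSegment.from_file(str(video_audio))
    with open(out_dir / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        for i, (start, end, text) in enumerate(srt_entries(srt_path)):
            clip = audio[start:end]  # pydub slices by milliseconds
            name = f"clip_{i:05d}"
            clip.export(str(out_dir / "wavs" / f"{name}.wav"), format="wav")
            writer.writerow([name, text])

# Hypothetical usage for a single downloaded episode:
build_dataset("episode01.mp4", "episode01.srt", "dataset")
```

The interesting part is that the transcripts would keep the capitalization and quote marks as-is, so the model would at least see the cues that the actors seem to react to.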
Do you think Tacotron could come up with something that resembles those actors' performances?
How big should the training data be?
As a side note, they just published a paper on the new version of Tacotron; are there any plans for a TensorFlow implementation of that one as well?
Keep up the good work.