Closed erkankaracakan closed 4 years ago
Our model generates speech (mel-spectrograms, then raw audio) directly from lip movements. The network uses Tacotron 2's decoder together with a 3D-CNN-based encoder that takes video frames as input. The network outputs a mel-spectrogram, which is then converted to raw speech by a vocoder. There is no text involved anywhere in the pipeline.
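To make the data flow concrete, here is a minimal shape-level sketch of that text-free pipeline (video frames → 3D-CNN-style encoder → Tacotron-2-style decoder → mel-spectrogram → vocoder → waveform). The function names, dimensions, and random projections below are illustrative stand-ins, not the repository's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, feat_dim=512):
    """Stand-in for the 3D CNN encoder: one feature vector per video frame."""
    t = frames.shape[0]
    flat = frames.reshape(t, -1)                      # flatten H, W, C per frame
    w = rng.standard_normal((flat.shape[1], feat_dim)) * 0.01
    return flat @ w                                   # (T_video, feat_dim)

def decode(features, n_mels=80, frames_per_step=4):
    """Stand-in for the Tacotron 2 decoder: emits mel frames, no text input."""
    w = rng.standard_normal((features.shape[1], n_mels)) * 0.01
    mel_per_video_frame = features @ w                # (T_video, n_mels)
    # audio has a higher frame rate than video, so the decoder
    # produces several mel frames per input video frame
    return np.repeat(mel_per_video_frame, frames_per_step, axis=0)

def vocode(mel, hop_length=200):
    """Stand-in for the vocoder: mel-spectrogram -> raw waveform samples."""
    return rng.standard_normal(mel.shape[0] * hop_length)

frames = rng.random((25, 48, 96, 3))  # ~1 s of video at 25 fps: (T, H, W, C)
mel = decode(encode(frames))          # (100, 80) mel-spectrogram, no text step
wav = vocode(mel)                     # raw waveform samples
```

The key point the sketch makes is that nothing in the chain ever produces or consumes text: the encoder's per-frame features feed the decoder directly, and training only needs paired video and audio.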
First of all thank you for the project.
As I understand it, the project does lip reading to produce text first, and then runs text-to-speech with Tacotron. I'm trying to get the generated text from the lip-reading step. Is that possible?
Also, do I need transcripts of the speech in the videos to train on my own data?
Thank you.