Generate an image of a human face based on that person's speech
The general aim of this project is to recreate and improve the Speech-to-Face pipeline presented in the paper Speech2Face: Learning the Face Behind a Voice [1].
The whole implementation is based on the PyTorch framework.
In this project you will find implementations of three models:

- Voice Encoder - here we used two different models (a fine-tuned AST and our convolutional VE_conv model trained from scratch). When using the Speech-to-Face pipeline you can choose which model is used.
- Face Encoder - architecture based on the Deep Face Recognition paper [2]. We did not implement or train this model ourselves; we used the existing trained models VGGFace_serengil [4] and VGGFace16_rcmalli [5]. When using the Speech-to-Face or Face-to-Face pipeline you can choose which model is used.
- Face Decoder - architecture based on the Synthesizing Normalized Faces from Facial Identity Features paper [3]. We trained this model from scratch.
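To make the model choice concrete, here is a minimal sketch of how a configurable voice encoder could be wired up in PyTorch. The class and function names (`VEConv`, `build_voice_encoder`), the 4096-dimensional embedding size, and the spectrogram shape are assumptions for illustration, not the project's actual API:

```python
import torch
import torch.nn as nn

class VEConv(nn.Module):
    """Toy stand-in for a convolutional voice encoder (VE_conv).

    Maps a spectrogram to a face-embedding-sized vector; the real
    architecture is more elaborate, this only shows the data flow.
    """
    def __init__(self, embedding_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # collapse time/frequency axes
            nn.Flatten(),
            nn.Linear(16, embedding_dim),
        )

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        return self.net(spectrogram)

def build_voice_encoder(name: str) -> nn.Module:
    """Select the voice encoder by name, as the pipeline lets you do."""
    if name == "ve_conv":
        return VEConv()
    if name == "ast":
        raise NotImplementedError("load the fine-tuned AST checkpoint here")
    raise ValueError(f"unknown voice encoder: {name}")

encoder = build_voice_encoder("ve_conv")
# (batch, channel, mel bins, frames) - an assumed spectrogram layout
emb = encoder(torch.randn(1, 1, 128, 100))
print(emb.shape)  # torch.Size([1, 4096])
```

A factory like this keeps the pipeline code independent of which encoder variant was trained.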
To read more about the project, go to the page you are interested in:
In the project we used three different datasets: HQ-VoxCeleb [8], VoxCeleb1 [6], and VoxCeleb2 [7]. The HQ-VoxCeleb dataset was used to train the FaceDecoder model. To train the VoiceEncoder model we filtered the VoxCeleb1 and VoxCeleb2 datasets to get audio files for the identities present in HQ-VoxCeleb (HQ-VoxCeleb does not contain normalized face images for every identity present in VoxCeleb1 or VoxCeleb2).
We achieved the best results using a fine-tuned AST as the VoiceEncoder model. Moreover, we used VGGFace_serengil as the FaceEncoder when training the VoiceEncoder and FaceDecoder models. The results obtained with our VE_conv model trained from scratch were much worse. The image below summarizes our work. The left column shows the original image of the person from the HQ-VoxCeleb dataset. The middle column shows the reconstruction of the face from the Face-to-Face pipeline (i.e. convert the image to a face embedding and reconstruct the image; voice is not used in this pipeline). Finally, the right column shows the results of the Speech-to-Face pipeline (i.e. convert the speech to a spectrogram, compute a face embedding from that spectrogram, and reconstruct the face).
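The two pipelines compared above differ only in how the face embedding is produced. The sketch below shows that data flow with dummy stand-ins for the trained models; the function names, the 4096-dimensional embedding, and the 3x224x224 output image are illustrative assumptions:

```python
import torch

EMB_DIM = 4096  # VGGFace-style face embedding size (assumption)

def speech_to_face(spectrogram, voice_encoder, face_decoder):
    """Speech-to-Face: spectrogram -> face embedding -> reconstructed image."""
    with torch.no_grad():
        emb = voice_encoder(spectrogram)
        return face_decoder(emb)

def face_to_face(image, face_encoder, face_decoder):
    """Face-to-Face: image -> face embedding -> reconstructed image (no voice)."""
    with torch.no_grad():
        emb = face_encoder(image)
        return face_decoder(emb)

# Dummy stand-ins for the trained models, kept only to show tensor shapes.
voice_encoder = lambda spec: torch.randn(spec.shape[0], EMB_DIM)
face_encoder = lambda img: torch.randn(img.shape[0], EMB_DIM)
face_decoder = lambda emb: torch.randn(emb.shape[0], 3, 224, 224)

face = speech_to_face(torch.randn(1, 1, 128, 100), voice_encoder, face_decoder)
print(face.shape)  # torch.Size([1, 3, 224, 224])
```

Because both pipelines share the FaceDecoder, the middle column of the figure (Face-to-Face) serves as an upper bound on what the Speech-to-Face column can achieve.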
[1] Oh, Tae-Hyun, et al. "Speech2face: Learning the face behind a voice." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[2] Parkhi, Omkar, Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." BMVC 2015-Proceedings of the British Machine Vision Conference 2015. British Machine Vision Association, 2015.
[3] Cole, Forrester, et al. "Synthesizing normalized faces from facial identity features." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[4] github.com/serengil/deepface
[5] github.com/rcmalli/keras-vggface
[6] VoxCeleb1 dataset: robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html
[7] VoxCeleb2 dataset: robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html
[8] Bai, Yeqi, et al. "Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging." Proceedings of the 30th ACM International Conference on Multimedia. 2022.