[Open] Bagfish opened this issue 4 months ago
Hello @Bagfish,
When it comes to training VoiceEncoder, you need to prepare a directory with a dataset, as described here - datasets (under the S2fDataset entry). In short, you must create a separate directory for each person, and inside each of those directories there must be two additional directories - one for the calculated spectrograms (the audios directory) and one for the calculated face embeddings (the images directory). Such a directory can be used as a training set. If you want to prepare a validation or a test set, just follow the same steps.
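As a rough illustration of that layout, here is a small stdlib-only sketch that validates a dataset directory (the `audios`/`images` names come from the description above; the function name and error format are my own):

```python
from pathlib import Path

def check_s2f_dataset(root):
    """Verify that every person directory contains 'audios' and 'images' subdirectories."""
    problems = []
    for person in sorted(Path(root).iterdir()):
        if not person.is_dir():
            continue
        for required in ("audios", "images"):
            if not (person / required).is_dir():
                problems.append(f"{person.name}: missing '{required}' directory")
    return problems
```

Running it on your dataset root before training is a quick way to catch a person directory that is missing one of the two required subdirectories.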
Note: To calculate spectrograms from audio files, you can use the scripts located here: audio_spectrograms.py (for preprocessing as in the Speech2Face: Learning the Face Behind a Voice paper) and ast_audio_preprocess.py (if you want to use the AST voice encoder). The face embeddings, on the other hand, which must be located in the images directories, must be calculated using the FaceEncoder model - here is the script: image_face_embeddings.py
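For intuition about what audio_spectrograms.py produces, here is a minimal numpy-only sketch of an STFT magnitude spectrogram with the power-law compression (|S|^0.3) used in the Speech2Face paper. The frame/hop sizes are illustrative, not necessarily the repository's exact settings:

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=160):
    """Naive STFT magnitude with a Hann window and power-law compression (|S|**0.3)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    mag = np.array(frames).T       # shape: (n_fft // 2 + 1, n_frames)
    return mag ** 0.3              # power-law compression as in Speech2Face
```

The actual script in the repository is the source of truth for the exact parameters; this only shows the shape of the computation.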
Thank you for your reply, I will follow your guide!!! @Kacper-Pietkun
@Kacper-Pietkun I’m very sorry to bother you again. Could I ask you for the VGG model converted to PyTorch? The reasons I can’t convert it myself are: 1. I can’t find the model download link in https://github.com/serengil/deepface. 2. I haven’t been able to install TensorFlow on my computer.
Here you will find the PyTorch weights for the VGGFace_serengil model: https://drive.google.com/drive/u/2/folders/1DCqvpZYkd0chupA3mQeCVS7p69WAjnER
I get it!!! Thank you very much!!!
@Kacper-Pietkun When I train the VoiceEncoder, the loss becomes NaN. I really can't find what the problem is.
NaN loss can occur when training the ve_conv model. Try training the ast model instead. You can also experiment with the coe_1, coe_2 and coe_3 coefficients, as well as the learning_rate hyperparameter.
@Kacper-Pietkun Thank you for your reply. 1. I have already trained the FaceDecoder model, but I didn't freeze the model weights. How can I freeze the FaceDecoder weights? 2. I will try the ast model instead. 3. I will try other hyperparameters. Thank you very much!!!!
During the first part, the whole model is frozen except the head, which is trained. During the second part, the whole model is unfrozen and fine-tuned.
What does this mean? In the first step, which arguments in train/train_ast.py should I set? And how do I fine-tune using train/train_ast.py - is "python train/train_ast.py --fine-tune" enough?
@Kacper-Pietkun Thank you for your reply. 1. I have already trained the FaceDecoder model, but I didn't freeze the model weights. How can I freeze the FaceDecoder weights? 2. I will try the ast model instead. 3. I will try other hyperparameters. Thank you very much!!!!
Actually, I was wondering if you had trained the FaceDecoder model beforehand, because it is necessary to calculate the loss function. You don't need to do anything extra to "freeze" the FaceDecoder weights, because the optimizer was created to optimize only the VoiceEncoder model's weights.
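To make this concrete, here is a hedged sketch (with toy stand-in modules, not the repository's actual classes) of why the FaceDecoder stays fixed when the optimizer only receives the VoiceEncoder's parameters. Calling requires_grad_(False) on the decoder is an optional extra that also skips its gradient computation:

```python
import torch
import torch.nn as nn

voice_encoder = nn.Linear(8, 4)   # toy stand-in for VoiceEncoder
face_decoder = nn.Linear(4, 2)    # toy stand-in for the pretrained FaceDecoder

# Optional: skip gradient accumulation for the decoder entirely.
face_decoder.requires_grad_(False)

# The optimizer only sees the encoder's parameters, so only they are updated.
optimizer = torch.optim.SGD(voice_encoder.parameters(), lr=0.1)

before = face_decoder.weight.clone()
loss = face_decoder(voice_encoder(torch.randn(1, 8))).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
assert torch.equal(face_decoder.weight, before)  # decoder weights unchanged
```

Gradients still flow *through* the frozen decoder back into the encoder, which is exactly what is needed to compute the loss against the decoder's outputs.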
During the first part, the whole model is frozen except the head, which is trained. During the second part, the whole model is unfrozen and fine-tuned. What does this mean? In the first step, which arguments in train/train_ast.py should I set? And how do I fine-tune using train/train_ast.py - is "python train/train_ast.py --fine-tune" enough?
Okay, so basically it looks like this. In the training script, the AST VoiceEncoder model is downloaded from HuggingFace's transformers library, along with the pretrained weights. However, to adjust the model to the problem of generating voice embedding vectors, it needs a new "head", so that the last layer's output dimension is equal to 4096 (just like the face embedding vector size).
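The head swap can be sketched as follows. This uses a toy stand-in for the pretrained AST model (the real one comes from HuggingFace transformers, and the `classifier` attribute name here is illustrative; see the linked lines for the actual code):

```python
import torch.nn as nn

FACE_EMBEDDING_DIM = 4096  # face embedding vector size

# Toy stand-in for the pretrained AST model.
class TinyAST(nn.Module):
    def __init__(self, hidden_dim=768, num_labels=527):
        super().__init__()
        self.backbone = nn.Linear(16, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):
        return self.classifier(self.backbone(x))

model = TinyAST()
hidden = model.classifier.in_features
# Swap the head so the output dimension matches the face embedding size.
model.classifier = nn.Linear(hidden, FACE_EMBEDDING_DIM)
```

After the swap, the backbone keeps its pretrained weights while the new head starts from scratch, which is why the head is trained separately first.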
Here are the lines of code from the training script which are responsible for downloading the model and swapping its "head": https://github.com/Kacper-Pietkun/Speech-to-face/blob/e0e32af54fd4b433705b77b6eba961a972c84336/src/train/train_ast.py#L242-L248
So, I split the AST VoiceEncoder model training into two stages.
Here are a few lines from the training script which are responsible for freezing all of the model's parameters except the head. Additionally, you can see that the head's parameters are initialized with a truncated normal distribution: https://github.com/Kacper-Pietkun/Speech-to-face/blob/e0e32af54fd4b433705b77b6eba961a972c84336/src/train/train_ast.py#L249-L254
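The pattern in those lines looks roughly like this minimal sketch (toy model; the `classifier` name and the std value are illustrative, the linked code is authoritative):

```python
import torch
import torch.nn as nn

# Toy model: 'classifier' plays the role of the new head.
model = nn.Sequential()
model.add_module("backbone", nn.Linear(16, 8))
model.add_module("classifier", nn.Linear(8, 4096))

# Stage one: freeze everything, then unfreeze and re-initialize only the head.
for param in model.parameters():
    param.requires_grad = False
for param in model.classifier.parameters():
    param.requires_grad = True
nn.init.trunc_normal_(model.classifier.weight, std=0.02)

# Only the head's parameters remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Initializing the fresh head with a truncated normal distribution avoids extreme initial weights, which helps keep early training stable.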
For the second stage there is an --unfreeze-number parameter with which you can control how many layers are unfrozen (actually, this parameter specifies from which layer the model should be unfrozen). Generally, to run the second stage, beyond all of the other necessary parameters like --train-dataset-path, --face-decoder-weights-path and so on, you need to pass these parameters to the script:
--fine-tune - used as a flag to mark that the model's head was already trained
--continue-training-path - used to specify the path to the weights of the ast model (the one whose head was already trained)
--unfreeze-number - this one is optional, because by default when fine-tuning, the whole model is unfrozen. But as I said, you can use it as a hyperparameter. During my research I achieved the best results when I unfroze the model from the 165th layer.
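One possible reading of "unfreeze from this layer on" can be sketched like this (the function name and the parameter-index interpretation are my own; check the training script for how --unfreeze-number is actually applied):

```python
import torch.nn as nn

def unfreeze_from(model, unfreeze_number):
    """Freeze parameters whose index is below `unfreeze_number`; unfreeze the rest.

    Illustrative interpretation of --unfreeze-number: everything before the
    given index stays frozen, everything from it onward is fine-tuned.
    """
    for i, param in enumerate(model.parameters()):
        param.requires_grad = i >= unfreeze_number
```

With this interpretation, unfreezing from the 165th layer keeps the early, generic feature extractors fixed while adapting the later layers to the voice embedding task.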
Dear sir, thank you for your PyTorch-based implementation! I want to train the model, but I can't understand how to prepare the training data. In the paper, the speech and face images are paired, but in the first README I only see vox1, vox2 and HQvox - which dataset is used to generate the face vectors?