[Open] Bagfish opened this issue 4 months ago
Hello @Bagfish,
When it comes to training VoiceEncoder, you need to prepare a directory with a dataset, as described here - datasets (under the S2fDataset entry). In short, you must create a separate directory for each person, and inside each of those directories there must be two additional directories - one for the calculated spectrograms (the audios directory) and one for the calculated face embeddings (the images directory). Such a directory can be used as a training set. If you want to prepare a validation or a test set, just follow the same steps.
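As a rough illustration of that layout, here is a small stdlib-only sketch that validates a dataset directory (the `audios`/`images` names come from the description above; the function name and error format are my own):

```python
from pathlib import Path

def check_s2f_dataset(root):
    """Verify that every person directory contains 'audios' and 'images' subdirectories."""
    problems = []
    for person in sorted(Path(root).iterdir()):
        if not person.is_dir():
            continue
        for required in ("audios", "images"):
            if not (person / required).is_dir():
                problems.append(f"{person.name}: missing '{required}' directory")
    return problems
```

Running it on your dataset root before training is a quick way to catch a person directory that is missing one of the two required subdirectories.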
Note: To calculate spectrograms from audio files, you can use the scripts located here: audio_spectrograms.py (for preprocessing as in the Speech2Face: Learning the Face Behind a Voice paper) and ast_audio_preprocess.py (if you want to use the AST voice encoder). The face embeddings, on the other hand, which must be located in the images directories, must be calculated using the FaceEncoder model - here is the script: image_face_embeddings.py
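For intuition about what audio_spectrograms.py produces, here is a minimal numpy-only sketch of an STFT magnitude spectrogram with the power-law compression (|S|^0.3) used in the Speech2Face paper. The frame/hop sizes are illustrative, not necessarily the repository's exact settings:

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=160):
    """Naive STFT magnitude with a Hann window and power-law compression (|S|**0.3)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    mag = np.array(frames).T       # shape: (n_fft // 2 + 1, n_frames)
    return mag ** 0.3              # power-law compression as in Speech2Face
```

The actual script in the repository is the source of truth for the exact parameters; this only shows the shape of the computation.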
Thank you for your reply, I will follow your guide!!! @Kacper-Pietkun
@Kacper-Pietkun I’m very sorry to bother you again. Could I ask you for the VGG model converted to PyTorch? The reasons I can’t convert it myself are: 1. I can’t find the model download link in https://github.com/serengil/deepface. 2. I haven’t been able to install TensorFlow on my computer.
Here you will find the PyTorch weights for the VGGFace_serengil model: https://drive.google.com/drive/u/2/folders/1DCqvpZYkd0chupA3mQeCVS7p69WAjnER
I get it!!! Thank you very much!!!
@Kacper-Pietkun When I train the VoiceEncoder, the loss becomes NaN. I really can't find what the problem is.
NaN loss can occur when training the ve_conv model. Try training the ast model instead. You can also experiment with the coe_1, coe_2 and coe_3 coefficients, as well as the learning_rate hyperparameter.
@Kacper-Pietkun Thank you for your reply. 1. I have already trained the FaceDecoder model, but I didn't freeze the model weights. How can I freeze the FaceDecoder weights? 2. I will try the ast model instead. 3. I will try other hyperparameters. Thank you very much!!!!
During the first part, the whole model is frozen except the head, which is trained. During the second part, the whole model is unfrozen and fine-tuned.
What does this mean? In the first step, which arguments in train/train_ast.py should I set? And how do I fine-tune using train/train_ast.py - is "python train/train_ast.py --fine-tune" enough?
@Kacper-Pietkun Thank you for your reply. 1. I have already trained the FaceDecoder model, but I didn't freeze the model weights. How can I freeze the FaceDecoder weights? 2. I will try the ast model instead. 3. I will try other hyperparameters. Thank you very much!!!!
Actually, I was wondering if you had trained the FaceDecoder model beforehand, because it is necessary to calculate the loss function. You don't need to do anything extra to "freeze" the FaceDecoder weights, because the optimizer was created to optimize only the VoiceEncoder model's weights.
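To make this concrete, here is a hedged sketch (with toy stand-in modules, not the repository's actual classes) of why the FaceDecoder stays fixed when the optimizer only receives the VoiceEncoder's parameters. Calling requires_grad_(False) on the decoder is an optional extra that also skips its gradient computation:

```python
import torch
import torch.nn as nn

voice_encoder = nn.Linear(8, 4)   # toy stand-in for VoiceEncoder
face_decoder = nn.Linear(4, 2)    # toy stand-in for the pretrained FaceDecoder

# Optional: skip gradient accumulation for the decoder entirely.
face_decoder.requires_grad_(False)

# The optimizer only sees the encoder's parameters, so only they are updated.
optimizer = torch.optim.SGD(voice_encoder.parameters(), lr=0.1)

before = face_decoder.weight.clone()
loss = face_decoder(voice_encoder(torch.randn(1, 8))).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
assert torch.equal(face_decoder.weight, before)  # decoder weights unchanged
```

Gradients still flow *through* the frozen decoder back into the encoder, which is exactly what is needed to compute the loss against the decoder's outputs.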
During the first part, the whole model is frozen except the head, which is trained. During the second part, the whole model is unfrozen and fine-tuned. What does this mean? In the first step, which arguments in train/train_ast.py should I set? And how do I fine-tune using train/train_ast.py - is "python train/train_ast.py --fine-tune" enough?
Okay, so basically it looks like this. In the training script, the AST VoiceEncoder model is downloaded from HuggingFace's transformers library, along with the pretrained weights. However, to adjust the model to the problem of generating voice embedding vectors, it needs a new "head", so that the last layer's output dimension is equal to 4096 (just like the face embedding vector size).
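The head swap can be sketched as follows. This uses a toy stand-in for the pretrained AST model (the real one comes from HuggingFace transformers, and the `classifier` attribute name here is illustrative; see the linked lines for the actual code):

```python
import torch.nn as nn

FACE_EMBEDDING_DIM = 4096  # face embedding vector size

# Toy stand-in for the pretrained AST model.
class TinyAST(nn.Module):
    def __init__(self, hidden_dim=768, num_labels=527):
        super().__init__()
        self.backbone = nn.Linear(16, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):
        return self.classifier(self.backbone(x))

model = TinyAST()
hidden = model.classifier.in_features
# Swap the head so the output dimension matches the face embedding size.
model.classifier = nn.Linear(hidden, FACE_EMBEDDING_DIM)
```

After the swap, the backbone keeps its pretrained weights while the new head starts from scratch, which is why the head is trained separately first.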
Here are the lines of code from the training script which are responsible for downloading the model and swapping its "head": https://github.com/Kacper-Pietkun/Speech-to-face/blob/e0e32af54fd4b433705b77b6eba961a972c84336/src/train/train_ast.py#L242-L248
So, I split the AST VoiceEncoder model training into two stages.
Here are a few lines from the training script which are responsible for freezing all of the model's parameters except the head. Additionally, you can see that the head's parameters are initialized with a truncated normal distribution: https://github.com/Kacper-Pietkun/Speech-to-face/blob/e0e32af54fd4b433705b77b6eba961a972c84336/src/train/train_ast.py#L249-L254
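The pattern in those lines looks roughly like this minimal sketch (toy model; the `classifier` name and the std value are illustrative, the linked code is authoritative):

```python
import torch
import torch.nn as nn

# Toy model: 'classifier' plays the role of the new head.
model = nn.Sequential()
model.add_module("backbone", nn.Linear(16, 8))
model.add_module("classifier", nn.Linear(8, 4096))

# Stage one: freeze everything, then unfreeze and re-initialize only the head.
for param in model.parameters():
    param.requires_grad = False
for param in model.classifier.parameters():
    param.requires_grad = True
nn.init.trunc_normal_(model.classifier.weight, std=0.02)

# Only the head's parameters remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Initializing the fresh head with a truncated normal distribution avoids extreme initial weights, which helps keep early training stable.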
For the second stage there is an --unfreeze-number parameter with which you can control how many layers are unfrozen (actually, this parameter specifies from which layer the model should be unfrozen). Generally, to run the second stage, beyond all of the other necessary parameters like --train-dataset-path, --face-decoder-weights-path and so on, you need to pass these parameters to the script:
--fine-tune - used as a flag to mark that the model's head was already trained
--continue-training-path - used to specify the path to the weights of the ast model (the one whose head was already trained)
--unfreeze-number - this one is optional, because by default when fine-tuning, the whole model is unfrozen. But as I said, you can use it as a hyperparameter. During my research I achieved the best results when I unfroze the model from the 165th layer.
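One possible reading of "unfreeze from this layer on" can be sketched like this (the function name and the parameter-index interpretation are my own; check the training script for how --unfreeze-number is actually applied):

```python
import torch.nn as nn

def unfreeze_from(model, unfreeze_number):
    """Freeze parameters whose index is below `unfreeze_number`; unfreeze the rest.

    Illustrative interpretation of --unfreeze-number: everything before the
    given index stays frozen, everything from it onward is fine-tuned.
    """
    for i, param in enumerate(model.parameters()):
        param.requires_grad = i >= unfreeze_number
```

With this interpretation, unfreezing from the 165th layer keeps the early, generic feature extractors fixed while adapting the later layers to the voice embedding task.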
Dear sir, thank you for your PyTorch-based implementation! I want to train the model, but I can't understand how to prepare the training data. In the paper, the speech and face images are paired, but in the first README I only see vox1, vox2 and HQvox - which dataset is used to generate the face vectors?