antoniomuso / speech2face

MIT License

Error running Speech2Face.ipynb #1

Open martindisley opened 3 years ago

martindisley commented 3 years ago

Hi Antonio,

I've been trying to reproduce your work on Google Colab. I've downloaded all the datasets by adding the following cell:

import os  # for the os.path.isfile checks below (if not already imported earlier in the notebook)

if not os.path.isfile('/content/drive/MyDrive/Unsound/voxceleb/vox.zip'):
  !curl --user voxceleb1912:0s42xuw6 -o "/content/drive/MyDrive/Unsound/voxceleb/vox.zip" http://balthasar.tplinkdns.com/voxceleb/vox1a/vox1_test_wav.zip

if not os.path.isfile('/content/drive/MyDrive/Unsound/voxceleb/zippedFaces.tar.gz'):
  !curl -o "/content/drive/MyDrive/Unsound/voxceleb/zippedFaces.tar.gz" https://www.robots.ox.ac.uk/~vgg/research/CMBiometrics/data/zippedFaces.tar.gz

if not os.path.isfile('/content/drive/MyDrive/Unsound/voxceleb/vox1_meta.csv'):
  !curl -o "/content/drive/MyDrive/Unsound/voxceleb/vox1_meta.csv" https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/vox1_meta.csv

if not os.path.isfile('/content/drive/MyDrive/Unsound/voxceleb/vgg_face_dag.pth'):
  !curl -o '/content/drive/MyDrive/Unsound/voxceleb/vgg_face_dag.pth' https://www.robots.ox.ac.uk/~albanie/models/pytorch-mcn/vgg_face_dag.pth

!cp "/content/drive/MyDrive/Unsound/voxceleb/vox.zip" /content
!cp "/content/drive/MyDrive/Unsound/voxceleb/zippedFaces.tar.gz" /content
!cp "/content/drive/MyDrive/Unsound/voxceleb/vox1_meta.csv" /content
!cp '/content/drive/MyDrive/Unsound/voxceleb/vgg_face_dag.pth' /content

After that everything runs perfectly until I hit an error at the following cell:

check_path = None #"/content/drive/My Drive/Speech2Face/models/speech_encoder/adam_epoch_2.pth"

def _load_checkpoint(checkpoint_path):
  checkpoint = torch.load(checkpoint_path)
  global_ep = checkpoint["epoch"]
  model.load_state_dict(checkpoint["model_state_dict"])
  optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
  print("Loaded checkpoint, restart epoch is: ", global_ep)
  return global_ep + 1

global_ep = 0
if check_path is not None:
  global_ep = _load_checkpoint(check_path)

model.to(device)
# data = data.to(device)
fit(10, datal, model, vgg, optimizer, "Speech2Face Training", restart_epoch=global_ep)

This is the error:

VGG.training =  False
Speech2Face.training =  True
Starting training. Restart epoch: 0
train epoch: 0%
0/10 [00:00<?, ?it/s]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
batch [loss: None]: 0%
0/1625 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-18-1677fff5fac7> in <module>()
     15 model.to(device)
     16 # data = data.to(device)
---> 17 fit(10, datal, model, vgg, optimizer, "Speech2Face Training", restart_epoch=global_ep)

5 frames
<ipython-input-16-50ad598b7f62> in fit(epochs, train_dl, model, VGG, opt, tag, device, restart_epoch)
     85             output = model(wav)
     86 
---> 87             fvgg_s = VGG(output, True)
     88             fvgg_f = VGG(embedding, True)
     89 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

<ipython-input-6-ca5e7f74b2bf> in forward(self, x0, is_fc8)
     46 
     47     def forward(self, x0, is_fc8=False):
---> 48         x1 = self.conv1_1(x0)
     49         x2 = self.relu1_1(x1)
     50         x3 = self.conv1_2(x2)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py in forward(self, input)
    441 
    442     def forward(self, input: Tensor) -> Tensor:
--> 443         return self._conv_forward(input, self.weight, self.bias)
    444 
    445 class Conv3d(_ConvNd):

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
    438                             _pair(0), self.dilation, self.groups)
    439         return F.conv2d(input, weight, bias, self.stride,
--> 440                         self.padding, self.dilation, self.groups)
    441 
    442     def forward(self, input: Tensor) -> Tensor:

RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 3, 3, 3], but got 2-dimensional input of size [3, 4096] instead
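
For what it's worth, the error itself just means that a 2-D tensor reached VGG's first conv layer, which expects a 4-D image batch. A minimal standalone reproduction, independent of this repo's code:

import torch
import torch.nn as nn

# conv1_1 in VGG-Face has weight shape [64, 3, 3, 3], matching the error message
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

conv(torch.randn(1, 3, 224, 224))  # 4-D image batch [N, C, H, W]: fine
conv(torch.randn(3, 4096))         # 2-D tensor: raises the same kind of RuntimeError

So it looks as though VGG(output, True) is receiving the 4096-dimensional face features rather than images, which makes me suspect the is_fc8=True path is not bypassing the conv stack.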

Any help you can give would be much appreciated.

gpriamo commented 3 years ago

Hi Martin,

I'm Antonio's colleague; we developed this repository together for a university exam (which we took a year ago) and, unfortunately, we have not worked on it since. For this reason, both of us remember very little about this repo, and, for lack of time, we also left it in a rather messy state (sorry!).

I tried to squeeze my memory as much as I could to provide some help; these are a few tips I hope you find helpful:

Hopefully these tips will help you overcome some of the issues you may encounter while trying to train the models by yourself.

Good luck!

martindisley commented 2 years ago

Hi Giacomo,

Thanks for getting back to me. This is really helpful. I've got a couple of further questions:

  1. Looking at the pre-processing notebook, you've got the variable PICS_PER_ACTOR set to 40. In the decoder notebook you use face_features_10_per_actor.zip. Did you find you got better results with PICS_PER_ACTOR set to 10? Should I set it to that in the preprocessing notebook?
  2. Once we've trained the decoder do you think we can use the trained encoder from this repo to complete the pipeline?
gpriamo commented 2 years ago

Hi Martin,

  1. No, we used 10 because the models needed a significantly long time to train and we only had the base version of Colab (meaning very long training times and frequent interruptions while training), so we had to reduce the number of training samples. Sadly this produced unsatisfactory results, so my suggestion is: train the models with as many samples as possible and give it all the time it needs.
  2. If I remember correctly, we were aware of that repo while we were developing our models, so there may be some similarities, but I am not sure. It is possible that we did something different in terms of the transformations applied to the dataset.
martindisley commented 2 years ago

Hi Giacomo,

Thanks for your help. I think I managed to run the preprocessing notebook alright. I'm now looking at training the decoder.

Do you know what version of Python you were running for this? I got a conflict trying to install tensorflow-gpu=1.15 inside a Python 3.9 env. I created another 2.7 env and was able to install it fine, but ran into another error:

<ipython-input-4-af5e6643b47e> in __init__(self, data_list_path, size)
     25 class EmbedImagePairs(Dataset):
     26     def __init__(self, data_list_path, size=64):
---> 27         super().__init__()
     28         self.face_features = np.load(join(data_list_path,'facefeature.npy'))
     29         self.image_path = join(data_list_path, 'Faces')

TypeError: super() takes at least 1 argument (0 given)

According to this, super() requires at least one argument before Python 3.0. Do you know which version of Python >3.0 works with the required version of tensorflow-gpu?
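
For reference, the zero-argument super() is Python 3 syntax, while the explicit two-argument form from Python 2 works on both. A sketch of that fix, with the class body reconstructed from the traceback (assuming Dataset is torch.utils.data.Dataset):

import numpy as np
from os.path import join
from torch.utils.data import Dataset

class EmbedImagePairs(Dataset):
    def __init__(self, data_list_path, size=64):
        # Explicit arguments are valid in both Python 2 and 3;
        # the zero-argument form only works on Python 3.
        super(EmbedImagePairs, self).__init__()
        self.face_features = np.load(join(data_list_path, 'facefeature.npy'))
        self.image_path = join(data_list_path, 'Faces')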

Cheers!

gpriamo commented 2 years ago

Hi Martin,

I believe we used Python 3.7 (or even 3.6), but I am not certain.

Be aware that we developed this project during Summer 2020; hopefully that helps you narrow your search for the correct version.

martindisley commented 2 years ago

Hi Giacomo,

Thank you for your patience thus far. Got the decoder training! I've got a couple more questions if you've got time:

  1. How many epochs did you train it for/how many do you reckon I should aim for?
  2. Is vgg_face_dag.pth a pre-trained encoder? Or do I need to train a separate encoder?

Cheers

gpriamo commented 2 years ago

Hi Martin,

  1. Let me start by suggesting you read the paper for more accurate figures. Anyway, from our experience I can tell you it is not so much a matter of epochs as a matter of training samples: according to the notes I wrote, the authors used "1.7 million spectra–face feature pairs". In our case, we used only 45,000 (due to Colab storage and loading-time limits) and trained the voice encoder for 3 epochs but, as I've already told you, the results we got were not particularly good.
  2. Yes, that model is pre-trained; a loading sketch follows below.
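
A rough loading sketch (assuming the notebook's VGG class uses the same layer names as the vgg_face_dag export, so the file loads directly as a state dict):

import torch

vgg = VGG()  # the VGG class defined earlier in the notebook
state_dict = torch.load('vgg_face_dag.pth', map_location='cpu')
vgg.load_state_dict(state_dict)
vgg.eval()  # matches the "VGG.training = False" line in your training log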

Let us know how it goes!

martindisley commented 2 years ago

Hi Giacomo,

I'm trying to recreate the wav_filtered_20_per_actor dir that goes into Speech2Face_newDataset.ipynb here:

data = Speech2FaceDataset('wav_filtered_20_per_actor', 'Face_Feature','vox1_meta.csv')
data_test = Speech2FaceDataset('wav', 'Face_Feature','vox1_meta.csv')

It's the only thing I'm missing, and I hoped I'd be able to use dataset_filtering.py to generate it, but you refer to an undeclared variable main_dir (here) and it throws an error. I guess, from the following line, that it's the dir that lists the actors by name and sorts them accordingly. I thought about using the unzippedFaces dir instead, as it has the list of actors, but noticed it has far more sub-dirs than wav has, so I haven't used it, as I guess that would result in mislabelling. I presume this is because I've got the full faces dataset but only the test dataset for the speech files. I got the speech dataset using this line:

!curl --user voxceleb1912:0s42xuw6 -o "/content/drive/My Drive/Speech2Face/ff/vox.zip" http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip

Do you know what main_dir refers to? If not, do you think I should try to get the full speech dataset and then use the full face dataset as a reference for the actors' names? Did you use the full faces dataset or just the test set?

gpriamo commented 2 years ago

Hi Martin,

I am sorry, but I do not remember much about this part. We definitely used the VoxCeleb dataset, though I can't tell whether we used the full train dataset or the test one. The same goes for the faces dataset (either way, I suggest you try using the train sets to have as many training samples as possible, for the reasons I mentioned in my past comments).

As for that main_dir variable, it looks like it's only used to get the list of actors, since we then get the wav files using this line and the following one: you pass the path to the VoxCeleb dataset directory via the path_wavs parameter, and the script retrieves the wav files by itself. Therefore I suppose that if the unzippedFaces dir contains the list of actors' names/ids, that's the one you're looking for; see the sketch below.
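
Something like this is what I would expect it to need (untested sketch; the path is hypothetical, adjust it to wherever you extracted the archive):

import os

# Assumption: main_dir only has to supply the sorted list of actor names,
# and unzippedFaces contains one sub-directory per actor.
main_dir = '/content/unzippedFaces'
actors = sorted(os.listdir(main_dir))
print(len(actors), actors[:5])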

martindisley commented 2 years ago

Hi Giacomo,

Thanks for your help so far, I feel like I'm so close to having this working! I believe I have successfully trained both the encoder and the decoder but I've run into an error running the following cell:

# Quick testing -- actual test

enc.eval()
# dec.eval()
dec_w.eval() 

test_wav_path = "/home/martin/workspace/Unsound/voxceleb/wav_full/id10001/1zcIwhmdeo4/00001.wav"
test_wav = load_wav(test_wav_path)
test_wav = torch.tensor(test_wav).reshape(2,257,601).float().unsqueeze(0).to(device)

test_wav_path2 = "/home/martin/workspace/Unsound/voxceleb/wav_test/id10270/5r0dWxy17C8/00001.wav"
test_wav2 = load_wav(test_wav_path2)
test_wav2 = torch.tensor(test_wav2).reshape(2,257,601).float().unsqueeze(0).to(device)

print(torch.equal(test_wav, test_wav2))

#print(test_wav, test_wav2)

out = enc(test_wav)
decoded_w = dec_w(out)

out2 = enc(test_wav2)
decoded_w2 = dec_w(out2)

Here's the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_2678/3578311790.py in <module>
     22 
     23 
---> 24 out2 = enc(test_wav2)
     25 decoded_w2 = dec_w(out2)

~/anaconda3/envs/speech2face3.7-2/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/tmp/ipykernel_2678/904156634.py in forward(self, x)
     51         out = self.batch_norm7(self.relu(self.conv7(out)))
     52         out = self.batch_norm8(self.relu(self.conv8(out)))
---> 53         out = self.conv9(out)
     54         out = self.batch_norm9(self.relu(self.pooling5(out)))
     55 

~/anaconda3/envs/speech2face3.7-2/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~/anaconda3/envs/speech2face3.7-2/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
    441 
    442     def forward(self, input: Tensor) -> Tensor:
--> 443         return self._conv_forward(input, self.weight, self.bias)
    444 
    445 class Conv3d(_ConvNd):

~/anaconda3/envs/speech2face3.7-2/lib/python3.7/site-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
    438                             _pair(0), self.dilation, self.groups)
    439         return F.conv2d(input, weight, bias, self.stride,
--> 440                         self.padding, self.dilation, self.groups)
    441 
    442     def forward(self, input: Tensor) -> Tensor:

RuntimeError: Calculated padded input size per channel: (3 x 289). Kernel size: (4 x 4). Kernel size can't be greater than actual input size

I've read here that changing the kernel size in the encoder declaration might solve this, but my assumption is that I'd then need to retrain the encoder with this new architecture in order to test. Do you know of another solution?

Thanks again!

gpriamo commented 2 years ago

Hi Martin,

I am sorry, but I do not know how to help you with this part. It looks like the problem lies within the second sample (test_wav2), so you might try using another wav file.

Also, try having a look at this code to pre-process the audio files, if you haven't done so yet.

Kind regards,

martindisley commented 2 years ago

Hi Giacomo,

Thanks for your reply. I think I sorted it. I swapped the dimensions of the input tensor around. It's now:

test_wav = torch.tensor(test_wav).reshape(2,601,257).float().unsqueeze(0).to(device)

instead of:

test_wav = torch.tensor(test_wav).reshape(2,257,601).float().unsqueeze(0).to(device)

That seemed to work and I was able to produce a face. Unfortunately, our results haven't really improved on yours: it's still only producing an average of the dataset's faces.
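
In case it helps anyone else who hits the same RuntimeError, a quick sanity check before feeding the encoder (sketch; load_wav and test_wav_path as defined in the notebook above):

import torch

spec = load_wav(test_wav_path)
x = torch.tensor(spec).reshape(2, 601, 257).float().unsqueeze(0)
assert x.shape == (1, 2, 601, 257), x.shape  # [batch, 2 channels, 601, 257]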

I will continue to train both the encoder and decoder and see if the results improve. Happy to share the models with you if it starts to produce some interesting results. Thanks for all your help!

gpriamo commented 2 years ago

Hi Martin,

I'm really glad you managed to get this project working. Please keep us posted if you succeed in getting better results than those we got!

Good luck!