Open martindisley opened 3 years ago
Hi Martin,
I'm Antonio's colleague. We developed this repository together for a university exam (which we took one year ago) and, unfortunately, we have not worked on it since. For this reason, both of us remember very little about this repo and, due to lack of time, we also left it in a very messy state (sorry!).
I tried to squeeze my memory as much as I could to provide some help. These are some tips I hope you find helpful:
Hopefully these tips will help you overcome some of the issues you may encounter while trying to train the models by yourself.
Good luck!
Hi Giacomo,
Thanks for getting back to me. This is really helpful. I've got a couple of further questions:
PICS_PER_ACTOR set to 40. In the decoder notebook you use face_features_10_per_actor.zip. Did you find you got better results with PICS_PER_ACTOR set to 10? Should I set it to that in the preprocessing notebook?
Hi Martin,
We set it to 10 because the models needed a significantly long time to train and we only had the base version of Colab to train them (meaning very long training times and frequent interruptions while training), so we had to reduce the number of training samples. Sadly, this produced unsatisfying results, so my suggestion is: train the models with as many samples as possible and give them all the time they need.
Hi Giacomo,
Thanks for your help. Think I managed to run the preprocessing notebook alright. I'm now looking at training the decoder.
Do you know what version of Python you were running for this? I got a conflict trying to install tensorflow-gpu=1.15 inside a Python 3.9 env. I created another 2.7 env and was able to install it fine but ran into another error:
<ipython-input-4-af5e6643b47e> in __init__(self, data_list_path, size)
25 class EmbedImagePairs(Dataset):
26 def __init__(self, data_list_path, size=64):
---> 27 super().__init__()
28 self.face_features = np.load(join(data_list_path,'facefeature.npy'))
29 self.image_path = join(data_list_path, 'Faces')
TypeError: super() takes at least 1 argument (0 given)
According to this, super() requires at least one argument in Python versions before 3.0. Do you know which Python 3 version works with the required version of tensorflow-gpu?
Cheers!
Hi Martin,
I believe we used Python 3.7 (or even 3.6), but I am not certain.
Be aware that we developed this project during Summer 2020. Hopefully this information can help you narrow your search for the correct version.
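For reference, the TypeError quoted above is what Python 2.7 raises for the zero-argument super() call, which only exists in Python 3. The following is a minimal sketch of the explicit two-argument form, which runs on both versions. The Dataset class here is a hypothetical stand-in for the real base class used in the notebook, and the attributes set in __init__ are illustrative only.

```python
class Dataset(object):
    """Hypothetical stand-in for the notebook's real Dataset base class."""

    def __init__(self):
        self.initialized = True


class EmbedImagePairs(Dataset):
    def __init__(self, data_list_path, size=64):
        # Explicit form: valid in Python 2.7 and in Python 3.
        # The bare super().__init__() used in the notebook is Python 3 only.
        super(EmbedImagePairs, self).__init__()
        self.data_list_path = data_list_path
        self.size = size


pairs = EmbedImagePairs("some/data/dir")  # illustrative path, not from the repo
```

Switching to a Python 3 environment, as suggested above, avoids the need for this change altogether.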
Hi Giacomo,
Thank you for your patience thus far. Got the decoder training! I've got a couple more questions if you've got time:
Is vgg_face_dag.pth a pre-trained encoder? Or do I need to train a separate encoder? Cheers
Hi Martin,
Let us know how it goes!
Hi Giacomo,
I'm trying to recreate the wav_filtered_20_per_actor dir that goes into Speech2Face_newDataset.ipynb here:
data = Speech2FaceDataset('wav_filtered_20_per_actor', 'Face_Feature','vox1_meta.csv')
data_test = Speech2FaceDataset('wav', 'Face_Feature','vox1_meta.csv')
It's the only thing I'm missing, and I hoped I'd be able to use dataset_filtering.py to get it, but you refer to the undeclared variable main_dir (here) and it throws an error. I guess, from the following line, that it's the dir that lists the actors by name and sorts them accordingly. I'd thought about using the unzippedFaces dir instead, as it has the list of actors, but noticed it has far more subdirectories than wav has, so I haven't used it, as I guess it would result in mislabelling. I presume this is because I've got the full faces dataset and only the test dataset for speech files. I got the speech dataset using this line:
!curl --user voxceleb1912:0s42xuw6 -o "/content/drive/My Drive/Speech2Face/ff/vox.zip" http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
Do you know what main_dir refers to? If not, do you think I should try to get the full speech dataset and then use the full face dataset as a reference for the actors' names? Did you use the full faces dataset or just the test set?
Hi Martin,
I am sorry, but I do not remember much about this part. We definitely used the vox_celeb dataset, but I can't tell whether we used the full train dataset or the test one. The same goes for the faces dataset (in any case, I suggest you try using the train ones to have as many training samples as possible, for the reasons I mentioned in my earlier comments).
As for that main_dir variable, it looks like it's only used to get the list of actors, since we then get the wav files using this line and the following one. That means you pass the path to the vox celeb dataset directory using the path_wavs parameter and the Python script then retrieves the wav files by itself. Therefore, I suppose that if the unzippedFaces dir contains a list of actors' names/ids, that's the one you're looking for.
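Since the faces directory was reported to contain more subdirectories than the wav directory, one way to avoid mislabelling is to keep only the actors present in both. This is a minimal sketch, not the logic of dataset_filtering.py: the function names are hypothetical, and it assumes both directories name their per-actor subdirectories with the same identifiers. If one uses names and the other uses ids, they would first need to be mapped to a common key.

```python
import os


def list_actor_dirs(root):
    """Return the sorted names of the immediate subdirectories of root."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


def common_actors(faces_root, wavs_root):
    """Return the sorted actor identifiers that appear under both roots."""
    faces = set(list_actor_dirs(faces_root))
    wavs = set(list_actor_dirs(wavs_root))
    return sorted(faces & wavs)
```

Calling common_actors with the unzippedFaces path and the wav path would give a list that could be used in place of the listing that main_dir was meant to provide, under the assumption stated above.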
Hi Giacomo,
Thanks for your help so far, I feel like I'm so close to having this working! I believe I have successfully trained both the encoder and the decoder but I've run into an error running the following cell:
# Quick testing -- actual test
enc.eval()
# dec.eval()
dec_w.eval()
test_wav_path = "/home/martin/workspace/Unsound/voxceleb/wav_full/id10001/1zcIwhmdeo4/00001.wav"
test_wav = load_wav(test_wav_path)
test_wav = torch.tensor(test_wav).reshape(2,257,601).float().unsqueeze(0).to(device)
test_wav_path2 = "/home/martin/workspace/Unsound/voxceleb/wav_test/id10270/5r0dWxy17C8/00001.wav"
test_wav2 = load_wav(test_wav_path2)
test_wav2 = torch.tensor(test_wav2).reshape(2,257,601).float().unsqueeze(0).to(device)
print(torch.equal(test_wav, test_wav2))
#print(test_wav, test_wav2)
out = enc(test_wav)
decoded_w = dec_w(out)
out2 = enc(test_wav2)
decoded_w2 = dec_w(out2)
Here's the error:
RuntimeError: Calculated padded input size per channel: (3 x 289). Kernel size: (4 x 4). Kernel size can't be greater than actual input size
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_2678/3578311790.py in <module>
22
23
---> 24 out2 = enc(test_wav2)
25 decoded_w2 = dec_w(out2)
~/anaconda3/envs/speech2face3.7-2/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
/tmp/ipykernel_2678/904156634.py in forward(self, x)
51 out = self.batch_norm7(self.relu(self.conv7(out)))
52 out = self.batch_norm8(self.relu(self.conv8(out)))
---> 53 out = self.conv9(out)
54 out = self.batch_norm9(self.relu(self.pooling5(out)))
55
~/anaconda3/envs/speech2face3.7-2/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/envs/speech2face3.7-2/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
441
442 def forward(self, input: Tensor) -> Tensor:
--> 443 return self._conv_forward(input, self.weight, self.bias)
444
445 class Conv3d(_ConvNd):
~/anaconda3/envs/speech2face3.7-2/lib/python3.7/site-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
438 _pair(0), self.dilation, self.groups)
439 return F.conv2d(input, weight, bias, self.stride,
--> 440 self.padding, self.dilation, self.groups)
441
442 def forward(self, input: Tensor) -> Tensor:
RuntimeError: Calculated padded input size per channel: (3 x 289). Kernel size: (4 x 4). Kernel size can't be greater than actual input size
I've read here that changing the kernel size in the encoder declaration might solve this, but my assumption is that I'd then need to retrain the encoder with the new kernel size before I could test. Do you know of another solution?
Thanks again!
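To make the error message above easier to read: a convolution needs its (padded) input to be at least as large as its kernel along every axis. Here one axis had shrunk to 3 by the time it reached conv9, which is smaller than the kernel size of 4. The sketch below applies the standard output-size formula for a convolution along one axis. The function name is hypothetical, and the stride, padding and dilation of the real conv9 layer are not shown in the thread, so the defaults are assumptions.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one axis.

    Raises ValueError when the kernel does not fit in the padded input,
    which is the condition the RuntimeError above reports.
    """
    padded = input_size + 2 * padding
    effective_kernel = dilation * (kernel_size - 1) + 1
    if effective_kernel > padded:
        raise ValueError(
            "kernel size %d is greater than padded input size %d"
            % (effective_kernel, padded)
        )
    return (padded - effective_kernel) // stride + 1


# Sizes taken from the error message: padded input (3 x 289), kernel (4 x 4).
width_out = conv_output_size(289, 4)  # the 289 axis is large enough
try:
    conv_output_size(3, 4)  # the axis of size 3 is too small
    height_fits = True
except ValueError:
    height_fits = False
```

Running the same function layer by layer on the two input axes (257 and 601) shows which axis collapses first, which can help decide whether the input orientation or the kernel size is the thing to change.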
Hi Martin,
I am sorry, but I do not know how to help you with this part. It looks like the problem lies in the second sample (test_wav2), so you might try using another wav file.
Also, try having a look at this code to pre-process the audio files, if you haven't done so yet.
Kind regards,
Hi Giacomo,
Thanks for your reply. I think I sorted it. I swapped the dimensions of the input tensor around. It's now:
test_wav = torch.tensor(test_wav).reshape(2,601,257).float().unsqueeze(0).to(device)
instead of:
test_wav = torch.tensor(test_wav).reshape(2,257,601).float().unsqueeze(0).to(device)
That seemed to work and I was able to produce a face. Unfortunately, our results haven't really improved upon yours. It's still only producing an average of the dataset.
I will continue to train both the encoder and decoder and see if the results improve. Happy to share the models with you if it starts to produce some interesting results. Thanks for all your help!
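One point to keep in mind with the fix above: reshape keeps the elements in their original order and only regroups them, so it is not the same as swapping two axes (which in PyTorch would be Tensor.permute or Tensor.transpose). Whether that matters here depends on the layout that load_wav returns, which the thread does not show. The sketch below illustrates the difference on a tiny 2 x 3 example. It uses plain Python lists as stand-ins for tensors so that it has no dependencies, and the helper names are hypothetical.

```python
def reshape_2d(flat, rows, cols):
    """Regroup a flat list into rows x cols, keeping element order (row-major)."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def transpose_2d(matrix):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


original = reshape_2d(list(range(6)), 2, 3)   # 2 x 3
flat = [value for row in original for value in row]

reshaped = reshape_2d(flat, 3, 2)             # 3 x 2, same element order
transposed = transpose_2d(original)           # 3 x 2, axes swapped

print(original)    # [[0, 1, 2], [3, 4, 5]]
print(reshaped)    # [[0, 1], [2, 3], [4, 5]]
print(transposed)  # [[0, 3], [1, 4], [2, 5]]
```

Both results have the same shape, but only the transposed one keeps each original column together. If the spectrogram axes need to be swapped rather than regrouped, checking the shape of the array returned by load_wav before reshaping would show which of the two operations is appropriate.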
Hi Martin,
I'm really glad you managed to get this project working. Please keep us posted if you succeed in getting better results than those we got!
Good luck!
Hi Antonio,
I've been trying to reproduce your work on Google Colab. I've downloaded all the datasets by adding the following cell:
After that everything runs perfectly until I hit an error at the following cell:
This is the error:
Any help you can give would be much appreciated.