Open stark-akib opened 4 years ago
Hi @stark-akib, I don't have a recipe here; you should try and see for yourself. I would first try 64x64 resolution for the keypoint detector and dense motion (use scale_factor = 0.125 and 0.0625 for resolution 512 and 1024 respectively). If you see that the keypoints are not accurate, increase the resolution for the keypoint detector; if you see that the deformations need to be more precise, increase the resolution for dense motion.
Thank you @AliaksandrSiarohin for the direction. How can I increase the resolution for the Keypoint detector & Dense motion model?
Say I want to increase the Keypoint detector's resolution to 256x256: do I only change the scale_factor to 0.5 for 512x512 resolution input? Or do I need to change any other parameters, functions or files?
Yes just change scale_factor
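As a small sketch of the arithmetic behind this advice: scale_factor is the ratio between the resolution the keypoint detector or dense-motion network works at and the resolution of the input frames. The helper name below is hypothetical and not part of the repository.

```python
def scale_factor_for(input_resolution: int, working_resolution: int) -> float:
    """Return the scale_factor that downsamples input frames to working_resolution."""
    if working_resolution > input_resolution:
        raise ValueError("working resolution cannot exceed the input resolution")
    return working_resolution / input_resolution


# Values quoted in this thread.
print(scale_factor_for(512, 64))    # 0.125
print(scale_factor_for(1024, 64))   # 0.0625
print(scale_factor_for(512, 256))   # 0.5
```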
Great. Thank you.
Another quick question: I want to preprocess both the VoxCeleb1 and VoxCeleb2 datasets. As you have mentioned on the Video_preprocessing page,
Note: .png format takes approximately 300GB.
Does VoxCeleb1 require approx 300GB space for preprocessing? Then how much space will VoxCeleb2 require (as it has more data than VoxCeleb1) for preprocessing?
No idea, I never downloaded it entirely.
Okay. Thank you again.
Hello @AliaksandrSiarohin
I'm gonna start the training on VoxCeleb1 at 512x512. As you mentioned here I'm looking for a similar training time as I have 4 NVIDIA Tesla V100 GPUs.
Also, when should the training terminate? Are 1000 epochs enough (as stated in the YAML file)?
VoxCeleb in .png format is 300GB; 300GB x 4 is 1200GB. Intermediate files consume no more than a few GB. 1000 epochs? I guess it should be 100.
Great. I'll change the parameters accordingly. Thank you.
@AliaksandrSiarohin hi, for the VoxCeleb dataset, if I want to replace your KP_detector with an existing keypoint detector such as dlib, what should I do? I have no idea how to handle jacobian_map.
@AliaksandrSiarohin
Hello, Just giving an update on the Voxceleb1 preprocessing. The 512x512 preprocessing of VoxCeleb1 took around 870GB space for .png format. So, the required storage for training would be 900GB x 4 = 3.6 TB.
Why? Training doesn't need additional space. The x4 was an estimate of the space needed at 512x512, because a 512x512 image has roughly 4 times as many pixels as a 256x256 one.
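The x4 estimate can be sketched as a pixel-count ratio. This is only a back-of-the-envelope helper with a hypothetical name; it assumes the size on disk grows in proportion to the number of pixels, which .png compression does not guarantee (the 512x512 preprocessing reported in this thread came to around 870GB rather than 1200GB).

```python
def estimated_storage_gb(base_gb: float, base_resolution: int, target_resolution: int) -> float:
    """Scale a dataset size by the ratio of pixel counts between two square resolutions."""
    pixel_ratio = (target_resolution / base_resolution) ** 2
    return base_gb * pixel_ratio


# 300GB at 256x256, scaled to 512x512.
print(estimated_storage_gb(300, 256, 512))  # 1200.0
```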
Sorry, I mistook it as a parallel multiplier. The preprocessing I have performed contains 18,671 folders in "train" and 510 folders in "test" folder. The rest of the videos are either showing a broken link or skipped message in the console. I guess no additional space is needed to train then. Thank you.
I guess you may need to filter out low-resolution videos. To create vox-metadata.csv I used all the videos where the size of the bbox was greater than 256. You can infer the size of the bbox from the bbox parameter in vox-metadata.csv.
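A minimal sketch of such a filter, under a loudly stated assumption: the bbox value is taken to be a string of the form left-top-right-bottom. Check this against the actual vox-metadata.csv before relying on it. The sample rows and function names are hypothetical and only illustrate the idea.

```python
import csv
import io


def bbox_size(bbox: str) -> tuple:
    """Parse an ASSUMED 'left-top-right-bottom' string into (width, height)."""
    left, top, right, bottom = (int(value) for value in bbox.split("-"))
    return right - left, bottom - top


def filter_rows(rows, min_size: int):
    """Keep rows whose bounding box is larger than min_size in both dimensions."""
    kept = []
    for row in rows:
        width, height = bbox_size(row["bbox"])
        if width > min_size and height > min_size:
            kept.append(row)
    return kept


# Hypothetical rows for illustration only; the real metadata has more columns.
sample = io.StringIO(
    "video_id,bbox\n"
    "clip_a,100-50-700-650\n"
    "clip_b,10-10-200-210\n"
)
rows = list(csv.DictReader(sample))
kept = filter_rows(rows, min_size=512)
print([row["video_id"] for row in kept])  # ['clip_a']
```

For 512x512 training the threshold would presumably be raised from 256 to 512, as one of the later comments in this thread does.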
Thank you for the tip. I'll have a look.
@AliaksandrSiarohin
Just a quick question. What's the difference between "vox-adv-256.yaml" and "vox-256.yaml"?
For parameters such as use_kp: True and sn: True, what is the difference?
Also, what's the use of epoch_milestones: [60, 90]?
1) vox-adv is with adversarial loss. 2) use_kp adds keypoint heatmaps to the discriminator. 3) sn is spectral normalization. 4) epoch_milestones are the epochs at which the learning rate is dropped.
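To illustrate what dropping the learning rate at epoch_milestones means, here is a pure-Python sketch of a step schedule. The decay factor gamma = 0.1 and the base learning rate 2e-4 are assumptions for illustration, not values confirmed in this thread.

```python
def learning_rate_at(epoch: int, base_lr: float, milestones, gamma: float = 0.1) -> float:
    """Step schedule: multiply base_lr by gamma once for every milestone already reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


# With epoch_milestones: [60, 90] the rate drops twice during training.
for epoch in (0, 59, 60, 89, 90):
    print(epoch, learning_rate_at(epoch, base_lr=2e-4, milestones=[60, 90]))
```

If num_epochs is changed, the milestones probably need to move with it, otherwise the drops happen too early or never.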
Thank you. Which one would you suggest using as the config file, "vox-adv-256.yaml" or "vox-256.yaml"? (Considering that I will only change the frame_shape and scale factors for 512x512.)
Without adversarial it is more stable.
Thank you for your insight.
Hello @AliaksandrSiarohin ,
I've started the training using the following command.
CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --config config/vox-adv-512.yaml --device_ids 0,1,2,3
After 15 seconds, this error is occurring. Can you help me find the problem? I'm using batch size 40, but still getting the OOM error.
run.py:40: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
Use predefined train-test split.
Training...
  0%| | 0/150 [00:00<?, ?it/s]
/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/functional.py:2423: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode))
/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/functional.py:1332: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
  File "run.py", line 81, in <module>
    train(config, generator, discriminator, kp_detector, opt.checkpoint, log_dir, dataset, opt.device_ids)
  File "/home/ubuntu/Downloads/first-order-model/train.py", line 51, in train
    losses_generator, generated = generator_full(x)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/Downloads/first-order-model/modules/model.py", line 166, in forward
    x_vgg = self.vgg(pyramide_generated['prediction_' + str(scale)])
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/Downloads/first-order-model/modules/model.py", line 45, in forward
    h_relu2 = self.slice2(h_relu1)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.78 GiB total capacity; 14.45 GiB already allocated; 21.88 MiB free; 148.08 MiB cached)
40 is too large. Batch size should be approximately 4 times smaller than for 256, e.g. 16 or 12.
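A rough sketch of this rule of thumb, assuming memory use per sample grows with the number of pixels per frame. The function name and the base batch sizes in the example are illustrative, not values taken from the repository's configs.

```python
def scaled_batch_size(base_batch: int, base_resolution: int, target_resolution: int) -> int:
    """Shrink a batch size by the pixel-count ratio, keeping at least one sample."""
    pixel_ratio = (target_resolution / base_resolution) ** 2
    return max(1, int(base_batch / pixel_ratio))


# Illustrative base batch sizes at 256x256, scaled to 512x512.
print(scaled_batch_size(64, 256, 512))  # 16
print(scaled_batch_size(48, 256, 512))  # 12
```

The result is only a starting point; the batch size that actually fits depends on the GPU memory and on the scale_factor settings.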
Thank you. Lowering the batch size to 16 solved the issue. It now just shows those UserWarnings. What's the expected output in the console? Is it supposed to stay like this? There is no output in log.txt.
@AliaksandrSiarohin Checked again after leaving it for an hour, it's still stuck here. When closing by "Ctrl + C" the output of the console is like this.
Also, I've set num_epochs: 100, num_repeats: 50 and the batch size down to 12; still, the training is stuck here. So, are there any changes needed in the loss values or in model.py?
Probably just slow. You can try changing num_repeats to 1 to see. Also, you may want to start from the pretrained 256 checkpoint to accelerate convergence.
@AliaksandrSiarohin Thank you. Changing num_repeats to 1 seems to work. The log file is showing the loss. As your YAML files suggest,
For Voxceleb, 256x256: num_epochs: 100, num_repeats: 75
For Voxceleb adv, 256x256: num_epochs: 150, num_repeats: 75
Then what should num_epochs and num_repeats be for 512x512?
Depends on how much training you can afford. The more the better.
I can train for up to 5 days on my setup; what num_epochs and num_repeats would you suggest? (Considering the 4 NVIDIA Tesla V100 GPUs I've mentioned earlier, I can add another 4. So 8 V100 GPUs in total.)
Since you tried it you should know better how much time it takes.
Thank you. For num_epochs: 100 and num_repeats: 1, the training time is showing around 8 hours.
For num_epochs: 100 and num_repeats: 2, the training time is showing around 16 hours.
If I increase the num_epochs, and keep the num_repeats to 1, will that decrease the overall quality of the model?
The training has a heavy logging step at the end of each epoch. So you should increase num_repeats rather than num_epochs: the amount of training is the same, but the heavy logging will be less frequent.
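Using the timings reported above (100 epochs with 1 repeat in about 8 hours, with 2 repeats in about 16 hours), the num_repeats that fits a time budget can be estimated as below. This is a sketch that assumes training time grows linearly with num_epochs x num_repeats; the function name is hypothetical.

```python
import math


def max_num_repeats(budget_hours: float, num_epochs: int,
                    measured_hours: float, measured_epochs: int,
                    measured_repeats: int) -> int:
    """Largest num_repeats that fits the budget under a linear-time assumption."""
    # How many (epoch x repeat) units fit in the budget, based on one measured run.
    units_in_budget = budget_hours * measured_epochs * measured_repeats / measured_hours
    return math.floor(units_in_budget / num_epochs)


# 5 days of training at 100 epochs, based on the 8-hour measurement.
print(max_num_repeats(5 * 24, 100, measured_hours=8, measured_epochs=100, measured_repeats=1))  # 15
```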
Thank you. Much appreciated.
Hello @AliaksandrSiarohin, the initial training is done. Here are the loss values of the last 50 epochs. Have a look.
00000550) perceptual - 78.22953; equivariance_value - 0.13433; equivariance_jacobian - 0.29210
00000551) perceptual - 78.71490; equivariance_value - 0.13215; equivariance_jacobian - 0.29084
00000552) perceptual - 78.40482; equivariance_value - 0.13243; equivariance_jacobian - 0.29129
00000553) perceptual - 78.15618; equivariance_value - 0.13266; equivariance_jacobian - 0.28918
00000554) perceptual - 78.71937; equivariance_value - 0.13573; equivariance_jacobian - 0.29312
00000555) perceptual - 78.33852; equivariance_value - 0.13359; equivariance_jacobian - 0.29427
00000556) perceptual - 78.19413; equivariance_value - 0.13334; equivariance_jacobian - 0.29304
00000557) perceptual - 78.62805; equivariance_value - 0.13282; equivariance_jacobian - 0.29052
00000558) perceptual - 78.48531; equivariance_value - 0.13281; equivariance_jacobian - 0.29182
00000559) perceptual - 78.27769; equivariance_value - 0.13198; equivariance_jacobian - 0.29044
00000560) perceptual - 78.12610; equivariance_value - 0.13354; equivariance_jacobian - 0.29344
00000561) perceptual - 78.34229; equivariance_value - 0.13357; equivariance_jacobian - 0.29288
00000562) perceptual - 78.53149; equivariance_value - 0.13338; equivariance_jacobian - 0.29220
00000563) perceptual - 78.24954; equivariance_value - 0.13348; equivariance_jacobian - 0.29223
00000564) perceptual - 77.99688; equivariance_value - 0.13247; equivariance_jacobian - 0.28953
00000565) perceptual - 78.31819; equivariance_value - 0.13474; equivariance_jacobian - 0.29532
00000566) perceptual - 77.78032; equivariance_value - 0.13105; equivariance_jacobian - 0.29143
00000567) perceptual - 78.26097; equivariance_value - 0.13302; equivariance_jacobian - 0.29194
00000568) perceptual - 78.08060; equivariance_value - 0.13312; equivariance_jacobian - 0.29087
00000569) perceptual - 78.31612; equivariance_value - 0.13216; equivariance_jacobian - 0.29115
00000570) perceptual - 77.85737; equivariance_value - 0.13343; equivariance_jacobian - 0.29188
00000571) perceptual - 78.43906; equivariance_value - 0.13288; equivariance_jacobian - 0.28999
00000572) perceptual - 78.00404; equivariance_value - 0.13278; equivariance_jacobian - 0.29044
00000573) perceptual - 78.19481; equivariance_value - 0.13296; equivariance_jacobian - 0.28905
00000574) perceptual - 77.97575; equivariance_value - 0.13345; equivariance_jacobian - 0.29212
00000575) perceptual - 78.32430; equivariance_value - 0.13560; equivariance_jacobian - 0.30891
00000576) perceptual - 78.20422; equivariance_value - 0.13349; equivariance_jacobian - 0.29445
00000577) perceptual - 78.32574; equivariance_value - 0.13406; equivariance_jacobian - 0.29403
00000578) perceptual - 78.44213; equivariance_value - 0.12938; equivariance_jacobian - 0.28732
00000579) perceptual - 78.29219; equivariance_value - 0.13178; equivariance_jacobian - 0.28893
00000580) perceptual - 78.58249; equivariance_value - 0.13386; equivariance_jacobian - 0.29138
00000581) perceptual - 78.32773; equivariance_value - 0.13138; equivariance_jacobian - 0.29096
00000582) perceptual - 78.15102; equivariance_value - 0.13165; equivariance_jacobian - 0.29018
00000583) perceptual - 77.73994; equivariance_value - 0.13441; equivariance_jacobian - 0.29200
00000584) perceptual - 78.30541; equivariance_value - 0.13335; equivariance_jacobian - 0.29168
00000585) perceptual - 79.32146; equivariance_value - 0.13283; equivariance_jacobian - 0.29418
00000586) perceptual - 79.56306; equivariance_value - 0.13031; equivariance_jacobian - 0.29171
00000587) perceptual - 77.85537; equivariance_value - 0.13170; equivariance_jacobian - 0.29010
00000588) perceptual - 77.66492; equivariance_value - 0.13067; equivariance_jacobian - 0.28635
00000589) perceptual - 78.30695; equivariance_value - 0.12989; equivariance_jacobian - 0.28655
00000590) perceptual - 77.96008; equivariance_value - 0.13421; equivariance_jacobian - 0.29230
00000591) perceptual - 80.67488; equivariance_value - 0.17630; equivariance_jacobian - 0.36945
00000592) perceptual - 81.46362; equivariance_value - 0.20148; equivariance_jacobian - 0.40617
00000593) perceptual - 80.43700; equivariance_value - 0.18825; equivariance_jacobian - 0.37400
00000594) perceptual - 79.43629; equivariance_value - 0.18212; equivariance_jacobian - 0.35786
00000595) perceptual - 79.71420; equivariance_value - 0.22560; equivariance_jacobian - 0.36102
00000596) perceptual - 79.67820; equivariance_value - 0.19485; equivariance_jacobian - 0.34437
00000597) perceptual - 79.28871; equivariance_value - 0.18650; equivariance_jacobian - 0.34308
00000598) perceptual - 78.78998; equivariance_value - 0.18235; equivariance_jacobian - 0.33853
00000599) perceptual - 78.96037; equivariance_value - 0.17132; equivariance_jacobian - 0.33194
There are certain artifacts showing around the ears and also in the teeth area (sorry for the low quality of the conversion from mp4 to gif). You can access the actual video output from [here](https://drive.google.com/file/d/1Xjk9ajbsEc3OPEIhoDU-u1OYamgp9U3E/view?usp=sharing).
1. What should be the ideal value of perceptual, equivariance_value and equivariance_jacobian?
2. Will more training improve the current quality?
3. What would you suggest based on the loss values of the training epochs?
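To compare loss values across epochs, the log entries in the format posted above can be parsed with a short script. This is a sketch: the regular expression is written against the entries shown in this thread, and the function name is hypothetical. The sample reuses three entries from the log, around the jump at epoch 591.

```python
import re
from statistics import mean

# Matches one entry such as:
# 00000590) perceptual - 77.96008; equivariance_value - 0.13421; equivariance_jacobian - 0.29230
LOG_ENTRY = re.compile(
    r"(\d{8})\)\s*perceptual - ([\d.]+);\s*"
    r"equivariance_value - ([\d.]+);\s*"
    r"equivariance_jacobian - ([\d.]+)"
)


def parse_log(text: str):
    """Return a list of (epoch, perceptual, equivariance_value, equivariance_jacobian)."""
    return [
        (int(epoch), float(perceptual), float(eq_value), float(eq_jacobian))
        for epoch, perceptual, eq_value, eq_jacobian in LOG_ENTRY.findall(text)
    ]


sample = (
    "00000590) perceptual - 77.96008; equivariance_value - 0.13421; equivariance_jacobian - 0.29230 "
    "00000591) perceptual - 80.67488; equivariance_value - 0.17630; equivariance_jacobian - 0.36945 "
    "00000592) perceptual - 81.46362; equivariance_value - 0.20148; equivariance_jacobian - 0.40617"
)
entries = parse_log(sample)
print(len(entries))                   # 3
print(entries[0][0], entries[-1][0])  # 590 592
print(mean(entry[1] for entry in entries))  # mean perceptual loss over the sample
```

Because the pattern is applied with findall, it works whether the entries are separated by newlines or by spaces.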
Hi, the lower the better. For 256 I have higher values, so I don't know.
I see some weird warping artifacts near the ears, so you may need to increase the resolution of dense_motion to 128.
Thank you for having a look at the result. I've already increased the resolution of KP_detector and Dense motion to 128 from the beginning of the training. Should I increase it to 256 or 512 to reduce the artifacts? Can you please share the loss values of your 256 model checkpoint which is available in the google drive folder?
No, I guess a further increase will not help. I don't have these logs anymore. I remember that the perceptual loss was about 82.
You can also try to finetune with adversarial loss.
Thank you. I'll give a try on that.
Can you share your checkpoints/model weights once your training is successful? Thanks in advance. :)
@stark-akib hi, have you tried the 512 resolution, and did you resolve the teeth artifacts?
@stark-akib Hi, could you please share the config file for the 512 resolution training?
@stark-akib can you please share the checkpoints or the config file? thanks
Hey @AliaksandrSiarohin
I re-trained the net for 512. The script https://github.com/AliaksandrSiarohin/video-preprocessing/blob/master/crop_vox.py was giving many errors in my case and not working, so I just took https://github.com/AliaksandrSiarohin/video-preprocessing/blob/master/vox-metadata.csv and selected the videos >= 512x512. In this case, I had 5827 mp4 videos in the train folder and 166 mp4 videos in the test folder.
I performed 100 epochs with num_repeats = 20, and the result is not extremely good:
In the training, I increased the resolution of KP_detector and Dense motion to 256 (by using scale_factor 0.5). Do you think the cause of the flickering and of the artifacts is:
the losses at the last epoch are:
00000099) perceptual - 95.23073; equivariance_value - 0.12323; equivariance_jacobian - 0.33881
I guess the problem is high resolution for KP_detector and Dense motion, have you tried scale_factor: 0.125 and maybe even take a pretrained dense motion and KP_detector?
@AliaksandrSiarohin Oh, I used 0.5 because I read in this issue that you were suggesting to increase the resolution for the keypoint detector and dense motion. So you think 0.125 would help more? What do you mean by taking a pretrained dense motion and KP detector?
I don't know, you should try. Initialize them with the weights from my checkpoint.
If I try to start the training from your weights, it gives me the error:
RuntimeError: Error(s) in loading state_dict for OcclusionAwareGenerator:
size mismatch for dense_motion_network.down.weight: copying a param with shape torch.Size([3, 1, 13, 13]) from checkpoint, the shape in current model is torch.Size([3, 1, 29, 29]).
Probably because I am using scale_factor = 0.125, whereas you used 0.25 (you trained for resolution 256, while I am training for resolution 512)?
100 epochs, 20 num_repeats, scale_factor 0.125 for both dense motion and kp_detector and this is the result
Do you think the dataset is too small, @AliaksandrSiarohin? Or does training with mp4 give worse results? Or am I doing something else wrong? It doesn't even move the mouth or close the eyes.
Well, it is hard to say based on a single photo. Hard-set sigma in AntialiasingInterpolation and try with the pretrained weights.
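The size mismatch quoted earlier (13x13 in the checkpoint, 29x29 in the current model) is consistent with a Gaussian anti-aliasing kernel whose size is derived from scale_factor. The formula below is recalled from the repository's anti-aliasing downsampling module and should be treated as an assumption; it does reproduce both shapes from the error message. The function name is hypothetical.

```python
def antialias_kernel_size(scale: float) -> tuple:
    """Return (sigma, kernel_size) for a Gaussian anti-aliasing filter at the given scale.

    ASSUMED formula: sigma = (1 / scale - 1) / 2, kernel_size = 2 * round(4 * sigma) + 1.
    """
    sigma = (1 / scale - 1) / 2
    kernel_size = 2 * round(sigma * 4) + 1
    return sigma, kernel_size


print(antialias_kernel_size(0.25))   # (1.5, 13)
print(antialias_kernel_size(0.125))  # (3.5, 29)
```

Under this assumption, fixing sigma at 1.5 regardless of scale_factor would keep the 13x13 kernel, so the pretrained 256 weights could be loaded. That seems to be what hard-setting sigma refers to here.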
Hello @AliaksandrSiarohin . First of all, congratulations on the great work and thank you for sharing the repository.
I'm planning to train the model to generate higher resolution output (such as 512x512, 1024x1024). I would really appreciate your insight on my approach.
You mentioned here #14
Do I need to change this behavior for better motion transfer performance (while training on higher resolution)? How would you suggest doing it?
Looking forward to hearing from you. :)