AliaksandrSiarohin / first-order-model

This repository contains the source code for the paper "First Order Motion Model for Image Animation".
https://aliaksandrsiarohin.github.io/first-order-model-website/

How to change KP_detector and dense_motion parameters to train on Higher resolution? #81

Open stark-akib opened 4 years ago

stark-akib commented 4 years ago

Hello @AliaksandrSiarohin . First of all, congratulations on the great work and thank you for sharing the repository.

I'm planning to train the model to generate higher resolution output (such as 512x512, 1024x1024). I would really appreciate your insight on my approach.

You mentioned here #14

Currently keypoint detector and dense-motion net operate on 64x64 images

Do I need to change this behavior for better motion transfer performance (while training on higher resolution)? How would you suggest doing it?

Looking forward to hearing from you. :)

AliaksandrSiarohin commented 4 years ago

Hi @stark-akib, I don't have a recipe here; you should try and see for yourself. I would first try keeping the 64x64 resolution for the keypoint detector and dense motion (use scale_factor = 0.125 for resolution 512 and 0.0625 for resolution 1024). If you see that the keypoints are not accurate, increase the resolution for the keypoint detector; if you see that the deformations need to be more precise, increase the resolution for dense motion.
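For reference, assuming the layout of the stock config/vox-256.yaml, the keys involved would look roughly like this (a sketch of a hypothetical vox-512.yaml, not a tested config):

```yaml
# Sketch: hypothetical config/vox-512.yaml derived from the stock vox-256.yaml
dataset_params:
  frame_shape: [512, 512, 3]   # train on 512x512 frames

model_params:
  kp_detector_params:
    scale_factor: 0.125        # 512 * 0.125 = 64: keypoints still estimated at 64x64
  generator_params:
    dense_motion_params:
      scale_factor: 0.125      # same 64x64 working resolution for dense motion
```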

stark-akib commented 4 years ago

Thank you @AliaksandrSiarohin for the direction. How can I increase the resolution for the Keypoint detector & Dense motion model?

Say I want to increase the keypoint detector's resolution to 256x256: do I only change the scale_factor to 0.5 for 512x512 input, or do I need to change any other parameters, functions, or files?

AliaksandrSiarohin commented 4 years ago

Yes, just change scale_factor.

stark-akib commented 4 years ago

Great. Thank you.

Another quick question: I want to preprocess both the VoxCeleb1 and VoxCeleb2 datasets. As you mentioned on the video preprocessing page,

Note: .png format takes approximately 300GB.

Does VoxCeleb1 require approximately 300GB of space for preprocessing? And how much space will VoxCeleb2 require (as it has more data than VoxCeleb1)?

AliaksandrSiarohin commented 4 years ago

No idea, I never downloaded it entirely.

stark-akib commented 4 years ago

Okay. Thank you again.

stark-akib commented 4 years ago

Hello @AliaksandrSiarohin

I'm going to start training on VoxCeleb1 at 512x512. As you mentioned here, I'm expecting a similar training time since I have 4 NVIDIA Tesla V100 GPUs.

  1. Can you help me specify how much storage will be needed to complete the training process?
  2. Will 1-2TB of storage suffice (considering the intermediate files generated while training)?

Also, when should the training terminate? Are 1000 epochs enough (as stated in the YAML file)?

AliaksandrSiarohin commented 4 years ago

VoxCeleb in png format is 300GB; 300GB x 4 is 1200GB. Intermediate files consume no more than a few GB. 1000 epochs? I guess it should be 100.

stark-akib commented 4 years ago

Great. I'll change the parameters accordingly. Thank you.

newExplore-hash commented 4 years ago

@AliaksandrSiarohin Hi, for the VoxCeleb dataset, if I want to replace your KP_detector with an existing keypoint detector such as dlib, what should I do? I have no idea how to handle the jacobian_map.

stark-akib commented 4 years ago

@AliaksandrSiarohin

Hello, just giving an update on the VoxCeleb1 preprocessing. The 512x512 preprocessing of VoxCeleb1 took around 870GB of space in .png format. So the required storage for training would be 900GB x 4 = 3.6TB.

AliaksandrSiarohin commented 4 years ago

Why? Training doesn't need additional space. The x4 was an estimate of the 512x512 space occupancy, because a 512 image is roughly 4 times larger.

stark-akib commented 4 years ago

Sorry, I mistook it for a parallel multiplier. The preprocessing I performed produced 18,671 folders in "train" and 510 folders in "test". The rest of the videos either showed a broken link or a skipped message in the console. I guess no additional space is needed to train, then. Thank you.

AliaksandrSiarohin commented 4 years ago

I guess you may need to filter out low-resolution videos. To create vox-metadata.csv I used all the videos where the size of the bbox was greater than 256. You can infer the size of the bbox from the bbox parameter in vox-metadata.csv.
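A rough sketch of that filtering step is below. The parsing assumes the bbox column stores four hyphen-separated coordinates (left-top-right-bottom), which may not match your copy of the CSV, so check the actual encoding first:

```python
import pandas as pd

# Sketch: keep only videos whose face bounding box is at least 512 px on its
# shorter side. The "left-top-right-bottom" encoding of the bbox column is an
# assumption; adjust the parsing to match your copy of vox-metadata.csv.
df = pd.read_csv("vox-metadata.csv")

def bbox_side(bbox_str: str) -> int:
    left, top, right, bottom = (int(v) for v in str(bbox_str).split("-"))
    return min(right - left, bottom - top)

filtered = df[df["bbox"].apply(bbox_side) >= 512]
filtered.to_csv("vox-metadata-512.csv", index=False)
print(f"kept {len(filtered)} of {len(df)} entries")
```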

stark-akib commented 4 years ago

Thank you for the tip. I'll have a look.

stark-akib commented 4 years ago

@AliaksandrSiarohin Just a quick question: what's the difference between "vox-adv-256.yaml" and "vox-256.yaml"? And what do parameters such as use_kp: True and sn: True do?

Also, what's the use of epoch_milestones: [60, 90]?

AliaksandrSiarohin commented 4 years ago

1. vox-adv is the config with adversarial loss.
2. use_kp adds keypoint heatmaps to the discriminator input.
3. sn is spectral normalization.
4. epoch_milestones are the epochs at which the learning rate is dropped.
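For reference, these map to config keys roughly as follows (a sketch assuming the vox-adv-256.yaml layout; verify the exact names and values in your copy):

```yaml
model_params:
  discriminator_params:
    use_kp: True               # feed keypoint heatmaps to the discriminator
    sn: True                   # spectral normalization in the discriminator layers

train_params:
  epoch_milestones: [60, 90]   # epochs at which the learning rate is dropped
  loss_weights:
    generator_gan: 1           # adversarial weight; 0 in the non-adversarial vox-256.yaml
```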

stark-akib commented 4 years ago

Thank you. Which one would you suggest using as the config file, "vox-adv-256.yaml" or "vox-256.yaml"? (Considering that I will only change the frame_shape and scale factors for 512x512.)

AliaksandrSiarohin commented 4 years ago

Without the adversarial loss it is more stable.

stark-akib commented 4 years ago

Thank you for your insight.

stark-akib commented 4 years ago

Hello @AliaksandrSiarohin ,

I've started the training using the following command: CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --config config/vox-adv-512.yaml --device_ids 0,1,2,3

After about 15 seconds, the error below occurs. Can you help me find the problem? I'm using batch size 40 but still getting the OOM error.

```
run.py:40: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
Use predefined train-test split.
Training...
  0%|          | 0/150 [00:00<?, ?it/s]
/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/functional.py:2423: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode))
/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/functional.py:1332: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
  File "run.py", line 81, in <module>
    train(config, generator, discriminator, kp_detector, opt.checkpoint, log_dir, dataset, opt.device_ids)
  File "/home/ubuntu/Downloads/first-order-model/train.py", line 51, in train
    losses_generator, generated = generator_full(x)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/Downloads/first-order-model/modules/model.py", line 166, in forward
    x_vgg = self.vgg(pyramide_generated['prediction_' + str(scale)])
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/Downloads/first-order-model/modules/model.py", line 45, in forward
    h_relu2 = self.slice2(h_relu1)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/alethea/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.78 GiB total capacity; 14.45 GiB already allocated; 21.88 MiB free; 148.08 MiB cached)
```

AliaksandrSiarohin commented 4 years ago

40 is too large. Batch size should be approximately 4 times smaller than for 256, e.g. 16 or 12.
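In config terms this is just the train_params entry (a sketch assuming the stock layout):

```yaml
train_params:
  batch_size: 12   # roughly 4x smaller than the 256x256 setting, to fit 512x512 frames on 16GB GPUs
```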

stark-akib commented 4 years ago

Thank you. Lowering the batch size to 16 solved the issue; it's now just showing those UserWarnings. What is the expected console output? Is it supposed to stay like this? There is no output in log.txt.

[screenshot: state1]

stark-akib commented 4 years ago

@AliaksandrSiarohin I checked again after leaving it for an hour, and it's still stuck here. When closing with Ctrl+C, the console output looks like this:

[screenshot: state2]

Also, I've set num_epochs: 100 and num_repeats: 50 and lowered the batch size to 12, but the training is still stuck here. Are any changes needed in the loss values or in model.py?

AliaksandrSiarohin commented 4 years ago

Probably it is just slow. You can try changing num_repeats to 1 to see. Also, you may want to start from the pretrained 256 checkpoint to accelerate convergence.
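For reference, resuming from the released 256 checkpoint would look something like the command below (run.py exposes a --checkpoint argument for restoring weights; the config name and checkpoint path here are placeholders for your own files). Note that with a different scale_factor the antialiasing kernels change shape, so a direct load can fail with a size mismatch (see further down in this thread) and the weights may then need to be loaded partially instead.

```bash
# Sketch: start 512x512 training from the released 256x256 checkpoint
CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --config config/vox-512.yaml \
    --checkpoint checkpoints/vox-cpk.pth.tar --device_ids 0,1,2,3
```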

stark-akib commented 4 years ago

@AliaksandrSiarohin Thank you. Changing num_repeats to 1 seems to work; the log file is now showing losses. Your YAML files suggest, for VoxCeleb 256x256, num_epochs: 100 and num_repeats: 75, and for VoxCeleb adv 256x256, num_epochs: 150 and num_repeats: 75. What should num_epochs and num_repeats be for 512x512?

AliaksandrSiarohin commented 4 years ago

Depends on how much training you can afford. The more the better.

stark-akib commented 4 years ago

I can train for up to 5 days on my setup; what num_epochs and num_repeats would you suggest? (Besides the 4 NVIDIA Tesla V100 GPUs I mentioned earlier, I can add another 4, for a total of 8 V100 GPUs.)

AliaksandrSiarohin commented 4 years ago

Since you have tried it, you should know better than me how much time it takes.

stark-akib commented 4 years ago

Thank you. For num_epochs: 100 and num_repeats: 1, the estimated training time is around 8 hours; for num_epochs: 100 and num_repeats: 2, it is around 16 hours. If I increase num_epochs and keep num_repeats at 1, will that decrease the overall quality of the model?

AliaksandrSiarohin commented 4 years ago

The training has a heavy logging step at the end of each epoch, so you should increase num_repeats instead: the amount of training is the same, but the heavy logging runs less frequently.
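For example, assuming the stock train_params layout, the following keeps the total amount of training the same while logging far less often (a sketch, not a recommended schedule):

```yaml
train_params:
  # 20 epochs x 5 repeats = 100 passes over the data, the same as
  # num_epochs: 100 with num_repeats: 1, but the heavy end-of-epoch
  # logging runs only 20 times instead of 100.
  num_epochs: 20
  num_repeats: 5
  # epoch_milestones are counted in epochs, so rescale them accordingly,
  # e.g. [12, 18] instead of [60, 90].
```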

stark-akib commented 4 years ago

Thank you. Much appreciated.

stark-akib commented 4 years ago

Hello @AliaksandrSiarohin. The initial training is done. Here are the loss values of the last 50 epochs; have a look.

```
00000550) perceptual - 78.22953; equivariance_value - 0.13433; equivariance_jacobian - 0.29210
00000551) perceptual - 78.71490; equivariance_value - 0.13215; equivariance_jacobian - 0.29084
00000552) perceptual - 78.40482; equivariance_value - 0.13243; equivariance_jacobian - 0.29129
00000553) perceptual - 78.15618; equivariance_value - 0.13266; equivariance_jacobian - 0.28918
00000554) perceptual - 78.71937; equivariance_value - 0.13573; equivariance_jacobian - 0.29312
00000555) perceptual - 78.33852; equivariance_value - 0.13359; equivariance_jacobian - 0.29427
00000556) perceptual - 78.19413; equivariance_value - 0.13334; equivariance_jacobian - 0.29304
00000557) perceptual - 78.62805; equivariance_value - 0.13282; equivariance_jacobian - 0.29052
00000558) perceptual - 78.48531; equivariance_value - 0.13281; equivariance_jacobian - 0.29182
00000559) perceptual - 78.27769; equivariance_value - 0.13198; equivariance_jacobian - 0.29044
00000560) perceptual - 78.12610; equivariance_value - 0.13354; equivariance_jacobian - 0.29344
00000561) perceptual - 78.34229; equivariance_value - 0.13357; equivariance_jacobian - 0.29288
00000562) perceptual - 78.53149; equivariance_value - 0.13338; equivariance_jacobian - 0.29220
00000563) perceptual - 78.24954; equivariance_value - 0.13348; equivariance_jacobian - 0.29223
00000564) perceptual - 77.99688; equivariance_value - 0.13247; equivariance_jacobian - 0.28953
00000565) perceptual - 78.31819; equivariance_value - 0.13474; equivariance_jacobian - 0.29532
00000566) perceptual - 77.78032; equivariance_value - 0.13105; equivariance_jacobian - 0.29143
00000567) perceptual - 78.26097; equivariance_value - 0.13302; equivariance_jacobian - 0.29194
00000568) perceptual - 78.08060; equivariance_value - 0.13312; equivariance_jacobian - 0.29087
00000569) perceptual - 78.31612; equivariance_value - 0.13216; equivariance_jacobian - 0.29115
00000570) perceptual - 77.85737; equivariance_value - 0.13343; equivariance_jacobian - 0.29188
00000571) perceptual - 78.43906; equivariance_value - 0.13288; equivariance_jacobian - 0.28999
00000572) perceptual - 78.00404; equivariance_value - 0.13278; equivariance_jacobian - 0.29044
00000573) perceptual - 78.19481; equivariance_value - 0.13296; equivariance_jacobian - 0.28905
00000574) perceptual - 77.97575; equivariance_value - 0.13345; equivariance_jacobian - 0.29212
00000575) perceptual - 78.32430; equivariance_value - 0.13560; equivariance_jacobian - 0.30891
00000576) perceptual - 78.20422; equivariance_value - 0.13349; equivariance_jacobian - 0.29445
00000577) perceptual - 78.32574; equivariance_value - 0.13406; equivariance_jacobian - 0.29403
00000578) perceptual - 78.44213; equivariance_value - 0.12938; equivariance_jacobian - 0.28732
00000579) perceptual - 78.29219; equivariance_value - 0.13178; equivariance_jacobian - 0.28893
00000580) perceptual - 78.58249; equivariance_value - 0.13386; equivariance_jacobian - 0.29138
00000581) perceptual - 78.32773; equivariance_value - 0.13138; equivariance_jacobian - 0.29096
00000582) perceptual - 78.15102; equivariance_value - 0.13165; equivariance_jacobian - 0.29018
00000583) perceptual - 77.73994; equivariance_value - 0.13441; equivariance_jacobian - 0.29200
00000584) perceptual - 78.30541; equivariance_value - 0.13335; equivariance_jacobian - 0.29168
00000585) perceptual - 79.32146; equivariance_value - 0.13283; equivariance_jacobian - 0.29418
00000586) perceptual - 79.56306; equivariance_value - 0.13031; equivariance_jacobian - 0.29171
00000587) perceptual - 77.85537; equivariance_value - 0.13170; equivariance_jacobian - 0.29010
00000588) perceptual - 77.66492; equivariance_value - 0.13067; equivariance_jacobian - 0.28635
00000589) perceptual - 78.30695; equivariance_value - 0.12989; equivariance_jacobian - 0.28655
00000590) perceptual - 77.96008; equivariance_value - 0.13421; equivariance_jacobian - 0.29230
00000591) perceptual - 80.67488; equivariance_value - 0.17630; equivariance_jacobian - 0.36945
00000592) perceptual - 81.46362; equivariance_value - 0.20148; equivariance_jacobian - 0.40617
00000593) perceptual - 80.43700; equivariance_value - 0.18825; equivariance_jacobian - 0.37400
00000594) perceptual - 79.43629; equivariance_value - 0.18212; equivariance_jacobian - 0.35786
00000595) perceptual - 79.71420; equivariance_value - 0.22560; equivariance_jacobian - 0.36102
00000596) perceptual - 79.67820; equivariance_value - 0.19485; equivariance_jacobian - 0.34437
00000597) perceptual - 79.28871; equivariance_value - 0.18650; equivariance_jacobian - 0.34308
00000598) perceptual - 78.78998; equivariance_value - 0.18235; equivariance_jacobian - 0.33853
00000599) perceptual - 78.96037; equivariance_value - 0.17132; equivariance_jacobian - 0.33194
```

[animation: out2]

There are certain artifacts around the ears and also in the teeth area (sorry for the low quality of the mp4-to-gif conversion). You can access the actual video output [here](https://drive.google.com/file/d/1Xjk9ajbsEc3OPEIhoDU-u1OYamgp9U3E/view?usp=sharing).

1. What should the ideal values of perceptual, equivariance_value, and equivariance_jacobian be?
2. Will more training improve the current quality?
3. What would you suggest based on the loss values of the training epochs?

AliaksandrSiarohin commented 4 years ago

Hi, the lower the better. For 256 I had higher values, so I don't know.

AliaksandrSiarohin commented 4 years ago

I see some weird warping artifacts near the ears, so you may need to increase the resolution of dense_motion to 128.
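For 512x512 input that corresponds to scale_factor: 0.25 for dense motion (a sketch assuming the stock key layout):

```yaml
model_params:
  generator_params:
    dense_motion_params:
      scale_factor: 0.25   # 512 * 0.25 = 128: dense motion estimated at 128x128
  kp_detector_params:
    scale_factor: 0.125    # keypoint detector can stay at 64x64
```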

stark-akib commented 4 years ago

Thank you for having a look at the result. I've already increased the resolution of KP_detector and Dense motion to 128 from the beginning of the training. Should I increase it to 256 or 512 to reduce the artifacts? Can you please share the loss values of your 256 model checkpoint which is available in the google drive folder?

AliaksandrSiarohin commented 4 years ago

No, I guess a further increase will not help. I don't have these logs anymore; I remember that the perceptual loss was about 82.

AliaksandrSiarohin commented 4 years ago

You can also try to finetune with adversarial loss.

stark-akib commented 4 years ago

Thank you. I'll give that a try.

tgohblio commented 4 years ago

> Thank you. I'll give that a try.

Can you share your checkpoints/model weights once your training is successful? Thanks in advance. :)

Erdos001 commented 4 years ago

> Thank you. I'll give that a try.

@stark-akib Hi, have you tried the 512 resolution, and did you resolve the teeth artifacts?

pisutonc commented 4 years ago

@stark-akib Hi, could you please share the config file for the 512 resolution training?

MitalPattani commented 3 years ago

@stark-akib Can you please share the checkpoints or the config file? Thanks.

alessiapacca commented 3 years ago

Hey @AliaksandrSiarohin

I re-trained the net for 512. The script https://github.com/AliaksandrSiarohin/video-preprocessing/blob/master/crop_vox.py was giving many errors in my case and not working, so I just took https://github.com/AliaksandrSiarohin/video-preprocessing/blob/master/vox-metadata.csv and selected the videos >= 512x512. That left me with 5827 mp4 videos in the train folder and 166 mp4 videos in the test folder.

I trained for 100 epochs with num_repeats: 20, and the result is not very good:

[animation: biden]

[animation: biden2]

During training, I increased the resolution of KP_detector and dense motion to 256 (by using scale_factor: 0.5). What do you think is causing the flickering and the artifacts?

The losses at the last epoch are: 00000099) perceptual - 95.23073; equivariance_value - 0.12323; equivariance_jacobian - 0.33881

AliaksandrSiarohin commented 3 years ago

I guess the problem is the high resolution for KP_detector and dense motion. Have you tried scale_factor: 0.125, and maybe even taking a pretrained dense motion and KP_detector?

alessiapacca commented 3 years ago

@AliaksandrSiarohin Oh, I used 0.5 because I read in this issue that you were suggesting increasing the resolution for the keypoint detector and dense motion. So you think 0.125 would help more? What do you mean by taking a pretrained dense motion and KP detector?

AliaksandrSiarohin commented 3 years ago

I don't know, you should try. Initialize them with the weights from my checkpoint.
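A minimal sketch of such an initialization is below. It assumes the released vox-cpk.pth.tar stores the sub-models under the keys used by Logger.load_cpk ('kp_detector', 'generator'), if I read the code right; only tensors whose names and shapes match are copied, which skips the fixed antialiasing kernel that differs when scale_factor changes.

```python
import torch

def load_matching(module, pretrained_state):
    """Copy only the tensors whose names and shapes match the current module.

    This deliberately skips entries such as the antialiasing 'down.weight'
    kernel, whose size depends on scale_factor, so a 256-trained checkpoint
    can still seed a 512 model. A sketch, not the repository's own loader.
    """
    own_state = module.state_dict()
    kept, skipped = 0, 0
    with torch.no_grad():
        for name, tensor in pretrained_state.items():
            if name in own_state and own_state[name].shape == tensor.shape:
                own_state[name].copy_(tensor)
                kept += 1
            else:
                skipped += 1
    print(f"copied {kept} tensors, skipped {skipped}")

# Usage (kp_detector / generator built from your 512 config, as in run.py;
# the checkpoint keys follow my reading of Logger.load_cpk -- verify them):
# checkpoint = torch.load("vox-cpk.pth.tar", map_location="cpu")
# load_matching(kp_detector, checkpoint["kp_detector"])
# load_matching(generator, checkpoint["generator"])
```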

alessiapacca commented 3 years ago

If I try to start the training from your weights, it gives me the error:

RuntimeError: Error(s) in loading state_dict for OcclusionAwareGenerator:
        size mismatch for dense_motion_network.down.weight: copying a param with shape torch.Size([3, 1, 13, 13]) from checkpoint, the shape in current model is torch.Size([3, 1, 29, 29]).

Probably because I am using scale_factor = 0.125 whereas you used 0.25 (since you trained at resolution 256 and I am training at resolution 512)?

alessiapacca commented 3 years ago

100 epochs, num_repeats: 20, scale_factor: 0.125 for both dense motion and kp_detector, and this is the result:

@AliaksandrSiarohin Do you think the dataset is too small? Or does training with mp4 give worse results? Or am I doing something else wrong? It doesn't even move the mouth or close the eyes.

[animation: result]

AliaksandrSiarohin commented 3 years ago

Well, it's hard to say based on a single photo. Hardset sigma in AntialiasingInterpolation and try with the pretrained checkpoint.
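For the record, a sketch of what "hardset sigma" could mean here. As far as I can tell, the antialiasing module (AntialiasInterpolation2d in modules/util.py) derives its Gaussian kernel from the scale, roughly sigma = (1/scale - 1)/2 and kernel_size = 2*round(4*sigma) + 1, which gives a 13x13 kernel for scale 0.25 but 29x29 for 0.125, hence the size mismatch reported above. Pinning the kernel to the 256-model values makes the pretrained down.weight shape match:

```python
import torch

# Hypothetical illustration of pinning the antialiasing kernel to the values the
# released 256 checkpoint was trained with (sigma = 1.5, kernel_size = 13), even
# when scale_factor is 0.125. The real change would go inside
# AntialiasInterpolation2d.__init__ in modules/util.py; names here are assumptions.

def antialias_kernel(channels: int, kernel_size: int = 13, sigma: float = 1.5) -> torch.Tensor:
    coords = torch.arange(kernel_size, dtype=torch.float32)
    mean = (kernel_size - 1) / 2
    g1d = torch.exp(-((coords - mean) ** 2) / (2 * sigma ** 2))
    kernel = g1d[:, None] * g1d[None, :]   # separable 2D Gaussian
    kernel = kernel / kernel.sum()
    # one depthwise kernel per channel, shape [channels, 1, k, k]
    return kernel.view(1, 1, kernel_size, kernel_size).repeat(channels, 1, 1, 1)

print(antialias_kernel(3).shape)  # torch.Size([3, 1, 13, 13]) -- matches the checkpoint's down.weight
```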