AliaksandrSiarohin / first-order-model

This repository contains the source code for the paper First Order Motion Model for Image Animation
https://aliaksandrsiarohin.github.io/first-order-model-website/
MIT License

Does not support high-resolution images #20

Open ghost opened 4 years ago

ghost commented 4 years ago

Is there a way to support high-resolution images?

AliaksandrSiarohin commented 4 years ago

1. The only reliable method is to retrain on high-resolution videos.
2. You can also try an off-the-shelf video super-resolution method.
3. Since all the networks are fully convolutional, you can actually try the pretrained checkpoints trained on 256x256 images. To do this, change the size in https://github.com/AliaksandrSiarohin/first-order-model/blob/2ed57e0e7825717a966ea9eca95e7abd61edd78f/demo.py#L121 to the size you want. It may also be beneficial to change the scale_factor parameter in the config at https://github.com/AliaksandrSiarohin/first-order-model/blob/2ed57e0e7825717a966ea9eca95e7abd61edd78f/config/vox-256.yaml#L26 and https://github.com/AliaksandrSiarohin/first-order-model/blob/2ed57e0e7825717a966ea9eca95e7abd61edd78f/config/vox-256.yaml#L38. For example, if you want 512-resolution images, change it to 0.125, so that the input resolution for these networks is always 64.

If you have any luck with these, please share your findings.
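
For concreteness, the point-3 edits for 512x512 output would look roughly like this (same files and lines as linked above, with both scale_factor entries in config/vox-256.yaml changed; a sketch, not a tested patch):

--- a/demo.py
+++ b/demo.py
-    source_image = resize(source_image, (256, 256))[..., :3]
+    source_image = resize(source_image, (512, 512))[..., :3]

--- a/config/vox-256.yaml
+++ b/config/vox-256.yaml
-      scale_factor: 0.25
+      scale_factor: 0.125   # 512 * 0.125 = 64, so the motion networks still see 64x64 inputs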

5agado commented 4 years ago

@AliaksandrSiarohin thanks for the feedback.

Note, however, that point 3 doesn't work out of the box. If I change the scale factors as you suggest, I get an error about incompatible shapes.

Also, since I'm planning to try out some super-resolution methods for this, I'm curious what you mean by an "off-the-shelf video super-resolution method"?

AliaksandrSiarohin commented 4 years ago

Can you post the error message you got? By "off-the-shelf" I mean any video super-resolution method, for example one from https://paperswithcode.com/task/video-super-resolution.

5agado commented 4 years ago

@AliaksandrSiarohin

Error(s) in loading state_dict for OcclusionAwareGenerator:
    size mismatch for dense_motion_network.down.weight: copying a param with shape torch.Size([3, 1, 13, 13]) from checkpoint, the shape in current model is torch.Size([3, 1, 29, 29]).
AliaksandrSiarohin commented 4 years ago

Ah yes, you are right. Can you try hard-coding sigma=1.5 in https://github.com/AliaksandrSiarohin/first-order-model/blob/2ed57e0e7825717a966ea9eca95e7abd61edd78f/modules/util.py#L205?
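
That is, the relevant lines in AntiAliasInterpolation2d.__init__ in modules/util.py would become (a sketch; the full diff appears further down in this thread):

# sigma = (1 / scale - 1) / 2          # original: 3.5 for scale 0.125, giving a 29x29 kernel
sigma = 1.5                             # hard-coded to match the 256-trained checkpoint
kernel_size = 2 * round(sigma * 4) + 1  # = 13, the kernel size stored in the pretrained weights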

5agado commented 4 years ago

Cool, that worked! Could it be generalized to other resolutions? I'll do some tests and comparisons using super-resolution.

AliaksandrSiarohin commented 4 years ago

What do you mean? Generalized?

5agado commented 4 years ago

Is the scale factor proportional to the image size? Like, if I wanted to try 1024x1024, should I use scale_factor = 0.0625?

Also, is the fixed sigma (1.5) valid only for size 512? What about size 1024?

I'm interested in generalizing my setup so that these values can be derived automatically from the given image size.

AliaksandrSiarohin commented 4 years ago

Yes, you should use scale_factor = 0.0625. In other words, kp_detector and dense_motion should always operate at the same 64x64 resolution. This sigma is the anti-aliasing parameter for downsampling; in principle any value could be used, and I selected the one that scikit-image uses by default. So sigma=1.5 is the default for 256x256, but I don't think it affects the results that much. You can leave it equal to 1.5, or you can avoid loading the dense_motion_network.down.weight parameter altogether by removing it from the state_dict.
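
In code terms, a rough sketch (derive_scale_factor is just an illustrative helper, not part of the repo; the checkpoint path is the one used elsewhere in this thread):

import torch

def derive_scale_factor(image_size, motion_resolution=64):
    # kp_detector and dense_motion should always operate at roughly 64x64,
    # so: 256 -> 0.25, 512 -> 0.125, 1024 -> 0.0625.
    return motion_resolution / image_size

# Alternative to hard-coding sigma: drop the anti-aliasing kernel from the checkpoint
# and keep the freshly initialised one that matches the new scale_factor.
# (The kp_detector state_dict may have a similar 'down.weight' entry.)
checkpoint = torch.load('vox-cpk.pth.tar', map_location='cpu')
checkpoint['generator'].pop('dense_motion_network.down.weight', None)
# then load with strict=False so the missing key is tolerated:
# generator.load_state_dict(checkpoint['generator'], strict=False)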

5agado commented 4 years ago

Thanks so much for the support, really valuable info here!

CarolinGao commented 4 years ago

Hi, have you retrained on high-resolution videos? If I don't retrain on new datasets and instead just do what point 3 describes, can I get a good result?

AliaksandrSiarohin commented 4 years ago

See https://github.com/tg-bomze/Face-Image-Motion-Model for point 2.

LopsidedJoaw commented 4 years ago

@AliaksandrSiarohin @5agado I have run some tests using the method detailed in point 2.

Generally the result looks like this:

[GIF: ezgif-1-3f05db10770d (https://user-images.githubusercontent.com/37964292/78800976-fda86580-79b3-11ea-866e-6dfe046b6a20.gif)]

It would be good to get your thoughts on whether this is an issue of using a checkpoint trained on 256x256 images, or if I am doing something wrong...

Many thanks for your excellent work.

pidginred commented 4 years ago

@AliaksandrSiarohin

sigma=1.5 does not work for 1024x1024 source images (with scale factor of 0.0625). I get the following error:

  File "C:\Users\admin\git\first-order-model\modules\util.py", line 180, in forward
    out = torch.cat([out, skip], dim=1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 1 and 2 in dimension 2 at c:\a\w\1\s\tmp_conda_3.6_061433\conda\conda-bld\pytorch_1544163532679\work\aten\src\thc\generic/THCTensorMath.cu:83

But I can confirm that hard-coding sigma=1.5 works only for 512x512 images (with a scale factor of 0.125).

Can you please let us know the correct setting for 1024x1024 images? Thank you for your wonderful work.

AliaksandrSiarohin commented 4 years ago

@pidginred Can you provide the full stack trace and your configs?

pidginred commented 4 years ago

@AliaksandrSiarohin Certainly! Here are the changes I made (for 1024x1024 / 0.0625) & the full error stack:

Diffs

diff --git a/config/vox-256.yaml b/config/vox-256.yaml
index abfe9a2..10fce42 100644
--- a/config/vox-256.yaml
+++ b/config/vox-256.yaml
@@ -23,7 +23,7 @@ model_params:
      temperature: 0.1
      block_expansion: 32
      max_features: 1024
-     scale_factor: 0.25
+     scale_factor: 0.0625
      num_blocks: 5
   generator_params:
     block_expansion: 64
@@ -35,7 +35,7 @@ model_params:
       block_expansion: 64
       max_features: 1024
       num_blocks: 5
-      scale_factor: 0.25
+      scale_factor: 0.0625
   discriminator_params:
     scales: [1]
     block_expansion: 32
diff --git a/demo.py b/demo.py
index 848b3df..28bea70 100644
--- a/demo.py
+++ b/demo.py
@@ -134,7 +134,7 @@ if __name__ == "__main__":
     reader.close()
     driving_video = imageio.mimread(opt.driving_video, memtest=False)

-    source_image = resize(source_image, (256, 256))[..., :3]
+    source_image = resize(source_image, (1024, 1024))[..., :3]
     driving_video = [resize(frame, (256, 256))[..., :3] for frame in driving_video]
     generator, kp_detector = load_checkpoints(config_path=opt.config, checkpoint_path=opt.checkpoint, cpu=opt.cpu)

diff --git a/modules/util.py b/modules/util.py
index 8ec1d25..cb8b149 100644
--- a/modules/util.py
+++ b/modules/util.py
@@ -202,7 +202,7 @@ class AntiAliasInterpolation2d(nn.Module):
     """
     def __init__(self, channels, scale):
         super(AntiAliasInterpolation2d, self).__init__()
-        sigma = (1 / scale - 1) / 2
+        sigma = 1.5 # Hard coded as per issues/20#issuecomment-600784060
         kernel_size = 2 * round(sigma * 4) + 1
         self.ka = kernel_size // 2
         self.kb = self.ka - 1 if kernel_size % 2 == 0 else self.ka

Full Errors

(base) C:\Users\admin\git\first-order-model-1024>python demo.py  --config config/vox-256.yaml --driving_video driving.mp4 --source_image source.jpg --checkpoint "C:\Users\admin\Downloads\vox-cpk.pth.tar" --relative --adapt_scale
demo.py:27: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
Traceback (most recent call last):
  File "demo.py", line 150, in <module>
    predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=opt.relative, adapt_movement_scale=opt.adapt_scale, cpu=opt.cpu)
  File "demo.py", line 65, in make_animation
    kp_driving_initial = kp_detector(driving[:, :, 0])
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\git\first-order-model-1024\modules\keypoint_detector.py", line 53, in forward
    feature_map = self.predictor(x)
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\git\first-order-model-1024\modules\util.py", line 196, in forward
    return self.decoder(self.encoder(x))
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\git\first-order-model-1024\modules\util.py", line 180, in forward
    out = torch.cat([out, skip], dim=1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 1 and 2 in dimension 2 at c:\a\w\1\s\tmp_conda_3.6_061433\conda\conda-bld\pytorch_1544163532679\work\aten\src\thc\generic/THCTensorMath.cu:83
eps696 commented 4 years ago

@pidginred the fixed sigma worked on my side for any resolution, including 1024x1024. it's not the cause of your problem.

pidginred commented 4 years ago

@eps696 What was your scale factor for 1024x1024? And did you get a proper output?

eps696 commented 4 years ago

@pidginred same as yours, 0.0625. but i also resize driving_video, not only source_image (which i see you don't).
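
i.e. in demo.py, something like this (a sketch for 1024x1024):

source_image = resize(source_image, (1024, 1024))[..., :3]
driving_video = [resize(frame, (1024, 1024))[..., :3] for frame in driving_video]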

pidginred commented 4 years ago

@eps696 Confirmed, that worked. However, I lost almost all eye & mouth tracking (compared to 256x256), and it results in lots of weird artifacts and very poor-quality output.

Are you getting good quality results (in terms of animation) using 1024x1024 compared to 256x256?

eps696 commented 4 years ago

@pidginred i've used it for rather artistic purposes (applying it to face-like imagery), so i cannot confirm 100%. it definitely behaved very similarly at 1024 and 256 resolutions, though. speaking of animation quality, quite a lot has been said here about the need for similar poses (or facial expressions) between the source image and the starting video frame. i think you may want to check that first.

zpeiguo commented 4 years ago

@AliaksandrSiarohin @5agado I have run some tests using the method detailed in point 2.

Generally the result looks like this:

[GIF: ezgif-1-3f05db10770d]

It would be good to get your thoughts on whether this is an issue of using a checkpoint trained on 256x256 images, or if I am doing something wrong...

Many thanks for your excellent work.

I had the same problem

zpeiguo commented 4 years ago

@eps696 Can you share the revised file? After I followed the above steps, the facial movements were normal, but the mouth could not open.

eps696 commented 4 years ago

@zpeiguo that project is not released yet, sorry. and this topic is about high-res images; check other issues for 'normality' of movements.

shillerz commented 4 years ago

@eps696 Can you share the revised file? After I followed the above steps, the facial movements were normal, but the mouth could not open.

Same here, the mouth won't open. I believe the best option is to retrain everything at 512 resolution.

boraturant commented 4 years ago

@eps696 Confirmed, that worked. However, I lost almost all eye & mouth tracking (compared to 256x256), and it results in lots of weird artifacts and very poor-quality output.

Are you getting good quality results (in terms of animation) using 1024x1024 compared to 256x256?

I have also tested the third method at 512; the animation quality is lower than at 256. I have no judgement as to why, since I expected the quality to be the same with the same 64x64 keypoint resolution.

BloodBlackNothingness commented 4 years ago

I got method 3 working on Windows 10 following the steps above and successfully output a 512 version. However, the results are of much lower quality animation wise. Hoping we can get a 512 or higher checkpoint trained soon.

bigboss97 commented 4 years ago

I got method 3 working on Windows 10 following the steps above and successfully output a 512 version. However, the results are of much lower quality animation wise. Hoping we can get a 512 or higher checkpoint trained soon.

I also followed method 3 and the animation is not acceptable :-( Mouth does not open at all and the face is distorted all the time. Maybe have to use AI to upscale 256 to 512 video :-)

BloodBlackNothingness commented 4 years ago

I got method 3 working on Windows 10 following the steps above and successfully output a 512 version. However, the results are of much lower quality animation wise. Hoping we can get a 512 or higher checkpoint trained soon.

I also followed method 3 and the animation is not acceptable :-( Mouth does not open at all and the face is distorted all the time. Maybe have to use AI to upscale 256 to 512 video :-)

Yes in theory. It depends on the video output quality I suppose. I have tried with Topaz Labs software and it also enhances distortions.

lschaupp commented 4 years ago

@AliaksandrSiarohin @5agado I have run some tests using the method detailed in point 2.

Generally the result looks like this:

[GIF: ezgif-1-3f05db10770d]

It would be good to get your thoughts on whether this is an issue of using a checkpoint trained on 256x256 images, or if I am doing something wrong...

Many thanks for your excellent work.

Which super resolution network did you end up using? :)

SophistLu commented 4 years ago

I got method 3 working on Windows 10 following the steps above and successfully output a 512 version. However, the results are of much lower quality animation wise. Hoping we can get a 512 or higher checkpoint trained soon.

In demo.py, I tried also resizing "driving_video", and it works: driving_video = [resize(frame, (512, 512))[..., :3] for frame in driving_video]

bigboss97 commented 4 years ago

In demo.py, I tried also resizing "driving_video", and it works: driving_video = [resize(frame, (512, 512))[..., :3] for frame in driving_video]

Yes, it ran. But my result (animation) was terrible.

konstiantyn commented 3 years ago

[image: Untitled] How can I change the blending mask size?

dreammonkey commented 3 years ago

Hi all,

I was wondering if anyone has succeeded in retraining the network to support 512x512 (or higher) images? Before attempting this myself, I thought it might be a good idea to check, and if so, whether that person would be kind enough to share the checkpoints/configuration with the community? 🙏

Kind regards

TracelessLe commented 3 years ago

@AliaksandrSiarohin @5agado I have run some tests using the method detailed in point 2.

Generally the result looks like this:

[GIF: ezgif-1-3f05db10770d]

It would be good to get your thoughts on whether this is an issue of using a checkpoint trained on 256x256 images, or if I am doing something wrong...

Many thanks for your excellent work.

hi @LopsidedJoaw, which super-resolution method did you use to get the 320x320 result from the 256x256 input, as your gif shows?

LopsidedJoaw commented 3 years ago

I used the same method described in the first 10 or so entries on this post.

TracelessLe commented 3 years ago

Got that, thank you. :)

adeptflax commented 3 years ago

I'm going to train a 512x512 face model and release it to the public domain.

bigboss97 commented 3 years ago

Can't wait. Please also share the process. I think many people are interested. Thanks.

adeptflax commented 3 years ago

It's going to take 5 days to train on an RTX 3090. I'm also going to train a 512x512 motion-cosegmentation model and release it to the public domain as well.

LopsidedJoaw commented 3 years ago

Legend

adeptflax commented 3 years ago

I need these models for a project I'm working on, so I might as well release them to the public.

adeptflax commented 3 years ago

I got it trained. I will be uploading it shortly.

adeptflax commented 3 years ago

Here it is: https://github.com/adeptflax/motion-models, with any additional info you might want to know. I uploaded the model to MediaFire; hopefully that doesn't cause any issues.

bigboss97 commented 3 years ago

@adeptflax Thank you so much for your hard work. I managed to run your 512 version. Just for comparison, here are my old 256 footage and the new 512 version:

https://user-images.githubusercontent.com/34834507/115713593-89234780-a3b9-11eb-8346-8d768ac9a446.mp4

https://user-images.githubusercontent.com/34834507/115713617-92acaf80-a3b9-11eb-80dc-be6bb8eb5a90.mp4

ghost commented 3 years ago

When trying to run the 512 model with this command: python demo.py --config config/vox-512.yaml --driving_video videos/2.mp4 --source_image images/4.jpg --checkpoint checkpoints/first-order-model-checkpoint-94.pth.tar --relative --adapt_scale --cpu I get the following error:

/home/USER/miniconda3/envs/first/lib/python3.7/site-packages/imageio/core/format.py:403: UserWarning: Could not read last frame of /home/USER/General/Creating animated characters/First order motion model/first-order-model/videos/2.mp4.
  warn('Could not read last frame of %s.' % uri)
/home/USER/miniconda3/envs/first/lib/python3.7/site-packages/skimage/transform/_warps.py:105: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
  warn("The default mode, 'constant', will be changed to 'reflect' in "
/home/USER/miniconda3/envs/first/lib/python3.7/site-packages/skimage/transform/_warps.py:110: UserWarning: Anti-aliasing will be enabled by default in skimage 0.15 to avoid aliasing artifacts when down-sampling images.
  warn("Anti-aliasing will be enabled by default in skimage 0.15 to "
demo.py:27: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
Traceback (most recent call last):
  File "demo.py", line 144, in <module>
    generator, kp_detector = load_checkpoints(config_path=opt.config, checkpoint_path=opt.checkpoint, cpu=opt.cpu)
  File "demo.py", line 44, in load_checkpoints
    generator.load_state_dict(checkpoint['generator'])
  File "/home/USER/miniconda3/envs/first/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for OcclusionAwareGenerator:
    size mismatch for dense_motion_network.down.weight: copying a param with shape torch.Size([3, 1, 13, 13]) from checkpoint, the shape in current model is torch.Size([3, 1, 29, 29]).

It runs fine with the 256 model. Has anyone run into the same problem or does anyone know how it could be fixed?

Update: I've fixed the problem. I had to change sigma to 1.5 as described at https://github.com/adeptflax/motion-models and https://github.com/AliaksandrSiarohin/first-order-model/issues/20#issuecomment-600784060 (how to change 256 to 512 in demo.py is also described there).

Steps to fix:

  1. in demo.py change everything from 256 to 512 around this line: source_image = resize(source_image, (256, 256))[..., :3]
  2. change sigma to 1.5 in modules/util.py: sigma = (1 / scale - 1) / 2 becomes sigma = 1.5
  3. Use videos of 512x512 resolution
shyamjithMC commented 3 years ago

Update: I've fixed the problem, I had to change sigma to 1.5 as described here: https://github.com/adeptflax/motion-models #20 (comment) (it's also described there how to change 256 to 512 in the demo.py file)

When trying to run the 512 model with this command: python demo.py --config config/vox-512.yaml --driving_video videos/2.mp4 --source_image images/4.jpg --checkpoint checkpoints/first-order-model-checkpoint-94.pth.tar --relative --adapt_scale --cpu I get the following error:

RuntimeError: Error(s) in loading state_dict for OcclusionAwareGenerator:
  size mismatch for dense_motion_network.down.weight: copying a param with shape torch.Size([3, 1, 13, 13]) from checkpoint, the shape in current model is torch.Size([3, 1, 29, 29]).

It runs fine with the 256 model. Has anyone run into the same problem or does anyone know how it could be fixed?

I have the same issue

william-nz commented 3 years ago

@adeptflax First off, thanks for doing this :)

I'm having an issue: _pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified. It comes from demo.py, line 42, in load_checkpoints: checkpoint = torch.load(checkpoint_path)

I think it has something to do with the file format of the checkpoint. Any ideas?

Animan8000 commented 3 years ago

@adeptflax First off, thanks for doing this :)

I'm having an issue: _pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified. It comes from demo.py, line 42, in load_checkpoints: checkpoint = torch.load(checkpoint_path)

I think it has something to do with the file format of the checkpoint. Any ideas?

Same error here "_pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified."
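
A likely cause, for anyone hitting this: the checkpoint was saved with PyTorch >= 1.6 (zip-based serialization) and is being loaded with an older torch. A hedged workaround sketch, assuming you have access to a newer PyTorch somewhere, is to re-save the file in the legacy format and then load that copy in the older environment:

import torch

# Run this with PyTorch >= 1.6; the re-saved file should then load with older versions.
# (The "-legacy" output filename is just an example.)
ckpt = torch.load('checkpoints/first-order-model-checkpoint-94.pth.tar', map_location='cpu')
torch.save(ckpt, 'checkpoints/first-order-model-checkpoint-94-legacy.pth.tar',
           _use_new_zipfile_serialization=False)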

william-nz commented 3 years ago

@bigboss97 did you do anything to the 512 checkpoint from @adeptflax to get it to work?