Open mazzzystar opened 4 years ago
1) Colors are right. 2) I need to compute the source coordinate for each driving coordinate (z). So the identity grid simply contains all the driving coordinates in a grid, e.g. [-1,1]x[-1,1]. 3) Hourglass is just a type of architecture. In the paper it is called Unet.
Thanks for the quick response, I will check the code part of z.
After carefully reading your paper, I still cannot understand why the reference frame R is used. What's the benefit of using S<-R<-D rather than computing S<-D directly? Sorry for the stupid question.
The lack of paired data makes this approach not meaningful. So imagine you train a network to predict a transformation S<-D from [S ¦¦ D], i.e. S concatenated with D. At training time you can only use frames from the same video, while at test time you will need to use frames from different videos. The network will likely never generalize to frames from different videos, since it never saw any.
To this end we try to learn a network that makes independent predictions for S and D. And to define those predictions properly, we introduce R.
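A toy numeric sketch of that idea (my own illustration, not the repo's code): each frame's motion with respect to R is predicted independently as a keypoint location plus a local Jacobian, and S<-D is obtained only by composing the two independent predictions.

```python
import numpy as np

def compose_s_from_d(kp_s, jac_s, kp_d, jac_d, z):
    """First-order approximation of T_{S<-D} near one keypoint:
    z maps to kp_s + J_s @ J_d^{-1} @ (z - kp_d)."""
    return kp_s + jac_s @ np.linalg.inv(jac_d) @ (z - kp_d)

# Toy case: pure translations (identity Jacobians). The keypoint in D
# lands exactly on the corresponding keypoint in S.
kp_s = np.array([0.2, 0.1])   # keypoint seen from S (predicted from S alone)
kp_d = np.array([-0.3, 0.4])  # same keypoint seen from D (predicted from D alone)
I = np.eye(2)
print(compose_s_from_d(kp_s, I, kp_d, I, kp_d))  # -> [0.2 0.1]
```

The point is that neither prediction ever sees the other frame; only the composition couples them.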
If the purpose of introducing R is to prevent "train on the same video, test on a different one", can we just use random [S, D] pairs from different videos to train the network, without using R?
Also, currently your model still trains with D and S from the same video (if I'm not wrong), so how do you prevent the situation you mentioned? https://github.com/AliaksandrSiarohin/first-order-model/blob/902f83a4217c75c842e1f536b3331c5032b703a2/frames_dataset.py#L110-L114
No, this is not the purpose. The purpose is to make independent motion predictions for S and D. If the motion predictions depend on each other, e.g. if the keypoint predictor used a concatenation of S and D, it would not generalise.
"I need to compute the source coordinate for each driving coordinate (z). So the identity grid simply contains all the driving coordinates in a grid, e.g. [-1,1]x[-1,1]."
Here I still can't understand the difference between the driving coordinate (z) and kp_driving. I think kp_driving means the relative coordinates of the K keypoints from R->D, and the driving coordinate means the local pixels around each z(k).
If so, why can we represent z with an identity_grid?
Yes, true, z is the local pixels around z_k. However, at the point where I produce the sparse motions, I don't yet know what the neighborhoods will be. So I compute the transformation for all the coordinates in the driving frame, and select the neighborhoods afterwards. Note that all possible coordinates in the driving frame that we may potentially need for warping can be produced by identity_grid.
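To make the identity-grid point concrete, here is a minimal sketch (the function name and shapes are mine, not the repo's exact code; the repo's make_coordinate_grid is similar in spirit): build the grid of every driving-frame coordinate in [-1, 1] x [-1, 1], apply a sparse motion to all of them, and pick out neighborhoods later.

```python
import torch

def make_identity_grid(h, w):
    # all driving-frame coordinates, normalized to [-1, 1]
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")  # modern PyTorch signature
    return torch.stack([gx, gy], dim=-1)            # (h, w, 2), (x, y) order

grid = make_identity_grid(4, 4)
# translation part of the sparse motion for one keypoint: shift every
# driving coordinate by (z_k in source) - (z_k in driving)
shift = torch.tensor([0.5, 0.0]) - torch.tensor([0.0, 0.0])
sparse_motion = grid + shift
print(sparse_motion.shape)  # torch.Size([4, 4, 2])
```

Because the whole grid is transformed, any neighborhood you select afterwards already has its source coordinates available.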
And, if the purpose is to make independent motion predictions for S and D, then I think we should optimize D and S independently, e.g. optimize (R, S) and (R, D).
But in your code, R is always used as an intermediate variable for S<-D. We place no constraints on R, so how can we make sure D and S are trained independently?
Later, only motion information from the driving frame is used. So the prediction will be independent of the driving appearance.
I guess it will be easier if you say how you want to implement it, and I will say why it would not work.
OK, for example:
1. Select random frames D0 and S0 which are similar, from different videos.
2. For each frame Di of D's video, compute keypoints and heatmaps, and the sparse motion between (D0, Di).
3. Predict a new S from (S0, sparse motion), i.e. with Di's motion and S0's appearance.

Currently I haven't totally understood your paper, so maybe I've gone in a wrong direction.
Yes, this would be an ideal training scheme. Note that for step 3 you need a ground truth for S. So the required training data is pairs of videos (S and D) where two objects perform exactly the same movements, and it is not possible to find that in in-the-wild videos.
But why do I need the full S video? I only use S0 in the whole training phase.
You mean the ground truth S for the reconstruction loss, which has the same appearance as S0 and the same motion as Di?
But why do I need the full S video? I only use S0 in the whole training phase.
The network is trained with a reconstruction loss, so you will need to compute
|| S - \hat{S} ||
where \hat{S} is the predicted video and S is the ground-truth video.
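A hedged sketch of why the full S video is needed, assuming a plain per-frame L1 for simplicity (the actual model uses a multi-scale perceptual loss, as the "perceptual" entries in the logs later in this thread show):

```python
import torch

S = torch.rand(8, 3, 64, 64)      # ground-truth source frames (the full S video)
S_hat = torch.rand(8, 3, 64, 64)  # stand-in for the generator's output
# without all frames of S there is nothing to compare S_hat against
recon_loss = torch.mean(torch.abs(S - S_hat))
print(recon_loss.item() >= 0.0)  # True
```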
Now I understand.
But what if I warp S0 with the sparse motion between D0 and Di in step 2? Then the warped new S0 can be the ground truth.
It's a similar idea to yours, but without the use of R.
What do you mean by align?
Sorry, it should be "warp".
In that case, what will be your training signal for obtaining the sparse motions between D0 and Di in the first place?
I think the sparse motion can be directly computed from (kp_Di, kp_D0, D0)?
Yes, but how do you obtain kp_Di and kp_D0?
With a pretrained keypoint detector?
Yes, but the whole purpose of the paper is to avoid that, to be able to train on arbitrary objects.
Though until now I didn't realize that keypoint training is coupled with the motion module. Before, I thought keypoint training was an independent part.
So here is the purpose of using R: because we do not have a labeled keypoint dataset, we need to "assume" there exists a standard frame R for each image pair (D, S), which is the origin point of a K-dim space. Then we can train the Keypoint Detector and Motion Module in an unsupervised manner.
Currently I have a good labeled keypoint dataset and a pretrained model; can I replace the Keypoint Detector with this pretrained model? What's the advantage of keypoints learned in an unsupervised way?
1) More or less. Not sure why it is K-dim; it is more like K two-dimensional coordinate systems.
2) Sure, you can replace it. Most of the time supervised keypoints work better. Unsupervised keypoints, however, can describe movements that you may forget to annotate. Plus, unsupervised keypoints can encode more stuff per keypoint: for faces, people usually use 68 supervised keypoints, while here I use 10.
Thanks a lot. I've finished reading your paper, but with a lot of questions. I will explore the code for a more precise understanding.
Sorry for taking so much of your time; I really appreciate your kind explanations.
I think there is a mistake in your video link: you have reversed the pictures of training and testing.
This should be training,
and this should be testing.
Am I right?
Yes.
Hi, why do you learn the jacobian rather than compute it directly? Is this because the transformations D<-R and S<-R remain unknown?
https://github.com/AliaksandrSiarohin/first-order-model/blob/6e18130da4e7931bc423afc85b4ea9f985a938c2/modules/keypoint_detector.py#L25
https://github.com/AliaksandrSiarohin/first-order-model/blob/6e18130da4e7931bc423afc85b4ea9f985a938c2/modules/keypoint_detector.py#L64
How can you compute it directly?
No, I'm not saying we "can" compute it. It's just my intuition that a Jacobian is usually computed rather than learned. In this model, we don't know the exact function f(x) of the transformations D<-R and S<-R, because we "learn" them with a network, so we then learn the Jacobian too, right?
Right. You can define a function as a set of points. Imagine a polynomial: you can fit a polynomial to a set of points. Similarly, you can define a function as a set of points plus the derivatives at those points. In that case you can fit the polynomial more precisely.
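The polynomial analogy can be made concrete (my own illustration of the "points + derivatives" intuition, not code from the repo): a cubic is pinned down exactly by the value and the derivative at just two points, whereas two values alone leave it underdetermined.

```python
import numpy as np

def fit_cubic(x0, f0, df0, x1, f1, df1):
    """Solve for a0 + a1*x + a2*x^2 + a3*x^3 matching values and
    derivatives at two points (Hermite interpolation)."""
    A = np.array([
        [1, x0, x0**2,   x0**3],   # p(x0)  = f0
        [0, 1,  2*x0,  3*x0**2],   # p'(x0) = df0
        [1, x1, x1**2,   x1**3],   # p(x1)  = f1
        [0, 1,  2*x1,  3*x1**2],   # p'(x1) = df1
    ], dtype=float)
    return np.linalg.solve(A, np.array([f0, df0, f1, df1], dtype=float))

# recover f(x) = x^3 from its values/derivatives at x = 0 and x = 1
coeffs = fit_cubic(0, 0, 0, 1, 1, 3)
print(np.round(coeffs, 6))  # -> [0. 0. 0. 1.]
```

The keypoint Jacobians play the same role as the derivative constraints here: they let the motion be fit more precisely around each point.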
Hi, currently I'm using labeled body pose keypoints (25 keypoints) as auxiliary data to make the keypoint-detector output more accurate predictions.
I tried several experiments but failed; below is what I did:
1. Change num_kp from 10 -> 25 (we have 25 labeled keypoints for each image).
2. Add a keypoints loss (L1) in model.py between:
* (kp_source['value'], kp_source_gt)
* (kp_driving['value'], kp_driving_gt)
But the keypoint_loss did not converge. In fact the loss decreases in the first few epochs, but goes up later. The training results show that the predicted keypoints cluster tightly around the ground-truth keypoints in the first few epochs, but later they spread out from the human body to the whole image.
I guess maybe the model needs some keypoints outside the human body as "anchors". So my second experiment was:
1. Change num_kp from 25 -> 35.
2. Only compute the L1 loss for the [0:25] keypoints, and let the model learn freely for the remaining 10 points, as you did previously.
But it still did not converge. I wonder whether it would help to omit the computation of the equivariance loss for these 25 keypoints?
The outcome of exp. 1 is weird, since I ran the same experiment with an L1 loss and it works just fine (as pseudo-gt for the taichi dataset I used keypoints from mask-rcnn). Did you preprocess your gt keypoints to lie in the range [-1, 1]?
P.S. I use 10 as weight.
Yes. Suppose the keypoint values are between 0~255; the normalization code is like:

for idx in range(keypoints.shape[0]):
    keypoints[idx][0] = keypoints[idx][0] * 2.0 / 256.0 - 1.0
    keypoints[idx][1] = keypoints[idx][1] * 2.0 / 256.0 - 1.0
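For what it's worth, the per-row loop can be written as one vectorized expression (a sketch, assuming keypoints is a NumPy array or torch tensor of shape (N, 2) with pixel coordinates in [0, 256)):

```python
import numpy as np

keypoints = np.array([[0.0, 128.0], [255.0, 64.0]])
normalized = keypoints * 2.0 / 256.0 - 1.0  # maps [0, 256) into [-1, 1)
print(normalized[0])  # -> [-1.  0.]
```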
I think the way I normalize is right because, in the first few epochs, the keypoint results shown in the log/train-vis folder are almost the same as my gt.
I tried weights of 1 and 100; with both, the keypoint_loss does not converge.
Below is some training log:
00000000) keypoints - 0.72870; perceptual - 85.09825; equivariance_value - 0.37888; equivariance_jacobian - 1.21564
00000001) keypoints - 0.72111; perceptual - 67.02689; equivariance_value - 0.39522; equivariance_jacobian - 0.91020
00000002) keypoints - 0.72671; perceptual - 62.87188; equivariance_value - 0.41847; equivariance_jacobian - 0.89381
00000003) keypoints - 0.75022; perceptual - 64.42418; equivariance_value - 0.55899; equivariance_jacobian - 1.01328
...
00000009) keypoints - 0.83284; perceptual - 65.40307; equivariance_value - 0.68750; equivariance_jacobian - 1.08492
...
00000038) keypoints - 0.88260; perceptual - 48.71728; equivariance_value - 0.57458; equivariance_jacobian - 0.92466
Below are some results:
gt keypoints (25 keypoints)
Epoch=1 (the learned keypoints are similar to the gt)
Epoch=37
You can see that the keypoint predictions get even worse after many epochs.
Have you tried setting a higher weight for this loss, like 100~1000? As a corner case, disable reconstruction and check whether the current architecture can learn these keypoints at all.
Yes, I tried weights of 1 and 100; neither converges.
Here is the log using a weight of 100 and num_kp=35 (25+10): the first 0-25 keypoints compute an L1 loss with the gt, and the last 10 keypoints are learned freely by the model.
00000000) keypoints - 50.86939; perceptual - 88.85268; equivariance_value - 0.43549; equivariance_jacobian - 1.14775
00000001) keypoints - 50.88995; perceptual - 70.30543; equivariance_value - 0.45072; equivariance_jacobian - 1.12774
00000002) keypoints - 48.77094; perceptual - 65.31566; equivariance_value - 0.42617; equivariance_jacobian - 1.06001
00000003) keypoints - 47.57986; perceptual - 62.40099; equivariance_value - 0.41568; equivariance_jacobian - 1.03299
00000004) keypoints - 46.43290; perceptual - 60.69270; equivariance_value - 0.40178; equivariance_jacobian - 0.96974
00000005) keypoints - 45.62518; perceptual - 58.77817; equivariance_value - 0.38416; equivariance_jacobian - 0.93903
00000006) keypoints - 51.50448; perceptual - 65.14197; equivariance_value - 0.45672; equivariance_jacobian - 1.12598
00000007) keypoints - 58.98692; perceptual - 65.95474; equivariance_value - 0.59971; equivariance_jacobian - 1.08690
00000008) keypoints - 53.40548; perceptual - 60.80066; equivariance_value - 0.51847; equivariance_jacobian - 0.96814
00000009) keypoints - 50.96543; perceptual - 58.55516; equivariance_value - 0.49172; equivariance_jacobian - 0.92755
00000010) keypoints - 49.79808; perceptual - 56.87955; equivariance_value - 0.47488; equivariance_jacobian - 0.91852
00000011) keypoints - 48.34119; perceptual - 55.78221; equivariance_value - 0.46173; equivariance_jacobian - 0.91345
00000012) keypoints - 47.00036; perceptual - 56.56833; equivariance_value - 0.45706; equivariance_jacobian - 0.93313
00000013) keypoints - 45.65450; perceptual - 53.86573; equivariance_value - 0.41602; equivariance_jacobian - 0.88324
00000014) keypoints - 44.64090; perceptual - 52.73506; equivariance_value - 0.39366; equivariance_jacobian - 0.86567
00000015) keypoints - 44.39249; perceptual - 52.11871; equivariance_value - 0.38404; equivariance_jacobian - 0.86152
00000016) keypoints - 44.37613; perceptual - 52.11483; equivariance_value - 0.39545; equivariance_jacobian - 0.87787
00000017) keypoints - 45.07110; perceptual - 52.68429; equivariance_value - 0.44424; equivariance_jacobian - 0.89970
00000018) keypoints - 43.92156; perceptual - 50.91551; equivariance_value - 0.39971; equivariance_jacobian - 0.87089
00000019) keypoints - 43.87069; perceptual - 50.04563; equivariance_value - 0.39175; equivariance_jacobian - 0.87592
00000020) keypoints - 46.04107; perceptual - 53.44830; equivariance_value - 0.42999; equivariance_jacobian - 1.03032
00000021) keypoints - 44.82671; perceptual - 51.26324; equivariance_value - 0.39675; equivariance_jacobian - 0.95497
00000022) keypoints - 43.65029; perceptual - 49.72903; equivariance_value - 0.38864; equivariance_jacobian - 0.92795
00000023) keypoints - 46.51496; perceptual - 51.40187; equivariance_value - 0.43407; equivariance_jacobian - 0.93916
00000024) keypoints - 48.42650; perceptual - 52.67098; equivariance_value - 0.47015; equivariance_jacobian - 0.93772
00000025) keypoints - 48.80500; perceptual - 51.81656; equivariance_value - 0.45942; equivariance_jacobian - 0.93394
00000026) keypoints - 48.32848; perceptual - 50.54402; equivariance_value - 0.46661; equivariance_jacobian - 0.98627
00000027) keypoints - 47.78838; perceptual - 53.18620; equivariance_value - 0.48301; equivariance_jacobian - 1.29926
00000028) keypoints - 49.44703; perceptual - 56.16166; equivariance_value - 0.49107; equivariance_jacobian - 1.31511
00000029) keypoints - 46.36982; perceptual - 52.66882; equivariance_value - 0.41661; equivariance_jacobian - 1.07638
00000030) keypoints - 45.66925; perceptual - 51.47015; equivariance_value - 0.40706; equivariance_jacobian - 1.04226
00000031) keypoints - 45.05334; perceptual - 50.51558; equivariance_value - 0.42078; equivariance_jacobian - 1.01517
00000032) keypoints - 46.02104; perceptual - 50.46822; equivariance_value - 0.43037; equivariance_jacobian - 1.03055
00000033) keypoints - 46.93882; perceptual - 50.67167; equivariance_value - 0.44739; equivariance_jacobian - 1.08764
00000034) keypoints - 50.27180; perceptual - 53.05074; equivariance_value - 0.48403; equivariance_jacobian - 1.05178
00000035) keypoints - 48.32777; perceptual - 50.81616; equivariance_value - 0.45841; equivariance_jacobian - 0.99100
00000036) keypoints - 47.71800; perceptual - 50.00760; equivariance_value - 0.45344; equivariance_jacobian - 0.98169
00000037) keypoints - 47.36496; perceptual - 49.54895; equivariance_value - 0.44413; equivariance_jacobian - 3.38904
00000038) keypoints - 51.94344; perceptual - 52.82084; equivariance_value - 0.50460; equivariance_jacobian - 1.08494
00000039) keypoints - 49.90947; perceptual - 51.11156; equivariance_value - 0.46687; equivariance_jacobian - 1.38772
00000040) keypoints - 49.78874; perceptual - 50.15235; equivariance_value - 0.46128; equivariance_jacobian - 1.22905
00000041) keypoints - 49.45353; perceptual - 49.30089; equivariance_value - 0.47222; equivariance_jacobian - 1.06450
00000042) keypoints - 49.01676; perceptual - 48.76686; equivariance_value - 0.47005; equivariance_jacobian - 1.08518
00000043) keypoints - 49.58611; perceptual - 49.25830; equivariance_value - 0.46346; equivariance_jacobian - 1.06364
00000044) keypoints - 48.95047; perceptual - 48.54719; equivariance_value - 0.47880; equivariance_jacobian - 1.00930
00000045) keypoints - 48.97196; perceptual - 48.53550; equivariance_value - 0.46838; equivariance_jacobian - 2.29279
00000046) keypoints - 53.06868; perceptual - 50.62860; equivariance_value - 0.53318; equivariance_jacobian - 4.47767
00000047) keypoints - 58.14801; perceptual - 51.66694; equivariance_value - 0.56764; equivariance_jacobian - 2.17262
00000048) keypoints - 55.21200; perceptual - 49.57411; equivariance_value - 0.49467; equivariance_jacobian - 1.43886
00000049) keypoints - 53.88405; perceptual - 49.19135; equivariance_value - 0.48464; equivariance_jacobian - 2.09273
00000050) keypoints - 53.68293; perceptual - 50.33981; equivariance_value - 0.50939; equivariance_jacobian - 1.98419
00000051) keypoints - 52.83159; perceptual - 48.89457; equivariance_value - 0.50193; equivariance_jacobian - 1.62819
00000052) keypoints - 52.93349; perceptual - 49.47706; equivariance_value - 0.49637; equivariance_jacobian - 2.13661
00000053) keypoints - 52.45208; perceptual - 49.43612; equivariance_value - 0.49561; equivariance_jacobian - 1.93431
00000054) keypoints - 55.45502; perceptual - 50.19393; equivariance_value - 0.51936; equivariance_jacobian - 2.87608
00000055) keypoints - 52.31169; perceptual - 48.90309; equivariance_value - 0.48177; equivariance_jacobian - 1.20168
00000056) keypoints - 50.76933; perceptual - 48.60610; equivariance_value - 0.45259; equivariance_jacobian - 1.29898
00000057) keypoints - 51.28873; perceptual - 48.50206; equivariance_value - 0.44419; equivariance_jacobian - 1.39919
00000058) keypoints - 50.67537; perceptual - 48.20142; equivariance_value - 0.44409; equivariance_jacobian - 1.18537
00000059) keypoints - 50.27187; perceptual - 47.27078; equivariance_value - 0.43233; equivariance_jacobian - 1.13803
00000060) keypoints - 49.83851; perceptual - 47.42186; equivariance_value - 0.46798; equivariance_jacobian - 1.32639
00000061) keypoints - 58.33647; perceptual - 50.94423; equivariance_value - 0.53403; equivariance_jacobian - 1.37952
00000062) keypoints - 56.09979; perceptual - 52.26786; equivariance_value - 0.54984; equivariance_jacobian - 1.88283
00000063) keypoints - 53.48256; perceptual - 51.23494; equivariance_value - 0.52915; equivariance_jacobian - 1.26982
00000064) keypoints - 52.39757; perceptual - 50.20129; equivariance_value - 0.48697; equivariance_jacobian - 1.44103
I wonder why you use num_kp=10 rather than 20/30 or something else. Does increasing the number of keypoints affect model learning?
I guess as a debugging experiment you should try disabling the reconstruction loss, and see if you are able to train just the detector with your gt keypoints.
I tried using more keypoints, but the more keypoints you use, the more likely it is that appearance will leak through them, which complicates the animation. Also, for the Taichi dataset it is not beneficial, since the model will try to model the background with these additional keypoints.
I think the kp_detector does not mean the same thing as a human pose detector.
Because in all my training results of the self-supervised kp_detector, most keypoints are around the human body, not inside it (as the gt is), and there are always 1~2 keypoints somewhere in a solid place of the background.
I think the kp_detector is not the same meaning as a human pose detector.
Not sure what you mean here.
Also, have you checked that the axes are in the same order, e.g. x first and y second, or the other way around? What results do you get if you replace the keypoints predicted with kp_detector by gt_keypoints?
What I mean is, the kp_detector does NOT really learn the human pose keypoints; it tends to learn the solid background and the most drastic changes of motion.
See these 10 freely learned keypoints (your original model): the predicted keypoints are always around the human body & the background.
I checked the order and it's correct.
I didn't try replacing the keypoints predicted by kp_detector with gt_keypoints; currently I just use the gt to constrain the predicted keypoints.
These 2 points are failing because of the equivariance loss. E.g. the equivariance property is easy to satisfy if a point lies on the edge of the image.
OK, then I suggest you try this out: use gt_kp directly.
OK, I'll try & give you a response.
Hi, I want to make sure how to replace the prediction with gt. Below is what I plan to do:
1. Don't use the U-net heatmap predictor in the kp_detector model.
2. Use the gt keypoints directly to generate the heatmap H.
3. Generate the Jacobian with a 1-layer conv(H).
As we still need the jacobian, we cannot directly remove the kp_detector module and only replace the prediction with gt in model.py.
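Step 2 could be sketched like this (a hypothetical helper of my own, similar in spirit to the repo's kp2gaussian; the function name and sigma value are assumptions):

```python
import torch

def keypoints_to_heatmaps(kp, h, w, sigma=0.1):
    """Turn K gt keypoints in [-1, 1] coordinates into K gaussian heatmaps."""
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)          # (h, w, 2)
    diff = grid[None] - kp[:, None, None, :]      # (K, h, w, 2)
    return torch.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))

kp_gt = torch.tensor([[0.0, 0.0], [0.5, -0.5]])   # K = 2 gt keypoints
H = keypoints_to_heatmaps(kp_gt, 32, 32)
print(H.shape)  # torch.Size([2, 32, 32])
```

The resulting H could then feed the 1-layer conv of step 3.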
@mazzzystar do you have some progress with this experiment? How is it going? Thanks~
@CrossLee1 Yes, I've figured out how to train "first-order-motion" with GT keypoints:
- Use an L1 loss to push the model to learn the GT keypoints, with L1_coff = 50.0.
- Change the equivariance coff below it to 50x larger (which means 500.0).
Below are my results on 16 keypoints:
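For concreteness, here is a hypothetical sketch of how those weights might sit together (shown as a Python dict; the key names mirror the repo's loss names from the logs above, the "keypoints" entry is the added supervised term rather than a stock option, and the 50x figure assumes a default equivariance weight of 10):

```python
# Hypothetical loss-weight fragment, not the repo's actual config file:
loss_weights = {
    "equivariance_value": 500,     # 50x larger than an assumed default of 10
    "equivariance_jacobian": 500,  # 50x larger than an assumed default of 10
    "keypoints": 50,               # added L1 term against the gt keypoints
}
print(loss_weights["keypoints"])  # -> 50
```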
@mazzzystar
It seems you follow the setting of experiment 1:

Experiment 1
1. Change num_kp from 10 -> 25 (we have 25 labeled keypoints for each image).
2. Add a keypoints loss (L1) in model.py between:
* (kp_source['value'], kp_source_gt)
* (kp_driving['value'], kp_driving_gt)

The difference is the coefficients of L1_loss and equivariance_jacobian, right?
Right.
Beware that these parameters are for 256x256 images with the human body segmented; when the input is a different size, or not segmented, you may need to explore a new group of parameters.
@mazzzystar
Hi, I have two questions regarding your implementation; can you help me out?
the warp_coordinates method in the Transformation class, but got it all wrong. It seems we need to solve the inverse transformation of the driving keypoints in order to get the correct warped keypoints, but that is not mathematically possible (since TPS is irreversible).
Thanks!

@Lotayou No, we do NOT use the gt keypoints as input; they are only used as supervision.
Q1: Because we want our model to learn to predict the keypoints, rather than rely on the gt input at inference time.
Q2: Related to Q1: once you learn from the gt keypoints with your kp_detector (rather than using the gt keypoints as input), then when frames are transformed you can call the kp_detector function twice to predict the original/transformed keypoints.
Update: I don't know why you would need the keypoints of the transformed frame; the transformation function is already known, so the transformation & warp process is almost transparent.
@mazzzystar Hi, have you tried using the gt keypoints as input directly? Is that better compared with using them as supervision? It seems direct input could make the process more controllable?
Thanks for your work! Here are some questions:

1) About formula 4 and your implementation: is it right that each color of variable matches the original formula? In my understanding, the reference frame R means the origin point of a K-dim space in virtual coordinates, and z denotes "the space coordinate of D", but in your code it is a standard meshgrid, the identity_grid.

2) About HourGlass: I don't see the HourGlass module in your paper, but it is in your code https://github.com/AliaksandrSiarohin/first-order-model/blob/902f83a4217c75c842e1f536b3331c5032b703a2/modules/dense_motion.py#L15
What does it stand for? Because we already have a) the warp module and b) the occlusion module.

I will appreciate your answer ~