Open mazzzystar opened 4 years ago
1) Colors are right. 2) I need to compute the source coordinate for each driving coordinate (z). So the identity grid simply contains all the driving coordinates in a grid, e.g. [-1,1]x[-1,1]. 3) Hourglass is just a type of architecture. In the paper it is called Unet.
Thanks for the quick response, I will check the code part of z.
After carefully reading your paper, I still cannot understand why the reference frame R is used. What's the benefit of using S<-R<-D rather than computing S<-D directly? Sorry for the stupid question.
The lack of paired data makes this approach not meaningful. So imagine you train a network to predict a transformation S<-D from [S ¦¦ D], i.e. S concatenated with D. At training time you can only use frames from the same video, while at test time you will need to use frames from different videos. The network will likely never generalize to frames from different videos, since it never saw any.
To this end we try to learn a network that makes independent predictions for S and D. And to define those predictions properly, we introduce R.
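A toy numeric sketch of that idea (my own illustration, not the repo's code): each frame's motion with respect to R is predicted independently as a keypoint location plus a local Jacobian, and S<-D is obtained only by composing the two independent predictions.

```python
import numpy as np

def compose_s_from_d(kp_s, jac_s, kp_d, jac_d, z):
    """First-order approximation of T_{S<-D} near one keypoint:
    z maps to kp_s + J_s @ J_d^{-1} @ (z - kp_d)."""
    return kp_s + jac_s @ np.linalg.inv(jac_d) @ (z - kp_d)

# Toy case: pure translations (identity Jacobians). The keypoint in D
# lands exactly on the corresponding keypoint in S.
kp_s = np.array([0.2, 0.1])   # keypoint seen from S (predicted from S alone)
kp_d = np.array([-0.3, 0.4])  # same keypoint seen from D (predicted from D alone)
I = np.eye(2)
print(compose_s_from_d(kp_s, I, kp_d, I, kp_d))  # -> [0.2 0.1]
```

The point is that neither prediction ever sees the other frame; only the composition couples them.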
If the purpose of introducing R is to prevent "train on the same video, test on a different one", can we just use random [S, D] pairs from different videos to train the network, without using R?
Also, currently your model still trains with D and S from the same video (if I'm not wrong), so how do you prevent the situation you mentioned? https://github.com/AliaksandrSiarohin/first-order-model/blob/902f83a4217c75c842e1f536b3331c5032b703a2/frames_dataset.py#L110-L114
No, this is not the purpose. The purpose is to make independent motion predictions for S and D. If the motion predictions depend on each other, e.g. if the keypoint predictor used a concatenation of S and D, it would not generalise.
"I need to compute the source coordinate for each driving coordinate (z). So the identity grid simply contains all the driving coordinates in a grid, e.g. [-1,1]x[-1,1]."
Here I still can't understand the difference between the driving coordinate (z) and kp_driving. I think kp_driving means the relative coordinates of the K keypoints from R->D, and the driving coordinate means the local pixels around each z(k).
If so, why can we represent z with an identity_grid?
Yes, true, z is the local pixels around z_k. However, at the point where I produce the sparse motions, I don't yet know what the neighborhoods will be. So I compute the transformation for all the coordinates in the driving frame, and select the neighborhoods afterwards. Note that all possible coordinates in the driving frame that we may potentially need for warping can be produced by identity_grid.
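To make the identity-grid point concrete, here is a minimal sketch (the function name and shapes are mine, not the repo's exact code; the repo's make_coordinate_grid is similar in spirit): build the grid of every driving-frame coordinate in [-1, 1] x [-1, 1], apply a sparse motion to all of them, and pick out neighborhoods later.

```python
import torch

def make_identity_grid(h, w):
    # all driving-frame coordinates, normalized to [-1, 1]
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")  # modern PyTorch signature
    return torch.stack([gx, gy], dim=-1)            # (h, w, 2), (x, y) order

grid = make_identity_grid(4, 4)
# translation part of the sparse motion for one keypoint: shift every
# driving coordinate by (z_k in source) - (z_k in driving)
shift = torch.tensor([0.5, 0.0]) - torch.tensor([0.0, 0.0])
sparse_motion = grid + shift
print(sparse_motion.shape)  # torch.Size([4, 4, 2])
```

Because the whole grid is transformed, any neighborhood you select afterwards already has its source coordinates available.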
And, if the purpose is to make independent motion predictions for S and D, then I think we should optimize D and S independently, e.g. optimize (R, S) and (R, D).
But in your code, R is always used as an intermediate variable for S<-D. We place no constraints on R, so how can we make sure D and S are trained independently?
Later, only motion information from the driving frame is used. So the prediction will be independent of the driving appearance.
I guess it will be easier if you say how you want to implement it, and I will say why it would not work.
OK, for example:
1. Select random frames D0 and S0 which are similar, from different videos.
2. For each frame Di of D's video, compute keypoints and heatmaps, and the sparse motion between (D0, Di).
3. Predict a new S from (S0, sparse motion), i.e. with Di's motion and S0's appearance.

Currently I haven't totally understood your paper, so maybe I've gone in a wrong direction.
Yes, this would be an ideal training scheme. Note that for step 3 you need a ground truth for S. So the required training data is pairs of videos (S and D) where two objects perform exactly the same movements, and it is not possible to find that in in-the-wild videos.
But why do I need the full S video? I only use S0 in the whole training phase.
You mean the ground truth S for the reconstruction loss, which has the same appearance as S0 and the same motion as Di?
But why do I need the full S video? I only use S0 in the whole training phase.
The network is trained with a reconstruction loss, so you will need to compute
|| S - \hat{S} ||
where \hat{S} is the predicted video and S is the ground-truth video.
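A hedged sketch of why the full S video is needed, assuming a plain per-frame L1 for simplicity (the actual model uses a multi-scale perceptual loss, as the "perceptual" entries in the logs later in this thread show):

```python
import torch

S = torch.rand(8, 3, 64, 64)      # ground-truth source frames (the full S video)
S_hat = torch.rand(8, 3, 64, 64)  # stand-in for the generator's output
# without all frames of S there is nothing to compare S_hat against
recon_loss = torch.mean(torch.abs(S - S_hat))
print(recon_loss.item() >= 0.0)  # True
```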
Now I understand.
But what if I warp S0 with the sparse motion between D0 and Di in step 2? Then the warped new S0 can be the ground truth.
It's a similar idea to yours, but without the use of R.
What do you mean by align?
Sorry, it should be "warp".
In that case, what will be your training signal for obtaining the sparse motions between D0 and Di in the first place?
I think the sparse motion can be directly computed from (kp_Di, kp_D0, D0)?
Yes, but how do you obtain kp_Di and kp_D0?
With a pretrained keypoint detector?
Yes, but the whole purpose of the paper is to avoid that, to be able to train on arbitrary objects.
Though until now I didn't realize that keypoint training is coupled with the motion module. Before, I thought keypoint training was an independent part.
So here is the purpose of using R: because we do not have a labeled keypoint dataset, we need to "assume" there exists a standard frame R for each image pair (D, S), which is the origin point of a K-dim space. Then we can train the Keypoint Detector and Motion Module in an unsupervised manner.
Currently I have a good labeled keypoint dataset and a pretrained model; can I replace the Keypoint Detector with this pretrained model? What's the advantage of keypoints learned in an unsupervised way?
1) More or less. Not sure why it is K-dim; it is more like K two-dimensional coordinate systems.
2) Sure, you can replace it. Most of the time supervised keypoints work better. Unsupervised keypoints, however, can describe movements that you may forget to annotate. Plus, unsupervised keypoints can encode more stuff per keypoint: for faces, people usually use 68 supervised keypoints, while here I use 10.
Thanks a lot. I've finished reading your paper, but with a lot of questions. I will explore the code for a more precise understanding.
Sorry for taking so much of your time; I really appreciate your kind explanations.
I think there is a mistake in your video link: you have reversed the pictures of training and testing.
This should be training,
and this should be testing.
Am I right?
Yes.
Hi, why do you learn the jacobian rather than compute it directly? Is this because the transformations D<-R and S<-R remain unknown?
https://github.com/AliaksandrSiarohin/first-order-model/blob/6e18130da4e7931bc423afc85b4ea9f985a938c2/modules/keypoint_detector.py#L25
https://github.com/AliaksandrSiarohin/first-order-model/blob/6e18130da4e7931bc423afc85b4ea9f985a938c2/modules/keypoint_detector.py#L64
How can you compute it directly?
No, I'm not saying we "can" compute it. It's just my intuition that a Jacobian is usually computed rather than learned. In this model, we don't know the exact function f(x) of the transformations D<-R and S<-R, because we "learn" them with a network, so we then learn the Jacobian too, right?
Right. You can define a function as a set of points. Imagine a polynomial: you can fit a polynomial to a set of points. Similarly, you can define a function as a set of points plus the derivatives at those points. In that case you can fit the polynomial more precisely.
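The polynomial analogy can be made concrete (my own illustration of the "points + derivatives" intuition, not code from the repo): a cubic is pinned down exactly by the value and the derivative at just two points, whereas two values alone leave it underdetermined.

```python
import numpy as np

def fit_cubic(x0, f0, df0, x1, f1, df1):
    """Solve for a0 + a1*x + a2*x^2 + a3*x^3 matching values and
    derivatives at two points (Hermite interpolation)."""
    A = np.array([
        [1, x0, x0**2,   x0**3],   # p(x0)  = f0
        [0, 1,  2*x0,  3*x0**2],   # p'(x0) = df0
        [1, x1, x1**2,   x1**3],   # p(x1)  = f1
        [0, 1,  2*x1,  3*x1**2],   # p'(x1) = df1
    ], dtype=float)
    return np.linalg.solve(A, np.array([f0, df0, f1, df1], dtype=float))

# recover f(x) = x^3 from its values/derivatives at x = 0 and x = 1
coeffs = fit_cubic(0, 0, 0, 1, 1, 3)
print(np.round(coeffs, 6))  # -> [0. 0. 0. 1.]
```

The keypoint Jacobians play the same role as the derivative constraints here: they let the motion be fit more precisely around each point.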
Hi, currently I'm using labeled body pose keypoints (25 keypoints) as auxiliary data to make the keypoint-detector output more accurate predictions.
I tried several experiments but failed; below is what I did:
1. Change num_kp from 10 -> 25 (we have 25 labeled keypoints for each image).
2. Add a keypoints loss (L1) in model.py between:
* (kp_source['value'], kp_source_gt)
* (kp_driving['value'], kp_driving_gt)
But the keypoint_loss did not converge. In fact the loss decreases in the first few epochs, but goes up later. The training results show that the predicted keypoints cluster tightly around the ground-truth keypoints in the first few epochs, but later they spread out from the human body to the whole image.
I guess maybe the model needs some keypoints outside the human body as "anchors". So my second experiment was:
1. Change num_kp from 25 -> 35.
2. Only compute the L1 loss for the [0:25] keypoints, and let the model learn freely for the remaining 10 points, as you did previously.
But it still did not converge. I wonder whether it would help to omit the computation of the equivariance loss for these 25 keypoints?
The outcome of exp. 1 is weird, since I ran the same experiment with an L1 loss and it works just fine (as pseudo-gt for the taichi dataset I used keypoints from mask-rcnn). Did you preprocess your gt keypoints to lie in the range [-1, 1]?
P.S. I use 10 as weight.
Yes. Suppose the keypoint values are between 0~255; the normalization code is like:

for idx in range(keypoints.shape[0]):
    keypoints[idx][0] = keypoints[idx][0] * 2.0 / 256.0 - 1.0
    keypoints[idx][1] = keypoints[idx][1] * 2.0 / 256.0 - 1.0
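For what it's worth, the per-row loop can be written as one vectorized expression (a sketch, assuming keypoints is a NumPy array or torch tensor of shape (N, 2) with pixel coordinates in [0, 256)):

```python
import numpy as np

keypoints = np.array([[0.0, 128.0], [255.0, 64.0]])
normalized = keypoints * 2.0 / 256.0 - 1.0  # maps [0, 256) into [-1, 1)
print(normalized[0])  # -> [-1.  0.]
```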
I think the way I normalize is right because, in the first few epochs, the keypoint results shown in the log/train-vis folder are almost the same as my gt.
I tried weights of 1 and 100; with both, the keypoint_loss does not converge.
Below is some training log:
00000000) keypoints - 0.72870; perceptual - 85.09825; equivariance_value - 0.37888; equivariance_jacobian - 1.21564
00000001) keypoints - 0.72111; perceptual - 67.02689; equivariance_value - 0.39522; equivariance_jacobian - 0.91020
00000002) keypoints - 0.72671; perceptual - 62.87188; equivariance_value - 0.41847; equivariance_jacobian - 0.89381
00000003) keypoints - 0.75022; perceptual - 64.42418; equivariance_value - 0.55899; equivariance_jacobian - 1.01328
...
00000009) keypoints - 0.83284; perceptual - 65.40307; equivariance_value - 0.68750; equivariance_jacobian - 1.08492
...
00000038) keypoints - 0.88260; perceptual - 48.71728; equivariance_value - 0.57458; equivariance_jacobian - 0.92466
Below are some results:
gt keypoints (25 keypoints)
Epoch=1 (the learned keypoints are similar to the gt)
Epoch=37
You can see that the keypoint predictions get even worse after many epochs.
Have you tried setting a higher weight for this loss, like 100~1000? As a corner case, disable reconstruction and check whether the current architecture can learn these keypoints at all.
Yes, I tried weights of 1 and 100; neither converges.
Here is the log using a weight of 100 and num_kp=35 (25+10): the first 0-25 keypoints compute an L1 loss with the gt, and the last 10 keypoints are learned freely by the model.
00000000) keypoints - 50.86939; perceptual - 88.85268; equivariance_value - 0.43549; equivariance_jacobian - 1.14775
00000001) keypoints - 50.88995; perceptual - 70.30543; equivariance_value - 0.45072; equivariance_jacobian - 1.12774
00000002) keypoints - 48.77094; perceptual - 65.31566; equivariance_value - 0.42617; equivariance_jacobian - 1.06001
00000003) keypoints - 47.57986; perceptual - 62.40099; equivariance_value - 0.41568; equivariance_jacobian - 1.03299
00000004) keypoints - 46.43290; perceptual - 60.69270; equivariance_value - 0.40178; equivariance_jacobian - 0.96974
00000005) keypoints - 45.62518; perceptual - 58.77817; equivariance_value - 0.38416; equivariance_jacobian - 0.93903
00000006) keypoints - 51.50448; perceptual - 65.14197; equivariance_value - 0.45672; equivariance_jacobian - 1.12598
00000007) keypoints - 58.98692; perceptual - 65.95474; equivariance_value - 0.59971; equivariance_jacobian - 1.08690
00000008) keypoints - 53.40548; perceptual - 60.80066; equivariance_value - 0.51847; equivariance_jacobian - 0.96814
00000009) keypoints - 50.96543; perceptual - 58.55516; equivariance_value - 0.49172; equivariance_jacobian - 0.92755
00000010) keypoints - 49.79808; perceptual - 56.87955; equivariance_value - 0.47488; equivariance_jacobian - 0.91852
00000011) keypoints - 48.34119; perceptual - 55.78221; equivariance_value - 0.46173; equivariance_jacobian - 0.91345
00000012) keypoints - 47.00036; perceptual - 56.56833; equivariance_value - 0.45706; equivariance_jacobian - 0.93313
00000013) keypoints - 45.65450; perceptual - 53.86573; equivariance_value - 0.41602; equivariance_jacobian - 0.88324
00000014) keypoints - 44.64090; perceptual - 52.73506; equivariance_value - 0.39366; equivariance_jacobian - 0.86567
00000015) keypoints - 44.39249; perceptual - 52.11871; equivariance_value - 0.38404; equivariance_jacobian - 0.86152
00000016) keypoints - 44.37613; perceptual - 52.11483; equivariance_value - 0.39545; equivariance_jacobian - 0.87787
00000017) keypoints - 45.07110; perceptual - 52.68429; equivariance_value - 0.44424; equivariance_jacobian - 0.89970
00000018) keypoints - 43.92156; perceptual - 50.91551; equivariance_value - 0.39971; equivariance_jacobian - 0.87089
00000019) keypoints - 43.87069; perceptual - 50.04563; equivariance_value - 0.39175; equivariance_jacobian - 0.87592
00000020) keypoints - 46.04107; perceptual - 53.44830; equivariance_value - 0.42999; equivariance_jacobian - 1.03032
00000021) keypoints - 44.82671; perceptual - 51.26324; equivariance_value - 0.39675; equivariance_jacobian - 0.95497
00000022) keypoints - 43.65029; perceptual - 49.72903; equivariance_value - 0.38864; equivariance_jacobian - 0.92795
00000023) keypoints - 46.51496; perceptual - 51.40187; equivariance_value - 0.43407; equivariance_jacobian - 0.93916
00000024) keypoints - 48.42650; perceptual - 52.67098; equivariance_value - 0.47015; equivariance_jacobian - 0.93772
00000025) keypoints - 48.80500; perceptual - 51.81656; equivariance_value - 0.45942; equivariance_jacobian - 0.93394
00000026) keypoints - 48.32848; perceptual - 50.54402; equivariance_value - 0.46661; equivariance_jacobian - 0.98627
00000027) keypoints - 47.78838; perceptual - 53.18620; equivariance_value - 0.48301; equivariance_jacobian - 1.29926
00000028) keypoints - 49.44703; perceptual - 56.16166; equivariance_value - 0.49107; equivariance_jacobian - 1.31511
00000029) keypoints - 46.36982; perceptual - 52.66882; equivariance_value - 0.41661; equivariance_jacobian - 1.07638
00000030) keypoints - 45.66925; perceptual - 51.47015; equivariance_value - 0.40706; equivariance_jacobian - 1.04226
00000031) keypoints - 45.05334; perceptual - 50.51558; equivariance_value - 0.42078; equivariance_jacobian - 1.01517
00000032) keypoints - 46.02104; perceptual - 50.46822; equivariance_value - 0.43037; equivariance_jacobian - 1.03055
00000033) keypoints - 46.93882; perceptual - 50.67167; equivariance_value - 0.44739; equivariance_jacobian - 1.08764
00000034) keypoints - 50.27180; perceptual - 53.05074; equivariance_value - 0.48403; equivariance_jacobian - 1.05178
00000035) keypoints - 48.32777; perceptual - 50.81616; equivariance_value - 0.45841; equivariance_jacobian - 0.99100
00000036) keypoints - 47.71800; perceptual - 50.00760; equivariance_value - 0.45344; equivariance_jacobian - 0.98169
00000037) keypoints - 47.36496; perceptual - 49.54895; equivariance_value - 0.44413; equivariance_jacobian - 3.38904
00000038) keypoints - 51.94344; perceptual - 52.82084; equivariance_value - 0.50460; equivariance_jacobian - 1.08494
00000039) keypoints - 49.90947; perceptual - 51.11156; equivariance_value - 0.46687; equivariance_jacobian - 1.38772
00000040) keypoints - 49.78874; perceptual - 50.15235; equivariance_value - 0.46128; equivariance_jacobian - 1.22905
00000041) keypoints - 49.45353; perceptual - 49.30089; equivariance_value - 0.47222; equivariance_jacobian - 1.06450
00000042) keypoints - 49.01676; perceptual - 48.76686; equivariance_value - 0.47005; equivariance_jacobian - 1.08518
00000043) keypoints - 49.58611; perceptual - 49.25830; equivariance_value - 0.46346; equivariance_jacobian - 1.06364
00000044) keypoints - 48.95047; perceptual - 48.54719; equivariance_value - 0.47880; equivariance_jacobian - 1.00930
00000045) keypoints - 48.97196; perceptual - 48.53550; equivariance_value - 0.46838; equivariance_jacobian - 2.29279
00000046) keypoints - 53.06868; perceptual - 50.62860; equivariance_value - 0.53318; equivariance_jacobian - 4.47767
00000047) keypoints - 58.14801; perceptual - 51.66694; equivariance_value - 0.56764; equivariance_jacobian - 2.17262
00000048) keypoints - 55.21200; perceptual - 49.57411; equivariance_value - 0.49467; equivariance_jacobian - 1.43886
00000049) keypoints - 53.88405; perceptual - 49.19135; equivariance_value - 0.48464; equivariance_jacobian - 2.09273
00000050) keypoints - 53.68293; perceptual - 50.33981; equivariance_value - 0.50939; equivariance_jacobian - 1.98419
00000051) keypoints - 52.83159; perceptual - 48.89457; equivariance_value - 0.50193; equivariance_jacobian - 1.62819
00000052) keypoints - 52.93349; perceptual - 49.47706; equivariance_value - 0.49637; equivariance_jacobian - 2.13661
00000053) keypoints - 52.45208; perceptual - 49.43612; equivariance_value - 0.49561; equivariance_jacobian - 1.93431
00000054) keypoints - 55.45502; perceptual - 50.19393; equivariance_value - 0.51936; equivariance_jacobian - 2.87608
00000055) keypoints - 52.31169; perceptual - 48.90309; equivariance_value - 0.48177; equivariance_jacobian - 1.20168
00000056) keypoints - 50.76933; perceptual - 48.60610; equivariance_value - 0.45259; equivariance_jacobian - 1.29898
00000057) keypoints - 51.28873; perceptual - 48.50206; equivariance_value - 0.44419; equivariance_jacobian - 1.39919
00000058) keypoints - 50.67537; perceptual - 48.20142; equivariance_value - 0.44409; equivariance_jacobian - 1.18537
00000059) keypoints - 50.27187; perceptual - 47.27078; equivariance_value - 0.43233; equivariance_jacobian - 1.13803
00000060) keypoints - 49.83851; perceptual - 47.42186; equivariance_value - 0.46798; equivariance_jacobian - 1.32639
00000061) keypoints - 58.33647; perceptual - 50.94423; equivariance_value - 0.53403; equivariance_jacobian - 1.37952
00000062) keypoints - 56.09979; perceptual - 52.26786; equivariance_value - 0.54984; equivariance_jacobian - 1.88283
00000063) keypoints - 53.48256; perceptual - 51.23494; equivariance_value - 0.52915; equivariance_jacobian - 1.26982
00000064) keypoints - 52.39757; perceptual - 50.20129; equivariance_value - 0.48697; equivariance_jacobian - 1.44103
I wonder why you use num_kp=10 rather than 20/30 or something else. Does increasing the number of keypoints affect model learning?
I guess as a debugging experiment you should try disabling the reconstruction loss, and see if you are able to train just the detector with your gt keypoints.
I tried using more keypoints, but the more keypoints you use, the more likely it is that appearance will leak through them, which complicates the animation. Also, for the Taichi dataset it is not beneficial, since the model will try to model the background with these additional keypoints.
I think the kp_detector does not mean the same thing as a human pose detector.
Because in all my training results of the self-supervised kp_detector, most keypoints are around the human body, not inside it (as the gt is), and there are always 1~2 keypoints somewhere in a solid place of the background.
I think the kp_detector is not the same meaning as a human pose detector.
Not sure what you mean here.
Also, have you checked that the axes are in the same order, e.g. x first and y second, or the other way around? What results do you get if you replace the keypoints predicted with kp_detector by gt_keypoints?
What I mean is, the kp_detector does NOT really learn the human pose keypoints; it tends to learn the solid background and the most drastic changes of motion.
See these 10 freely learned keypoints (your original model): the predicted keypoints are always around the human body & the background.
I checked the order and it's correct.
I didn't try replacing the keypoints predicted by kp_detector with gt_keypoints; currently I just use the gt to constrain the predicted keypoints.
These 2 points are failing because of the equivariance loss. E.g. the equivariance property is easy to satisfy if a point lies on the edge of the image.
OK, then I suggest you try this out: use gt_kp directly.
OK, I'll try & give you a response.
Hi, I want to make sure how to replace the prediction with gt. Below is what I plan to do:
1. Don't use the U-net heatmap predictor in the kp_detector model.
2. Use the gt keypoints directly to generate the heatmap H.
3. Generate the Jacobian with a 1-layer conv(H).
As we still need the jacobian, we cannot directly remove the kp_detector module and only replace the prediction with gt in model.py.
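Step 2 could be sketched like this (a hypothetical helper of my own, similar in spirit to the repo's kp2gaussian; the function name and sigma value are assumptions):

```python
import torch

def keypoints_to_heatmaps(kp, h, w, sigma=0.1):
    """Turn K gt keypoints in [-1, 1] coordinates into K gaussian heatmaps."""
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)          # (h, w, 2)
    diff = grid[None] - kp[:, None, None, :]      # (K, h, w, 2)
    return torch.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))

kp_gt = torch.tensor([[0.0, 0.0], [0.5, -0.5]])   # K = 2 gt keypoints
H = keypoints_to_heatmaps(kp_gt, 32, 32)
print(H.shape)  # torch.Size([2, 32, 32])
```

The resulting H could then feed the 1-layer conv of step 3.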
@mazzzystar do you have some progress with this experiment? How is it going? Thanks~
@CrossLee1 Yes, I've figured out how to train "first-order-motion" with GT keypoints:
- Use an L1 loss to push the model to learn the GT keypoints, with L1_coff = 50.0.
- Change the equivariance coff below it to 50x larger (which means 500.0).
Below are my results on 16 keypoints:
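For concreteness, here is a hypothetical sketch of how those weights might sit together (shown as a Python dict; the key names mirror the repo's loss names from the logs above, the "keypoints" entry is the added supervised term rather than a stock option, and the 50x figure assumes a default equivariance weight of 10):

```python
# Hypothetical loss-weight fragment, not the repo's actual config file:
loss_weights = {
    "equivariance_value": 500,     # 50x larger than an assumed default of 10
    "equivariance_jacobian": 500,  # 50x larger than an assumed default of 10
    "keypoints": 50,               # added L1 term against the gt keypoints
}
print(loss_weights["keypoints"])  # -> 50
```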
@mazzzystar
It seems you follow the setting of experiment 1:

Experiment 1
1. Change num_kp from 10 -> 25 (we have 25 labeled keypoints for each image).
2. Add a keypoints loss (L1) in model.py between:
* (kp_source['value'], kp_source_gt)
* (kp_driving['value'], kp_driving_gt)

The difference is the coefficients of L1_loss and equivariance_jacobian, right?
Right.
Beware that these parameters are for 256x256 images with the human body segmented; when the input is a different size, or not segmented, you may need to explore a new group of parameters.
@mazzzystar
Hi, I have two questions regarding your implementation; can you help me out?
the warp_coordinates method in the Transformation class, but got it all wrong. It seems we need to solve the inverse transformation of the driving keypoints in order to get the correct warped keypoints, but that is not mathematically possible (since TPS is irreversible).
Thanks!

@Lotayou No, we do NOT use the gt keypoints as input; they are only used as supervision.
Q1: Because we want our model to learn to predict the keypoints, rather than rely on the gt input at inference time.
Q2: Related to Q1: once you learn from the gt keypoints with your kp_detector (rather than using the gt keypoints as input), then when frames are transformed you can call the kp_detector function twice to predict the original/transformed keypoints.
Update: I don't know why you would need the keypoints of the transformed frame; the transformation function is already known, so the transformation & warp process is almost transparent.
@mazzzystar Hi, have you tried using the gt keypoints as input directly? Is that better compared with using them as supervision? It seems direct input could make the process more controllable?
Thanks for your work! Here are some questions:

1) About formula 4 and your implementation: is it right that each color of variable matches the original formula? In my understanding, the reference frame R means the origin point of a K-dim space in virtual coordinates, and z denotes "the space coordinate of D", but in your code it is a standard meshgrid, the identity_grid.

2) About HourGlass: I don't see the HourGlass module in your paper, but it is in your code https://github.com/AliaksandrSiarohin/first-order-model/blob/902f83a4217c75c842e1f536b3331c5032b703a2/modules/dense_motion.py#L15
What does it stand for? Because we already have a) the warp module and b) the occlusion module.

I will appreciate your answer ~