akanazawa / cmr

Project repo for Learning Category-Specific Mesh Reconstruction from Image Collections
MIT License

Snapshot generated after training not generating same results as provided snapshot #8

Closed · BornInWater closed this 5 years ago

BornInWater commented 5 years ago

Hi, when I tested the model using the provided snapshot, I got the result below. [image: saved_image]

But when I trained the model following the provided instructions, the resulting snapshot does not produce similar output. [image: saved_image2_500]

I checked snapshots from later in training as well; they give similar output to the above.

The only thing I changed is that I ported the code to PyTorch 1.0, which required one small change: replacing expressions like self.total_loss.data[0] with self.total_loss.item(). I am fairly sure this is not the cause, but I mention it for completeness.
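(For reference, this is the kind of change I mean; a minimal, illustrative snippet rather than the repo's exact code:)

```python
import torch

total_loss = torch.tensor(0.137)  # stand-in for self.total_loss (a 0-dim scalar tensor)

# PyTorch 0.3 idiom: index into .data of a scalar tensor
# value = total_loss.data[0]      # raises IndexError on recent PyTorch, since 0-dim tensors can't be indexed

# PyTorch 0.4+/1.0 idiom: .item() extracts the Python number
value = total_loss.item()
print(value)
```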

Could you suggest what I might be doing wrong?

EDIT: The losses are much higher than those in the loss_log that came with the downloaded snapshot. [image: losses]

BornInWater commented 5 years ago

I have narrowed it down somewhat. All the losses during training seem to match the loss_log that came with the downloaded snapshot, apart from tri_loss. Even though its raw value looks the same as in the provided logs, when I checked it individually it contributes a lot to the total loss because it is multiplied by "triangle_reg_wt", which is set to 30 in shape.py. Should it be this high?

UPDATE: I trained the network with "triangle_reg_wt" set to 1. Although the losses start out similar, after 500 epochs they are higher than those in the provided logs, and the generated outputs look like the ones I uploaded before. [image: saved_image]
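(To make explicit what I mean by the weight: roughly, the objective sums the individual terms with tri_loss scaled by triangle_reg_wt. The sketch below uses made-up term names and values, not the repo's code; only the weight of 30 comes from shape.py.)

```python
# Rough sketch of how the weighted regularizer enters the objective.
# triangle_reg_wt is the flag I'm referring to (set to 30 in shape.py);
# the other term names/values here are illustrative placeholders.
triangle_reg_wt = 30.0

def total_objective(kp_loss, mask_loss, tri_loss, other_terms=0.0):
    return kp_loss + mask_loss + triangle_reg_wt * tri_loss + other_terms

# With a weight of 30, even a modest tri_loss (~0.2) adds ~6 to the total:
print(total_objective(kp_loss=0.1, mask_loss=0.1, tri_loss=0.2))  # ~6.2
```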

BornInWater commented 5 years ago

Update: With PyTorch 0.3 + the proper NMR version:
(epoch: 0, iters: 1) smoothed_total_loss: 0.137 total_loss: 13.663 kp_loss: 0.111 mask_loss: 0.096 vert2kp_loss: 6.465 deform_reg: 0.068 tri_loss: 0.205 cam_loss: 0.659 tex_loss: 1.839 tex_dt_loss: 0.054
(epoch: 0, iters: 2) smoothed_total_loss: 0.258 total_loss: 12.314 kp_loss: 0.097 mask_loss: 0.099 vert2kp_loss: 6.465 deform_reg: 0.065 tri_loss: 0.178 cam_loss: 0.628 tex_loss: 1.671 tex_dt_loss: 0.169
(epoch: 0, iters: 3) smoothed_total_loss: 0.377 total_loss: 12.132 kp_loss: 0.114 mask_loss: 0.092 vert2kp_loss: 6.465 deform_reg: 0.063 tri_loss: 0.156 cam_loss: 0.617 tex_loss: 1.799 tex_dt_loss: 0.124
(epoch: 0, iters: 4) smoothed_total_loss: 0.485 total_loss: 11.157 kp_loss: 0.108 mask_loss: 0.105 vert2kp_loss: 6.465 deform_reg: 0.060 tri_loss: 0.138 cam_loss: 0.453 tex_loss: 1.871 tex_dt_loss: 0.187
(epoch: 0, iters: 5) smoothed_total_loss: 0.588 total_loss: 10.785 kp_loss: 0.100 mask_loss: 0.095 vert2kp_loss: 6.465 deform_reg: 0.058 tri_loss: 0.127 cam_loss: 0.580 tex_loss: 1.864 tex_dt_loss: 0.118
(epoch: 0, iters: 6) smoothed_total_loss: 0.692 total_loss: 11.033 kp_loss: 0.120 mask_loss: 0.105 vert2kp_loss: 6.465 deform_reg: 0.056 tri_loss: 0.116 cam_loss: 0.579 tex_loss: 1.897 tex_dt_loss: 0.068
(epoch: 0, iters: 7) smoothed_total_loss: 0.788 total_loss: 10.243 kp_loss: 0.113 mask_loss: 0.106 vert2kp_loss: 6.465 deform_reg: 0.054 tri_loss: 0.108 cam_loss: 0.409 tex_loss: 1.868 tex_dt_loss: 0.131
(epoch: 0, iters: 8) smoothed_total_loss: 0.874 total_loss: 9.365 kp_loss: 0.098 mask_loss: 0.094 vert2kp_loss: 6.465 deform_reg: 0.053 tri_loss: 0.098 cam_loss: 0.402 tex_loss: 1.781 tex_dt_loss: 0.082
(epoch: 0, iters: 9) smoothed_total_loss: 0.964 total_loss: 9.923 kp_loss: 0.102 mask_loss: 0.108 vert2kp_loss: 6.465 deform_reg: 0.051 tri_loss: 0.093 cam_loss: 0.661 tex_loss: 1.733 tex_dt_loss: 0.251
(epoch: 0, iters: 10) smoothed_total_loss: 1.049 total_loss: 9.399 kp_loss: 0.111 mask_loss: 0.089 vert2kp_loss: 6.465 deform_reg: 0.050 tri_loss: 0.088 cam_loss: 0.405 tex_loss: 1.737 tex_dt_loss: 0.104
(epoch: 0, iters: 11) smoothed_total_loss: 1.136 total_loss: 9.795 kp_loss: 0.123 mask_loss: 0.100 vert2kp_loss: 6.465 deform_reg: 0.048 tri_loss: 0.085 cam_loss: 0.459 tex_loss: 1.782 tex_dt_loss: 0.078
(epoch: 0, iters: 12) smoothed_total_loss: 1.216 total_loss: 9.095 kp_loss: 0.107 mask_loss: 0.097 vert2kp_loss: 6.465 deform_reg: 0.047 tri_loss: 0.081 cam_loss: 0.402 tex_loss: 1.836 tex_dt_loss: 0.112
(epoch: 0, iters: 13) smoothed_total_loss: 1.292 total_loss: 8.835 kp_loss: 0.106 mask_loss: 0.073 vert2kp_loss: 6.465 deform_reg: 0.046 tri_loss: 0.076 cam_loss: 0.425 tex_loss: 1.565 tex_dt_loss: 0.177
(epoch: 0, iters: 14) smoothed_total_loss: 1.371 total_loss: 9.181 kp_loss: 0.102 mask_loss: 0.102 vert2kp_loss: 6.465 deform_reg: 0.045 tri_loss: 0.076 cam_loss: 0.602 tex_loss: 1.832 tex_dt_loss: 0.024
(epoch: 0, iters: 15) smoothed_total_loss: 1.445 total_loss: 8.843 kp_loss: 0.113 mask_loss: 0.089 vert2kp_loss: 6.465 deform_reg: 0.044 tri_loss: 0.071 cam_loss: 0.371 tex_loss: 1.823 tex_dt_loss: 0.031
(epoch: 0, iters: 16) smoothed_total_loss: 1.519 total_loss: 8.836 kp_loss: 0.112 mask_loss: 0.088 vert2kp_loss: 6.465 deform_reg: 0.043 tri_loss: 0.066 cam_loss: 0.437 tex_loss: 1.809 tex_dt_loss: 0.131
(epoch: 0, iters: 17) smoothed_total_loss: 1.593 total_loss: 8.941 kp_loss: 0.110 mask_loss: 0.083 vert2kp_loss: 6.465 deform_reg: 0.043 tri_loss: 0.069 cam_loss: 0.492 tex_loss: 1.778 tex_dt_loss: 0.134
(epoch: 0, iters: 18) smoothed_total_loss: 1.663 total_loss: 8.570 kp_loss: 0.108 mask_loss: 0.097 vert2kp_loss: 6.465 deform_reg: 0.042 tri_loss: 0.061 cam_loss: 0.468 tex_loss: 1.775 tex_dt_loss: 0.037
(epoch: 0, iters: 19) smoothed_total_loss: 1.727 total_loss: 8.017 kp_loss: 0.097 mask_loss: 0.108 vert2kp_loss: 6.465 deform_reg: 0.042 tri_loss: 0.064 cam_loss: 0.298 tex_loss: 1.786 tex_dt_loss: 0.087
(epoch: 0, iters: 20) smoothed_total_loss: 1.791 total_loss: 8.134 kp_loss: 0.100 mask_loss: 0.085 vert2kp_loss: 6.465 deform_reg: 0.041 tri_loss: 0.058 cam_loss: 0.448 tex_loss: 1.590 tex_dt_loss: 0.181

With PyTorch 1.0 and recent NMR:
(epoch: 0, iters: 1) smoothed_total_loss: 0.132 total_loss: 13.186 kp_loss: 0.102 mask_loss: 0.099 vert2kp_loss: 6.465 deform_reg: 0.067 tri_loss: 0.202 cam_loss: 0.606 tex_loss: 1.779 tex_dt_loss: 0.121
(epoch: 0, iters: 2) smoothed_total_loss: 0.252 total_loss: 12.129 kp_loss: 0.092 mask_loss: 0.091 vert2kp_loss: 6.465 deform_reg: 0.065 tri_loss: 0.178 cam_loss: 0.614 tex_loss: 1.692 tex_dt_loss: 0.208
(epoch: 0, iters: 3) smoothed_total_loss: 0.367 total_loss: 11.773 kp_loss: 0.096 mask_loss: 0.094 vert2kp_loss: 6.465 deform_reg: 0.063 tri_loss: 0.175 cam_loss: 0.438 tex_loss: 1.587 tex_dt_loss: 0.197
(epoch: 0, iters: 4) smoothed_total_loss: 0.486 total_loss: 12.280 kp_loss: 0.099 mask_loss: 0.097 vert2kp_loss: 6.465 deform_reg: 0.062 tri_loss: 0.178 cam_loss: 0.585 tex_loss: 1.856 tex_dt_loss: 0.052
(epoch: 0, iters: 5) smoothed_total_loss: 0.603 total_loss: 12.217 kp_loss: 0.090 mask_loss: 0.095 vert2kp_loss: 6.465 deform_reg: 0.060 tri_loss: 0.183 cam_loss: 0.616 tex_loss: 1.652 tex_dt_loss: 0.273
(epoch: 0, iters: 6) smoothed_total_loss: 0.727 total_loss: 12.914 kp_loss: 0.108 mask_loss: 0.101 vert2kp_loss: 6.465 deform_reg: 0.059 tri_loss: 0.190 cam_loss: 0.576 tex_loss: 1.839 tex_dt_loss: 0.153
(epoch: 0, iters: 7) smoothed_total_loss: 0.849 total_loss: 12.935 kp_loss: 0.106 mask_loss: 0.095 vert2kp_loss: 6.465 deform_reg: 0.058 tri_loss: 0.196 cam_loss: 0.612 tex_loss: 1.643 tex_dt_loss: 0.079
(epoch: 0, iters: 8) smoothed_total_loss: 0.970 total_loss: 12.998 kp_loss: 0.107 mask_loss: 0.095 vert2kp_loss: 6.465 deform_reg: 0.057 tri_loss: 0.202 cam_loss: 0.493 tex_loss: 1.771 tex_dt_loss: 0.113
(epoch: 0, iters: 9) smoothed_total_loss: 1.091 total_loss: 13.090 kp_loss: 0.105 mask_loss: 0.080 vert2kp_loss: 6.465 deform_reg: 0.056 tri_loss: 0.209 cam_loss: 0.504 tex_loss: 1.705 tex_dt_loss: 0.100
(epoch: 0, iters: 10) smoothed_total_loss: 1.217 total_loss: 13.662 kp_loss: 0.118 mask_loss: 0.086 vert2kp_loss: 6.465 deform_reg: 0.055 tri_loss: 0.215 cam_loss: 0.486 tex_loss: 1.794 tex_dt_loss: 0.085
(epoch: 0, iters: 11) smoothed_total_loss: 1.337 total_loss: 13.174 kp_loss: 0.106 mask_loss: 0.089 vert2kp_loss: 6.465 deform_reg: 0.055 tri_loss: 0.220 cam_loss: 0.376 tex_loss: 1.691 tex_dt_loss: 0.086
(epoch: 0, iters: 12) smoothed_total_loss: 1.460 total_loss: 13.701 kp_loss: 0.102 mask_loss: 0.112 vert2kp_loss: 6.465 deform_reg: 0.054 tri_loss: 0.226 cam_loss: 0.534 tex_loss: 1.931 tex_dt_loss: 0.044
(epoch: 0, iters: 13) smoothed_total_loss: 1.581 total_loss: 13.499 kp_loss: 0.104 mask_loss: 0.093 vert2kp_loss: 6.465 deform_reg: 0.053 tri_loss: 0.233 cam_loss: 0.368 tex_loss: 1.669 tex_dt_loss: 0.179
(epoch: 0, iters: 14) smoothed_total_loss: 1.700 total_loss: 13.479 kp_loss: 0.093 mask_loss: 0.095 vert2kp_loss: 6.465 deform_reg: 0.052 tri_loss: 0.237 cam_loss: 0.452 tex_loss: 1.717 tex_dt_loss: 0.123
(epoch: 0, iters: 15) smoothed_total_loss: 1.827 total_loss: 14.385 kp_loss: 0.109 mask_loss: 0.114 vert2kp_loss: 6.465 deform_reg: 0.052 tri_loss: 0.242 cam_loss: 0.571 tex_loss: 1.867 tex_dt_loss: 0.049
(epoch: 0, iters: 16) smoothed_total_loss: 1.948 total_loss: 14.014 kp_loss: 0.102 mask_loss: 0.100 vert2kp_loss: 6.465 deform_reg: 0.051 tri_loss: 0.247 cam_loss: 0.456 tex_loss: 1.790 tex_dt_loss: 0.025
(epoch: 0, iters: 17) smoothed_total_loss: 2.075 total_loss: 14.595 kp_loss: 0.116 mask_loss: 0.100 vert2kp_loss: 6.465 deform_reg: 0.051 tri_loss: 0.253 cam_loss: 0.414 tex_loss: 1.886 tex_dt_loss: 0.063
(epoch: 0, iters: 18) smoothed_total_loss: 2.198 total_loss: 14.427 kp_loss: 0.104 mask_loss: 0.101 vert2kp_loss: 6.465 deform_reg: 0.050 tri_loss: 0.255 cam_loss: 0.471 tex_loss: 1.858 tex_dt_loss: 0.109
(epoch: 0, iters: 19) smoothed_total_loss: 2.319 total_loss: 14.267 kp_loss: 0.106 mask_loss: 0.086 vert2kp_loss: 6.465 deform_reg: 0.050 tri_loss: 0.260 cam_loss: 0.386 tex_loss: 1.562 tex_dt_loss: 0.051
(epoch: 0, iters: 20) smoothed_total_loss: 2.442 total_loss: 14.568 kp_loss: 0.106 mask_loss: 0.099 vert2kp_loss: 6.465 deform_reg: 0.049 tri_loss: 0.263 cam_loss: 0.414 tex_loss: 1.706 tex_dt_loss: 0.173

So, as can be seen, tri_loss starts out the same in both cases, but with PyTorch 1.0 it does not decrease, whereas with PyTorch 0.3 it goes down from 0.205 to 0.058 within 20 iterations. In the PyTorch 1.0 case it even seems to be slightly increasing. I trained further and observed that tri_loss had risen to around 0.3 by the 3rd epoch.
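(For context, if I understand the code correctly, tri_loss is a Laplacian-style smoothness regularizer on the predicted mesh, so higher-frequency vertex updates would push it up. The sketch below only illustrates the kind of quantity such a term measures; it is not the repo's implementation.)

```python
import torch

def laplacian_smoothness(verts, neighbors):
    # Toy uniform-Laplacian smoothness: how far each vertex lies from the
    # centroid of its neighbors. Illustrative only, not the repo's tri_loss.
    # verts: (V, 3) tensor; neighbors: list of V lists of neighbor indices.
    total = verts.new_zeros(())
    for i, nbrs in enumerate(neighbors):
        centroid = verts[list(nbrs)].mean(dim=0)
        total = total + torch.sum((verts[i] - centroid) ** 2)
    return total / len(neighbors)

# Tiny example: a single triangle (each vertex's neighbors are the other two)
verts = torch.tensor([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]], requires_grad=True)
print(laplacian_smoothness(verts, [[1, 2], [0, 2], [0, 1]]))
```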

shubhtuls commented 5 years ago

Hi, thanks for investigating this. My suspicion is that some update in NMR yields higher-frequency gradients to the shape (and therefore causes different behavior of the regularization loss). I would recommend training with the versions we mention in the CMR repository to best reproduce the results.

If that is not feasible, you could look at the commits to rasterize.py in the NMR repository and selectively undo the changes in one or more of those commits to try to get behavior consistent with ours. I think there have only been 3 commits that changed rasterize.py since the version we used (see the history), and undoing those changes might help.

BornInWater commented 5 years ago

I undid the 2 commits right after version 1.1.0. The most recent commit is for compatibility with Chainer 5.0. Undoing those 2 commits did not change anything: same loss behavior as before. I think there must be implementation changes in Chainer 5.0, and the changes made to rasterize.py to accommodate them are causing the strange behavior of "tri_loss".

I will try installing version 1.1.0 of NMR. It did not work before, but let's see.

BornInWater commented 5 years ago

Update: I tried NMR 1.1.0 with cupy 2.3 and Chainer 3.3.0; it did not work. I then tried NMR 1.1.0 with Chainer 4.4.0 and Cupy 4.4.0 and got the same loss problems as before. So I reckon the problem lies in the implementation changes in Chainer rather than in CMR.
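(In case it helps anyone else reproducing this, a quick way to confirm which versions are actually being picked up, assuming torch, chainer, and cupy are all importable in the training environment:)

```python
import torch, chainer, cupy

# Print the versions (and the CUDA build torch was compiled against)
# to confirm which combination the training run is actually using.
print("torch  :", torch.__version__, "| CUDA:", torch.version.cuda)
print("chainer:", chainer.__version__)
print("cupy   :", cupy.__version__)
```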

BornInWater commented 5 years ago

It works on a different GPU (a K80). The exact versions required can't be installed on a Tesla V100.

xuluyue commented 5 years ago

@BornInWater Hello, could you share more details about how you trained this? Did you use it on cars?