NVlabs / PWC-Net

PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume, CVPR 2018 (Oral)

PWCNet Caffe training, loss not decreasing #60

Closed gengshan-y closed 5 years ago

gengshan-y commented 5 years ago

Hi, thanks for sharing your work.

I'm able to run your inference model and the results look good. However, I'm not able to train the PWC-Net Caffe model following the instructions.

I compiled Caffe from the official flownet2 repo and trained on FlyingChairs for 120k iterations, but the training loss does not decrease.

[image: training loss curve]

Moreover, the test result on KITTI is not reasonable: the standard deviation of the output is 0.02, which is too small.

[image: KITTI test output]

I also tried to train FlowNet2-CSS using the same Caffe tool, which gives reasonable results. Could you share some pointers on why the PWC-Net training does not work? Thanks!
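A quick sanity check for this failure mode is to look at the statistics of the predicted flow: a network stuck at a poor minimum tends to output a nearly constant field. A minimal sketch, assuming the prediction is already available as an H x W x 2 NumPy array (the threshold is an arbitrary choice, not from the repo):

```python
import numpy as np

def flow_looks_collapsed(flow, std_threshold=0.5):
    """Heuristic check for a collapsed (near-constant) flow prediction.

    flow: H x W x 2 array of predicted (u, v) displacements.
    The 0.5-pixel threshold is arbitrary; a healthy KITTI prediction
    usually has a standard deviation of several pixels, whereas the
    ~0.02 reported above indicates an almost constant output.
    """
    print("flow mean = %.4f, std = %.4f" % (flow.mean(), flow.std()))
    return float(flow.std()) < std_threshold
```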

gengshan-y commented 5 years ago

I think the issue might be "propagate_down: false" in train.prototxt. The loss goes down after removing these lines.

Lvhhhh commented 5 years ago

I think the issue might be "propagate_down: false" in train.prototxt. The loss goes down after removing these lines.

Do you mean the "propagate_down: false" in the DataAugmentation layer or in the "Downsample" layer?

gengshan-y commented 5 years ago

I was wrong about this. Removing "propagate_down: false" is not relevant to convergence. From their paper (https://arxiv.org/pdf/1809.05571.pdf):

However, we observe in our experiments that a deeper optical flow estimator might get stuck at poor local minima, which can be detected by checking the validation errors after a few thousand iterations and fixed by running from a different random initialization.

Have you tried changing the random seed or running it multiple times?
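For reference, a minimal pycaffe sketch of restarting training with a different seed, which is what "a different random initialization" amounts to in practice. This assumes training is driven from Python rather than the caffe binary, and the solver path is a placeholder; the same effect can be had by setting random_seed in solver.prototxt.

```python
import caffe

# Assumption: training is launched from pycaffe. Changing the seed (or simply
# rerunning without a fixed seed) gives a different random initialization,
# which the paper suggests when the flow estimator gets stuck in a poor
# local minimum.
caffe.set_mode_gpu()
caffe.set_random_seed(1234)                  # pick a different value per attempt

solver = caffe.SGDSolver('solver.prototxt')  # placeholder path
solver.solve()
```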

xianshunw commented 5 years ago

@gengshan-y what's your weighted loss in the last few steps?

gengshan-y commented 5 years ago

I did not go through the whole procedure. As I remember, the loss should decrease to around 20 after the first 100K iterations.

xianshunw commented 5 years ago

@gengshan-y Thanks for your reply. My loss does not decrease to a reasonable value. I have tried to train the net from scratch several times and always get a similar result. How do you initialize the parameters, and how do you set the random seed?

gengshan-y commented 5 years ago

Your loss looks correct to me. Does the output look meaningful, or is it all zeros? One trick I find useful for training is to normalise the features before correlation. After 2k iterations you should already get meaningful results, and then you can remove the normalisation layer.
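For concreteness, the normalisation trick above amounts to dividing each feature vector by its L2 norm over the channel axis before the correlation layer, so the cost-volume entries stay in a bounded range. A minimal NumPy sketch; the function name and tensor layout are illustrative, not from the repo:

```python
import numpy as np

def l2_normalize_features(feat, eps=1e-8):
    """Normalize a C x H x W feature map so that the feature vector at each
    spatial location has unit L2 norm over the channel axis.

    Applying this to both feature maps before the correlation/cost-volume
    layer was reported above to yield meaningful flow within ~2k iterations,
    after which the normalisation can be removed.
    """
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True)) + eps
    return feat / norm
```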

xianshunw commented 5 years ago

But after the first 100k iterations, my loss only decreases to around 50.

gengshan-y commented 5 years ago

I'm not sure about the exact number, but as long as your validation error on KITTI-15 matches the numbers in their paper after 1200k iterations, it should be fine.
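For checking against the paper, the usual validation metric is the average endpoint error (EPE), i.e. the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch; the mask argument is needed for datasets with sparse ground truth such as KITTI:

```python
import numpy as np

def average_epe(flow_pred, flow_gt, valid_mask=None):
    """Average endpoint error between two H x W x 2 flow fields."""
    epe = np.sqrt(((flow_pred - flow_gt) ** 2).sum(axis=2))
    if valid_mask is not None:          # restrict to valid ground-truth pixels
        epe = epe[valid_mask]
    return float(epe.mean())
```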

xianshunw commented 5 years ago

Thanks. I will test the model when the training is finished.

lelelexxx commented 5 years ago

@gengshan-y How did you solve your problem? I reimplemented an MXNet version of PWC-Net; it fits MPI-Sintel well, but it does not seem to converge on FlyingChairs.

gengshan-y commented 5 years ago

Before correlation, dividing each feature vector by its norm solved the problem for me.

lelelexxx commented 5 years ago

@gengshan-y Thanks a lot. I loaded the PyTorch-version weights into my MXNet model as initialization and the problem was solved. It is still odd, though, that when I load the PyTorch weights into MXNet, the outputs of the MXNet version of PWC-Net are extremely large.
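For anyone attempting the same cross-framework initialization, a hedged sketch of the idea: export the PyTorch state_dict to NumPy and copy it into the Gluon parameters. The name_map (Gluon parameter name to PyTorch key) is the model-specific part and is assumed to exist already; also note that a missing or extra flow scaling factor applied outside the network is a common cause of outputs that are off by a large constant factor.

```python
import mxnet as mx
import torch

def copy_torch_weights_to_gluon(torch_ckpt_path, gluon_net, name_map, ctx=mx.cpu()):
    """Copy matching tensors from a PyTorch checkpoint into a Gluon network.

    name_map: dict mapping Gluon parameter names to PyTorch state_dict keys.
    Shapes must already agree; any layer-specific reordering has to be
    handled when building the mapping. This is a sketch, not the exact
    procedure used in this thread.
    """
    state = torch.load(torch_ckpt_path, map_location='cpu')
    params = gluon_net.collect_params()
    for gluon_name, torch_name in name_map.items():
        arr = state[torch_name].cpu().numpy()
        params[gluon_name].set_data(mx.nd.array(arr, ctx=ctx))
```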

shimhyeonwoo commented 5 years ago

I have the same problem. I edited train.prototxt:

but it didn't converge, and the test EPE on the Chairs test set (labeled 2 in FlyingChairs_release_test_train_split.list) was 10.9164379646.

The last few lines of the training log:

```
W0710 06:02:02.937400 3664 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
I0710 06:02:29.094774 3642 solver.cpp:229] Iteration 1199100, loss = 159.089
I0710 06:02:29.094890 3642 solver.cpp:245] Train net output #0: loss2 = 14451.4 (* 0.005 = 72.2572 loss)
I0710 06:02:29.094897 3642 solver.cpp:245] Train net output #1: loss3 = 3581.35 (* 0.01 = 35.8135 loss)
I0710 06:02:29.094902 3642 solver.cpp:245] Train net output #2: loss4 = 879.013 (* 0.02 = 17.5803 loss)
I0710 06:02:29.094920 3642 solver.cpp:245] Train net output #3: loss5 = 213.887 (* 0.08 = 17.1109 loss)
I0710 06:02:29.094925 3642 solver.cpp:245] Train net output #4: loss6 = 51.0297 (* 0.32 = 16.3295 loss)
I0710 06:02:29.292071 3642 sgd_solver.cpp:106] Iteration 1199100, lr = 6.25e-06
I0710 06:03:09.780083 3642 solver.cpp:229] Iteration 1199200, loss = 175.056
I0710 06:03:09.780184 3642 solver.cpp:245] Train net output #0: loss2 = 15672 (* 0.005 = 78.3598 loss)
I0710 06:03:09.780203 3642 solver.cpp:245] Train net output #1: loss3 = 3902.47 (* 0.01 = 39.0247 loss)
I0710 06:03:09.780208 3642 solver.cpp:245] Train net output #2: loss4 = 969.49 (* 0.02 = 19.3898 loss)
I0710 06:03:09.780213 3642 solver.cpp:245] Train net output #3: loss5 = 240.549 (* 0.08 = 19.2439 loss)
I0710 06:03:09.780220 3642 solver.cpp:245] Train net output #4: loss6 = 59.499 (* 0.32 = 19.0397 loss)
I0710 06:03:09.977468 3642 sgd_solver.cpp:106] Iteration 1199200, lr = 6.25e-06
I0710 06:03:50.425305 3642 solver.cpp:229] Iteration 1199300, loss = 205.94
I0710 06:03:50.425384 3642 solver.cpp:245] Train net output #0: loss2 = 18363.5 (* 0.005 = 91.8175 loss)
I0710 06:03:50.425391 3642 solver.cpp:245] Train net output #1: loss3 = 4587.73 (* 0.01 = 45.8773 loss)
I0710 06:03:50.425415 3642 solver.cpp:245] Train net output #2: loss4 = 1144.02 (* 0.02 = 22.8803 loss)
I0710 06:03:50.425421 3642 solver.cpp:245] Train net output #3: loss5 = 284.788 (* 0.08 = 22.7831 loss)
I0710 06:03:50.425426 3642 solver.cpp:245] Train net output #4: loss6 = 70.5754 (* 0.32 = 22.5841 loss)
I0710 06:03:50.618796 3642 sgd_solver.cpp:106] Iteration 1199300, lr = 6.25e-06
W0710 06:04:11.786522 3664 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
I0710 06:04:31.033771 3642 solver.cpp:229] Iteration 1199400, loss = 124.741
I0710 06:04:31.033855 3642 solver.cpp:245] Train net output #0: loss2 = 11336.2 (* 0.005 = 56.681 loss)
I0710 06:04:31.033874 3642 solver.cpp:245] Train net output #1: loss3 = 2795.96 (* 0.01 = 27.9596 loss)
I0710 06:04:31.033879 3642 solver.cpp:245] Train net output #2: loss4 = 683.952 (* 0.02 = 13.679 loss)
I0710 06:04:31.033900 3642 solver.cpp:245] Train net output #3: loss5 = 166.498 (* 0.08 = 13.3198 loss)
I0710 06:04:31.033906 3642 solver.cpp:245] Train net output #4: loss6 = 40.9498 (* 0.32 = 13.1039 loss)
I0710 06:04:31.226580 3642 sgd_solver.cpp:106] Iteration 1199400, lr = 6.25e-06
W0710 06:04:52.813073 3642 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
I0710 06:05:11.693806 3642 solver.cpp:229] Iteration 1199500, loss = 168.663
I0710 06:05:11.693944 3642 solver.cpp:245] Train net output #0: loss2 = 15118.7 (* 0.005 = 75.5933 loss)
I0710 06:05:11.693969 3642 solver.cpp:245] Train net output #1: loss3 = 3764.21 (* 0.01 = 37.6421 loss)
I0710 06:05:11.693974 3642 solver.cpp:245] Train net output #2: loss4 = 933.177 (* 0.02 = 18.6635 loss)
I0710 06:05:11.693979 3642 solver.cpp:245] Train net output #3: loss5 = 231.306 (* 0.08 = 18.5045 loss)
I0710 06:05:11.693984 3642 solver.cpp:245] Train net output #4: loss6 = 57.0683 (* 0.32 = 18.2618 loss)
I0710 06:05:11.890733 3642 sgd_solver.cpp:106] Iteration 1199500, lr = 6.25e-06
I0710 06:05:52.241992 3642 solver.cpp:229] Iteration 1199600, loss = 161.913
I0710 06:05:52.242179 3642 solver.cpp:245] Train net output #0: loss2 = 14955.8 (* 0.005 = 74.779 loss)
I0710 06:05:52.242188 3642 solver.cpp:245] Train net output #1: loss3 = 3660.92 (* 0.01 = 36.6092 loss)
I0710 06:05:52.242209 3642 solver.cpp:245] Train net output #2: loss4 = 884.83 (* 0.02 = 17.6966 loss)
I0710 06:05:52.242214 3642 solver.cpp:245] Train net output #3: loss5 = 211.702 (* 0.08 = 16.9362 loss)
I0710 06:05:52.242233 3642 solver.cpp:245] Train net output #4: loss6 = 49.6698 (* 0.32 = 15.8944 loss)
I0710 06:05:52.438472 3642 sgd_solver.cpp:106] Iteration 1199600, lr = 6.25e-06
W0710 06:05:53.679702 3642 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
W0710 06:05:59.783059 3642 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
W0710 06:06:05.104701 3664 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
I0710 06:06:32.874910 3642 solver.cpp:229] Iteration 1199700, loss = 137.294
I0710 06:06:32.875035 3642 solver.cpp:245] Train net output #0: loss2 = 12414.5 (* 0.005 = 62.0726 loss)
I0710 06:06:32.875041 3642 solver.cpp:245] Train net output #1: loss3 = 3067.86 (* 0.01 = 30.6786 loss)
I0710 06:06:32.875046 3642 solver.cpp:245] Train net output #2: loss4 = 754.708 (* 0.02 = 15.0942 loss)
I0710 06:06:32.875053 3642 solver.cpp:245] Train net output #3: loss5 = 185.633 (* 0.08 = 14.8506 loss)
I0710 06:06:32.875077 3642 solver.cpp:245] Train net output #4: loss6 = 45.6276 (* 0.32 = 14.6008 loss)
I0710 06:06:33.069972 3642 sgd_solver.cpp:106] Iteration 1199700, lr = 6.25e-06
I0710 06:07:13.479297 3642 solver.cpp:229] Iteration 1199800, loss = 226.94
I0710 06:07:13.479460 3642 solver.cpp:245] Train net output #0: loss2 = 20334 (* 0.005 = 101.67 loss)
I0710 06:07:13.479467 3642 solver.cpp:245] Train net output #1: loss3 = 5067.48 (* 0.01 = 50.6748 loss)
I0710 06:07:13.479473 3642 solver.cpp:245] Train net output #2: loss4 = 1259.22 (* 0.02 = 25.1844 loss)
I0710 06:07:13.479478 3642 solver.cpp:245] Train net output #3: loss5 = 311.336 (* 0.08 = 24.9069 loss)
I0710 06:07:13.479486 3642 solver.cpp:245] Train net output #4: loss6 = 76.5835 (* 0.32 = 24.5067 loss)
I0710 06:07:13.675122 3642 sgd_solver.cpp:106] Iteration 1199800, lr = 6.25e-06
W0710 06:07:31.177048 3642 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
W0710 06:07:31.207374 3664 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
W0710 06:07:31.222362 3664 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
W0710 06:07:39.711684 3642 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
W0710 06:07:51.502991 3642 augmentation_layer_base.cpp:166] Augmentation: Exceeded maximum tries in finding spatial coeffs
I0710 06:07:54.136116 3642 solver.cpp:229] Iteration 1199900, loss = 94.6323
I0710 06:07:54.136143 3642 solver.cpp:245] Train net output #0: loss2 = 8697.5 (* 0.005 = 43.4875 loss)
I0710 06:07:54.136152 3642 solver.cpp:245] Train net output #1: loss3 = 2142.56 (* 0.01 = 21.4256 loss)
I0710 06:07:54.136175 3642 solver.cpp:245] Train net output #2: loss4 = 520.136 (* 0.02 = 10.4027 loss)
I0710 06:07:54.136183 3642 solver.cpp:245] Train net output #3: loss5 = 124.719 (* 0.08 = 9.97753 loss)
I0710 06:07:54.136205 3642 solver.cpp:245] Train net output #4: loss6 = 29.1918 (* 0.32 = 9.34139 loss)
I0710 06:07:54.333436 3642 sgd_solver.cpp:106] Iteration 1199900, lr = 6.25e-06
I0710 06:08:34.552665 3642 solver.cpp:456] Snapshotting to binary proto file flow_iter_1200000.caffemodel
I0710 06:08:34.735556 3642 sgd_solver.cpp:273] Snapshotting solver state to binary proto file flow_iter_1200000.solverstat
I0710 06:08:34.907498 3642 solver.cpp:318] Iteration 1200000, loss = 183.932
I0710 06:08:34.907518 3642 solver.cpp:323] Optimization Done.
I0710 06:08:35.175837 3642 caffe.cpp:222] Optimization Done.
```
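For readers parsing this log: the per-iteration loss is the weighted sum of the per-level losses, with weights 0.005, 0.01, 0.02, 0.08, and 0.32 for loss2 through loss6, matching the multi-scale training weights in the PWC-Net paper. A small check against the first logged iteration:

```python
# Per-level raw losses from iteration 1199100 above, and their weights.
level_losses = {'loss2': 14451.4, 'loss3': 3581.35, 'loss4': 879.013,
                'loss5': 213.887, 'loss6': 51.0297}
weights      = {'loss2': 0.005,   'loss3': 0.01,    'loss4': 0.02,
                'loss5': 0.08,    'loss6': 0.32}

total = sum(weights[k] * level_losses[k] for k in level_losses)
print(total)  # ~159.09, matching "loss = 159.089" in the log
```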

lhao0301 commented 4 years ago

I just hit the same problem. Training on FlyingChairs with the provided train.prototxt, the output is nearly all zeros. Would you mind giving any suggestions? @gengshan-y @shimhyeonwoo @xianshunw

gengshan-y commented 4 years ago

However, we observe in our experiments that a deeper optical flow estimator might get stuck at poor local minima, which can be detected by checking the validation errors after a few thousand iterations and fixed by running from a different random initialization.

This works for me, as suggested in their paper.

lhao0301 commented 4 years ago

However, we observe in our experiments that a deeper optical flow estimator might get stuck at poor local minima, which can be detected by checking the validation errors after a few thousand iterations and fixed by running from a different random initialization.

This works for me, as suggested in their paper.

Thanks! It seems a little like magic.