NVlabs / PWC-Net

PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume, CVPR 2018 (Oral)

Warping with ground truth flow instead of predicted flow #51

Closed: MrRoboticist closed this issue 5 years ago

MrRoboticist commented 5 years ago

Hi,

My question is about the operation used to warp the second image towards the first image. In your work, you use the upsampled flow prediction from the pyramid level above to warp the image at the level below. However, won't this pass incorrect training signals during the training phase?

For instance, consider the following. Focus on one pixel (say A) in image 1; we are looking for the pixel in image 2 that should be warped to A's location. Let C be A's true match (that is, the pixel at A in image 1 has moved to C in image 2). However, during training (because our flow predictions are not perfect), the upsampled flow prediction takes us to a different point B as the point to be warped. [image: warping problem] Let a·b denote the correlation between the feature vectors at A and B. Now we compute the cost-volume entry as a·b, and this is used to predict the flow. The training signal that goes back to the flow estimator network and the feature extractor network is incorrect, as the gradient is computed using the wrong input (a·b); it would be the correct signal if the cost volume were computed from a·c.
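To make the setup concrete, here is a rough sketch of the warp-then-correlate step I'm referring to, along the lines of common PyTorch ports of PWC-Net (the function names and shapes here are my own illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F

def warp(feat2, flow):
    """Backward-warp feat2 (B, C, H, W) by flow (B, 2, H, W) with bilinear sampling."""
    B, _, H, W = feat2.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat2.device),
        torch.arange(W, device=feat2.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()   # (2, H, W) pixel grid
    coords = base[None] + flow                    # where each pixel samples in image 2
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(feat2, grid, align_corners=True)

def corr_center(feat1, feat2_warped):
    """One cost-volume entry (zero displacement): the per-pixel dot product a·b.
    The full cost volume repeats this over a window of displacements, e.g. 9x9."""
    return (feat1 * feat2_warped).sum(dim=1, keepdim=True)
```

If the upsampled flow is off by even a pixel, `feat2_warped` at A holds the features around B rather than C, so the whole cost volume, a·b included, is computed around the wrong center.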

In essence, this is a problem because the warping layer backpropagates only subgradients (as in the Spatial Transformer Networks paper) rather than the full gradients. Thus, the flow estimator network updates its weights assuming its input is a·b, whereas the correct input should have been a·c. This problem is also described in Section 3 of the paper Occlusion Aware Unsupervised Learning of Optical Flow, under the heading "Backward warping with a larger search space".
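As a 1-D toy illustration of the subgradient issue (a self-contained sketch, nothing to do with this repo's code): the gradient of a linearly interpolated sample with respect to the sampling location is just the difference of the two neighboring pixels, so it is blind to a better match lying outside the immediate neighborhood.

```python
import torch

# A 1-D "image" whose true match sits far from where the predicted flow points.
img = torch.tensor([0.0, 0.0, 1.0, 0.0, 0.0, 5.0])  # true match (peak) at index 5
x = torch.tensor([1.6], requires_grad=True)          # predicted sampling location

# Linear interpolation between floor(x) and floor(x) + 1, as a bilinear sampler does.
x0 = x.floor().long()
w = x - x0.float()
val = (1 - w) * img[x0] + w * img[x0 + 1]
val.backward()

print(x.grad)  # tensor([1.]) == img[2] - img[1]: a purely local slope,
               # carrying no information about the much better match at index 5
```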

That paper also proposes a different warping mechanism to pick a pixel as close to the true warp target as possible. However, I was thinking that to solve this problem, can't we just use the ground-truth flow to warp the image while training? This should not make a difference because the warping layer has no trainable weights, so the network will still be trained end-to-end. More importantly, warping with the ground truth ensures that the error and gradient signals are computed using the true warped point, so the training signal propagated back will be correct. What are your thoughts on this?
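Concretely, the change I have in mind would look roughly like this inside one pyramid level (reusing the `warp` helper sketched above; `correlation` and `flow_estimator` are hypothetical stand-ins for the level's existing pieces, not names from this repo):

```python
import torch.nn.functional as F

def downsample_flow(gt_flow, size):
    """Resize full-resolution ground-truth flow to this level, rescaling its values."""
    scale = size[-1] / gt_flow.shape[-1]
    return F.interpolate(gt_flow, size=size, mode="bilinear", align_corners=True) * scale

def level_forward(feat1, feat2, flow_up, gt_flow=None):
    # Proposed scheme: during training, warp with the (downscaled) ground truth;
    # at inference no ground truth exists, so fall back to the upsampled prediction.
    warp_flow = downsample_flow(gt_flow, feat2.shape[-2:]) if gt_flow is not None else flow_up
    feat2_w = warp(feat2, warp_flow)              # warp() itself has no trainable weights
    cost = correlation(feat1, feat2_w)            # cost volume around the warp point
    return flow_estimator(cost, feat1, flow_up)   # trainable parts still receive gradients
```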

jrenzhile commented 5 years ago

@MrRoboticist While I'm not an author of PWC-Net, I'm aware of the paper from Baidu, and I do think this is a great observation. It would be nice to implement your proposed method and benchmark it on public datasets; you might want to refer to the longer journal version of the paper for more details on the training protocols: https://arxiv.org/pdf/1809.05571.pdf

deqings commented 5 years ago

That's a very good observation about a limitation of the PWC-Net architecture, i.e., that ground truth may not be covered by the search range. This happens when you have very small, fast-moving objects, like a tennis ball or soccer ball. However, I suspect that warping with ground truth during training would solve the problem because the ground truth is not available during the test phase. You have to use the upsampled flow during the test phase. Making the training and test protocols consistent would give the optimal performance in an end-to-end setting (assuming no over-fitting issues). My guess is that for most cases, C and B are relatively close, and a 9x9 search window would cover C during refinement at each pyramid level.
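For reference on the numbers: the 9x9 window comes from the maximum correlation displacement of d = 4 pixels per level, i.e., (2*4 + 1)^2 = 81 candidates. Because flow is estimated on a downsampled pyramid, a 4-pixel range at a coarse level covers a large motion at full resolution. A quick back-of-the-envelope check (assuming the usual setup where flow is estimated from level 6 down to level 2):

```python
# Full-resolution displacement covered by a d = 4 search range at each
# pyramid level (level l is downsampled by a factor of 2**l).
d = 4  # max displacement per level -> (2*d + 1)**2 = 81 = 9x9 window
for level in range(2, 7):
    scale = 2 ** level
    print(f"level {level}: covers +/- {d * scale} px at full resolution")
```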

MrRoboticist commented 5 years ago

Thank you for sharing your thoughts.

However, I suspect that warping with ground truth during training would solve the problem because the ground truth is not available during the test phase.

Did you mean warping with ground truth will not solve the problem?

You have to use the upsampled flow during the test phase.

Agreed, there is no substitute for the upsampled flow prediction when warping during inference.

Making the training and test protocols consistent would give the optimal performance in an end-to-end setting (assuming no over-fitting issues).

Perhaps I'm missing something, but I don't see how this would affect overfitting (or the inference pipeline at all). It is just a means to speed up convergence (by dropping incorrect gradient signals) during the training phase.

My guess is that for most cases, C and B are relatively close, and a 9x9 search window would cover C during refinement at each pyramid level.

Agreed again. I suspect that this is the case in practice, which is why PWC-Net does converge to a low error.

But coming back to this:

... that ground truth may not be covered by the search range.

My point is slightly different, and it's what I wanted your opinion on: warping with intermediate predictions leads to incorrect training signals even when the ground-truth point is within the search range, so long as the predicted flow is not perfect (a scenario which is very likely during training). One way this could be tackled is by backpropagating the full gradient of the warping operation.

But since PWC-Net works with subgradients, this problem is not addressed: the loss (and subsequently the derivative) is computed w.r.t. point B when it should have been computed w.r.t. point C. This makes training take longer and perhaps even costs some accuracy. It is this problem that I think can be overcome by warping with the ground truth, which in my opinion should not affect the test pipeline, mainly because the warping layer has no trainable weights.

That is, we can use the ground-truth flow to warp the image during training, which ensures the flow predictions themselves get better. Later, during inference, we use these trained flow predictions to warp the images; the warping mechanism itself is unchanged, so it works exactly as it did during training.