Closed. AIRedWood closed this issue 5 years ago.
Or "Two-Frame Motion Estimation Based on Polynomial Expansion"?
@AIRedWood The original VESPCN paper, and the CVPR17 paper "Detail-revealing Deep Video Super-Resolution", whose testing model is open-sourced.
Actually, most modern flow-estimation networks (such as FlowNet 1/2, PWC-Net, LiteFlowNet, ...) never use "normalized" flow; they always regress absolute pixel values.
Perhaps it's unwise to constrain the movement to just 1 pixel, because that limit is quite unnatural; combining ideas from FlowNet2 or PWC-Net may be a possible way forward.
I read the FRVSR paper and found that its FNet is just a plain stack of convolutions ending in a `tanh`, whereas VESPCN uses a coarse-to-fine warping technique, and I don't know which one is better.
I'll look at the FlowNet and PWC-Net papers later.
And I'm thinking about a problem. VESPCN interprets the `tanh` output in [-1, 1] as displacement in normalized space, so that ±1 corresponds to W (the picture size). But using the [-1, 1] output directly effectively sets W = 1, meaning the maximum displacement cannot exceed 1 pixel. So it should be possible to introduce a parameter x with 1 < x < W and scale the output to x·[-1, 1], so that the maximum displacement does not exceed x, where x can be 2, 3, ....
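A minimal NumPy sketch of that idea (the function name and the `max_disp` parameter are hypothetical, standing in for the proposed x):

```python
import numpy as np

def scaled_tanh_flow(raw_flow, max_disp):
    """Scale a tanh activation so flow lies in [-max_disp, max_disp] pixels.

    raw_flow: pre-activation output of the flow sub-network's last conv layer.
    max_disp: the proposed parameter x (1 < x < W), e.g. 2 or 3.
    """
    return max_disp * np.tanh(raw_flow)

# With max_disp = 3, displacement is bounded by 3 pixels instead of 1.
flow = scaled_tanh_flow(np.array([-10.0, 0.0, 10.0]), max_disp=3)
```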
@LoSealL
@AIRedWood Let's discuss two cases:
In many deep-learning optical-flow networks, there's a major benchmark, Sintel. Currently, the top methods output optical flow in a pyramidal way, and their last level is the full or half resolution of the flow. In detail, the final output layer is usually a `conv2d` without activation, and the objective function is mainly EPE (end-point error, similar to L2). Their outputs represent absolute movement in pixels along the two spatial directions.
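As a concrete illustration of EPE on absolute flow (a generic sketch, not any particular paper's implementation):

```python
import numpy as np

def epe(flow_pred, flow_gt):
    """End-point error: mean Euclidean distance between predicted and
    ground-truth flow vectors, both in absolute pixels, shape (H, W, 2)."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))

pred = np.zeros((4, 4, 2))
gt = np.ones((4, 4, 2))  # every vector is off by (1, 1)
err = epe(pred, gt)      # each pixel contributes sqrt(2), so the mean is sqrt(2)
```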
Now let's look at VSR's optical-flow sub-networks. In early work such as VESPCN, a small net is used; the last layer is a `conv2d` with `tanh` activation, which maps outputs to (-1, 1). The same architecture is used in SPMC (ICCV17) and FRVSR (CVPR18). The difference is that only in VESPCN did the author explicitly interpret the meaning of `tanh` (Section 2.3, last sentence of the 3rd paragraph):
> Output activations use tanh to represent pixel displacement in normalised space, such that a displacement of ±1 means maximum displacement from the center to the border of the image.
The other papers didn't mention that. In the released SPMC model, I found they warp with the flow directly, without denormalizing it, so they treat the `tanh` output as a constraint limiting movement to no more than 1 pixel in LR space.
It's worth noting that a 1-pixel limit in LR space means a 4-pixel limit in HR space (for x4 up-scaling), so I think it's not that bad, since 4 pixels in HR space is an acceptably slow movement; that's why you won't find very fast motion in VSR benchmarks.
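To make the two interpretations concrete, here is a sketch of the denormalization step that would map VESPCN's stated convention back to pixels (the function is hypothetical; skipping it, as the released SPMC model does, turns ±1 into a ±1-pixel limit):

```python
import numpy as np

def denormalize_flow(flow_norm, height, width):
    """Map tanh-normalized flow in [-1, 1] to absolute pixel displacement,
    following VESPCN's convention (±1 = center-to-border of the image)."""
    flow_px = flow_norm.copy()
    flow_px[..., 0] *= width / 2.0   # horizontal component
    flow_px[..., 1] *= height / 2.0  # vertical component
    return flow_px

# A normalized value of 1.0 on a 64x64 LR frame becomes a 32-pixel shift;
# warping with the raw tanh output instead treats it as a 1-pixel shift.
flow = np.full((64, 64, 2), 1.0)
flow_px = denormalize_flow(flow, 64, 64)
```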
Back to your idea: coarse-to-fine regression is a good strategy, but be aware that the spatial transformer network in VESPCN is only 10 layers deep; for comparison, FRVSR has 14 CNN layers and PWC-Net has more than 50. My point: extending the pixel-displacement limit is fine, but the capacity of the flow network should be increased accordingly.
Hello, I tested how the `loss` and `loss_me` curves change over iterations with `normalized` set to True or False, and found that with `normalized=False` the model converges significantly faster than with True. It's really surprising, so I want to explore the reason in detail. Is the reference paper you mentioned in issue 10 "High Accuracy Optical Flow Estimation Based on a Theory for Warping"?