About the use of a parallel criterion

The question is not easy to answer. What is certain is that convergence is worse when only the last loss is used. One interpretation (not a rigorous explanation) is that each flow generator layer deals with a particular scale, and therefore a particular frequency band. The low-scale layers are designed for large displacements, so the high-resolution layer only has to deal with high-frequency shapes and flow values. A flow map upscaled from a lower scale acts as a canvas to which the upper layer adds values for fine-tuning, instead of inferring the whole flow map at once.
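To make the comparison concrete, here is a minimal NumPy sketch of a multi-scale loss next to a last-layer-only loss. It is not the code from this repository. The function names, the average-pooling downsampling of the target, and the per-scale weights are all assumptions chosen for illustration. The idea is the same as a parallel criterion: one loss per scale, combined as a weighted sum.

```python
import numpy as np

def downsample(flow, factor):
    """Average-pool a (2, H, W) flow map by an integer factor.

    Assumption: flow values are kept in full-resolution pixel units.
    """
    c, h, w = flow.shape
    return flow.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def epe(pred, target):
    """Mean endpoint error: average Euclidean distance between flow vectors."""
    return float(np.sqrt(((pred - target) ** 2).sum(axis=0)).mean())

def multiscale_loss(preds, target, weights):
    """Weighted sum of per-scale EPE; preds are ordered from finest to coarsest."""
    total = 0.0
    for pred, weight in zip(preds, weights):
        factor = target.shape[1] // pred.shape[1]
        total += weight * epe(pred, downsample(target, factor))
    return total

# Toy data: a random 16x16 ground-truth flow and all-zero predictions at 3 scales.
rng = np.random.default_rng(0)
target = rng.normal(size=(2, 16, 16))
preds = [np.zeros((2, 16 // f, 16 // f)) for f in (1, 2, 4)]
weights = [1.0, 0.5, 0.25]  # hypothetical weights, not taken from the repository

full_loss = multiscale_loss(preds, target, weights)        # every scale is supervised
last_only = multiscale_loss(preds[:1], target, weights[:1])  # finest scale only
```

With `last_only`, the coarse layers receive gradient only through the finest output, whereas `full_loss` gives each scale its own target, which fits the interpretation above.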

I would not be surprised if something simpler that does not require a multi-scale loss could outperform this technique, for example GAN training that teaches the network to identify shapes in order to get a better flow map, as has been done for semantic segmentation.

ClementPinard / FlowNetTorch
