Closed by ptriantd 7 years ago
The question is not easy to answer. What is certain is that convergence is worse when using only the last loss. One interpretation (not a rigorous explanation) is that each flow generator layer deals with a particular scale, and therefore a particular frequency. The low-scale layers are designed for large displacements, so the high-resolution layer only deals with high-frequency shapes and flow values. Having a flow map upscaled from a lower scale acts as a canvas to which the upper layer adds values for fine-tuning, instead of inferring the whole flow map at once.
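To make the idea concrete, here is a minimal sketch of a multi-scale loss in plain Python. It is not the code from this repository: the names `downsample`, `epe` and `multiscale_loss`, the average-pooling of the target, the choice not to rescale flow values when downsampling, and the weights are all assumptions made for illustration. Each prediction is compared with the ground truth reduced to its own resolution, and the per-scale errors are summed with weights. Training with only the last loss corresponds to passing a single prediction and a single weight.

```python
import math


def downsample(flow, factor):
    """Average-pool an H x W grid of (u, v) vectors by an integer factor.

    Assumption: flow values are averaged but not rescaled by the factor.
    """
    h, w = len(flow) // factor, len(flow[0]) // factor
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [flow[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            n = len(block)
            row.append((sum(u for u, _ in block) / n,
                        sum(v for _, v in block) / n))
        out.append(row)
    return out


def epe(pred, target):
    """Mean endpoint error: average Euclidean distance between flow vectors."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for (pu, pv), (tu, tv) in zip(pred_row, target_row):
            total += math.hypot(pu - tu, pv - tv)
            count += 1
    return total / count


def multiscale_loss(preds, target, weights):
    """Weighted sum of per-scale errors; each pred may have its own resolution."""
    loss = 0.0
    for pred, weight in zip(preds, weights):
        # Reduce the full-resolution target to the resolution of this prediction.
        factor = len(target) // len(pred)
        scaled_target = target if factor == 1 else downsample(target, factor)
        loss += weight * epe(pred, scaled_target)
    return loss


# Toy data: a constant 4x4 ground-truth flow and two all-zero predictions.
target = [[(3.0, 4.0)] * 4 for _ in range(4)]
fine = [[(0.0, 0.0)] * 4 for _ in range(4)]      # full-resolution output
coarse = [[(0.0, 0.0)] * 2 for _ in range(2)]    # half-resolution output

print(multiscale_loss([fine, coarse], target, [1.0, 0.5]))  # all scales supervised
print(multiscale_loss([fine], target, [1.0]))               # last loss only
```

With several scales in the sum, the coarse output receives its own error signal directly, instead of only whatever gradient reaches it through the finer layers.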
I would not be surprised if something simpler that does not require a multi-scale loss (say, GAN training that would teach the network to identify shapes in order to get a better flow map, as has been used here for semantic segmentation) could outperform this technique.
This question is not about the implementation but about the concept behind it. Why is a parallel criterion needed? Isn't it enough to train the network using only the last loss?