Hi, I think there is an issue in the SSF model implementation that prevents the model from reaching the expected R-D results. In the code, https://github.com/InterDigitalInc/CompressAI/blob/743680befc146a6d8ee7840285584f2ce00c3732/compressai/models/video/google.py#L354-L371 the estimated flow is passed directly to the F.grid_sample function. That would be fine if the grid were in absolute (pixel) coordinates, but grid_sample actually expects relative coordinates, where the top-left corner is [-1, -1] and the bottom-right corner is [1, 1].
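To illustrate the mismatch (illustrative numbers, not CompressAI code): because grid_sample coordinates span [-1, 1] across the whole tensor, the same relative flow value corresponds to very different pixel displacements at different resolutions.

```python
def pixel_shift(rel_flow_x, width):
    """Pixel displacement implied by a relative (grid_sample) flow value.

    With align_corners=True, a relative displacement d along x moves the
    sampling point by d * (width - 1) / 2 pixels.
    """
    return rel_flow_x * (width - 1) / 2

# The same relative flow of 0.1 means very different motions:
pixel_shift(0.1, 256)   # 12.75 px on a (256, 256) training patch
pixel_shift(0.1, 1920)  # 95.95 px across a padded 1080p frame
```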
This causes problematic behavior: during training, flow is estimated on (256, 256) patches, but during evaluation it is estimated on (1152, 1920) frames (including padding), so the same relative flow values produce much larger pixel displacements than intended.
A quick workaround is to apply a weighting that accounts for the train-test size change, like this: https://github.com/sybahk/CompressAI/commit/df138a92f8a59311cc581ef1580a24e40e1ae986
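The idea behind the workaround, roughly (a hypothetical `rescale_flow` helper for illustration, not the exact code from the linked commit): scale the relative flow by the train/test size ratio so the implied pixel displacement stays the same.

```python
import torch

def rescale_flow(flow, train_size=(256, 256)):
    # flow: (N, 2, H, W) relative (grid_sample) flow, implicitly calibrated
    # to train_size patches. Rescale so that the pixel displacements are
    # preserved at the current (test) resolution.
    _, _, H, W = flow.shape
    scale = flow.new_tensor([train_size[1] / W, train_size[0] / H])
    return flow * scale.view(1, 2, 1, 1)
```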
I then ran the evaluation with this command:
python3 -m compressai.utils.video.eval_model pretrained $UVG_PATH outputs -a ssf2020 -q 1,2,3,4 -o ssf2020-mse-ans-vimeo-modified.json
https://github.com/sybahk/CompressAI/commit/b9f56100319bc61913623b0f8b0a818a822dbbdc

With this workaround applied, the model's R-D curve improves substantially and closely matches the authors' results.
python3 -m compressai.utils.video.plot -f results/video/UVG-1080p/ssf* -o outputs/fig.png
(ssf2020-mse is the run that uses the workaround.)

For pretrained models, applying the workaround is fine, but when training a new model I think we should take the training-time input size into account, as DCVC does: https://github.com/microsoft/DCVC/blob/4df94295c8dbe0a26456582d1a0eddb3465f1597/DCVC-TCM/src/models/video_net.py#L83-L94
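A sketch of that approach, assuming the motion network outputs flow in pixel units: build the sampling grid and normalize it by the actual input size at runtime, so warping is resolution-independent (this follows the pattern of the linked DCVC code, not a verbatim copy).

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    # frame: (N, C, H, W); flow: (N, 2, H, W) in pixels (dx, dy).
    N, _, H, W = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=frame.dtype),
        torch.arange(W, dtype=frame.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)  # (1, 2, H, W), in pixels
    coords = base + flow                              # absolute target coords
    # Normalize with the *current* H and W, not the training patch size.
    grid_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)      # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True, padding_mode="border")
```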