Issue opened by ztrobertyang 2 years ago
Hi, I am not one of the authors of this paper, but it seems you misunderstand the idea that motivates implicit representations, and in particular implicit representations for video inpainting.
Long story short, this is a zero-shot task: there is no "training" and "testing" in the traditional ML sense. The model receives as input the frames of a single video together with some masks (as few as one mask!) indicating the regions to inpaint. It then iterates on those frames, learning how to inpaint that specific video, until convergence, i.e. until it produces a satisfactory result. It does not need general visual priors from expensive pretraining; that is not even desirable here. As an implicit method, the visual priors the network learns from the video of interest are sufficient.
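To make the workflow concrete, here is a minimal sketch of what this per-video ("internal learning") optimization looks like. This is my own illustrative PyTorch code, not the authors' implementation; `TinyInpaintNet`, the L1 loss, and the hyperparameters are placeholder assumptions standing in for whatever the repo actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyInpaintNet(nn.Module):
    """Hypothetical stand-in for the paper's network: a small conv stack."""
    def __init__(self, channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

def inpaint_single_video(frames, masks, num_iters=5000, lr=1e-4):
    """frames: (T, C, H, W) video tensor; masks: (T, 1, H, W), 1 = hole."""
    net = TinyInpaintNet()                    # randomly initialized, no pretraining
    opt = torch.optim.Adam(net.parameters(), lr=lr)

    for _ in range(num_iters):
        pred = net(frames * (1 - masks))      # network only sees masked input
        # Reconstruction loss is computed ONLY on the known (unmasked) pixels;
        # the holes get filled by the priors the network picks up from this
        # one video -- this is the whole "zero-shot" point.
        loss = F.l1_loss(pred * (1 - masks), frames * (1 - masks))
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        out = net(frames * (1 - masks))
    # Composite: keep the known pixels, take predictions inside the holes.
    return frames * (1 - masks) + out * masks
```

So "training" and "inference" happen on the same video by design: the optimization loop *is* the inpainting procedure, and its converged output is the final result.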
Hi, thanks for providing the code. Looking at it, I see that you train on one video and then run inference on that same video. I think this is tricky: the CNN should be trained on multiple videos and then evaluated on a different video. Training and testing on the same video will of course produce a good result.
I am trying to understand your model. Is it something like this: the input video has a foreground object and its mask, and you provide another mask for augmentation; then, after training, the output video has the foreground removed and the background redrawn. Am I right?
Can I train your model on multiple videos and then run inference on a different one? For example, can I train on 10 different videos and then run inference on an unseen video? What would happen at inference? And how can I train the model on multiple videos?