ali-vilab / videocomposer

Official repo for VideoComposer: Compositional Video Synthesis with Motion Controllability
https://videocomposer.github.io
MIT License

Video inpainting using a sequence of masks #40

Open AHHHZ975 opened 10 months ago

AHHHZ975 commented 10 months ago

Hi Shiwei, @Steven-SWZhang

Thank you for publicly releasing this great work. I have been trying to reproduce the results for the "video inpainting using a sequence of masks" task. Specifically, I have a video of 10 frames and 10 masks corresponding to those frames. I would like to feed the video, the sequence of masks, and a text prompt to the model, and get a temporally consistent output video that adheres to both the mask sequence and the text prompt.

However, I could not find any argument for an input mask. Going through the code, it seems that the code itself generates a random mask over the input video. The snippet below (from inference_single.py) illustrates this:

[screenshot: masking code in inference_single.py]

where the function `make_masked_images` is defined as:

[screenshot: definition of `make_masked_images`]

As far as I can tell, the `mask` variable on line 564 of the first screenshot is initialized from the `batch` variable (which comes from the dataloader), as shown below:

[screenshot: `mask` initialized from `batch`]

When I went through dataset.py, I found that the mask is generated randomly, as follows:

[screenshot: random mask generation in dataset.py]

My understanding is therefore that the model is only conditioned on this randomly generated mask. If that is correct, does it mean we cannot feed an external sequence of masks to the model? If my understanding is wrong, I would appreciate an explanation of how to feed a sequence of masks to the model, as I could not find anything in the code.
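For reference, here is a minimal NumPy sketch (not the repo's code; `stack_mask_sequence` and `make_masked_images_np` are hypothetical names) of the shapes an external mask sequence would need to match, mirroring what `make_masked_images` computes:

```python
import numpy as np

def stack_mask_sequence(frame_masks):
    # Stack per-frame 2D masks (nonzero = masked) into shape (f, 1, h, w),
    # the slot the dataloader's randomly generated mask currently occupies.
    return np.stack([(m > 0).astype(np.float32)[None] for m in frame_masks], axis=0)

def make_masked_images_np(imgs, masks):
    # NumPy mirror of make_masked_images: imgs is (b, f, c, h, w),
    # masks is (b, f, 1, h, w); masked pixels are zeroed and the keep-map
    # (1 - mask) is appended as an extra channel -> (b, f, c + 1, h, w).
    out = [np.concatenate([imgs[i] * (1 - m), (1 - m)], axis=1)
           for i, m in enumerate(masks)]
    return np.stack(out, axis=0)
```

If this is the right shape contract, swapping the random mask for `stack_mask_sequence(...)` inside the dataset's `__getitem__` would be the natural place to inject external masks.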

Thank you in advance for putting time into this case.

Kind Regards, Amir

AHHHZ975 commented 10 months ago

Hello, I would appreciate it if someone could help me with this issue, please. Best, Amir

Zeldalina commented 7 months ago

I have the same question and would appreciate any guidance.

InkosiZhong commented 3 weeks ago

I have tried to support a customized mask sequence by modifying the implementation of `__getitem__` in `VideoDataset`. However, I find the behavior of `make_masked_images` somewhat odd.

```python
def make_masked_images(imgs, masks):
    masked_imgs = []
    for i, mask in enumerate(masks):
        # zero out masked pixels and append (1 - mask) as an extra channel
        masked_imgs.append(torch.cat([imgs[i] * (1 - mask), (1 - mask)], dim=1))
    return torch.stack(masked_imgs, dim=0)
```

```python
# inference_single.py, lines 562-564
if 'mask' in cfg.video_compositions:
    masked_video = make_masked_images(misc_data.sub(0.5).div_(0.5), mask)
    masked_video = rearrange(masked_video, 'b f c h w -> b c f h w')
```

It first normalizes the video sequence to $[-1, 1]$, and then `make_masked_images` sets the masked pixels to $0$. Shouldn't we multiply by the mask first and then normalize? Is this a deliberate design or a bug?
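To make the difference concrete, a tiny numeric check (NumPy, outside the repo's code) for a fully masked pixel:

```python
import numpy as np

pixel = np.float32(0.8)  # original intensity in [0, 1]
mask = np.float32(1.0)   # 1 = masked

# Repo's order: normalize to [-1, 1], then zero out.
# Masked pixels land at 0, i.e. the normalized mean (mid-gray).
norm_then_mask = ((pixel - 0.5) / 0.5) * (1 - mask)

# Alternative order: zero out first, then normalize.
# Masked pixels land at -1 (black).
mask_then_norm = (pixel * (1 - mask) - 0.5) / 0.5

print(norm_then_mask, mask_then_norm)  # 0.0 -1.0
```

Zeroing after normalization leaves masked pixels at the normalized mean rather than at black; some inpainting pipelines do this deliberately so the masked region carries no bias toward any intensity, but only the authors can confirm whether that was the intent here.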