NVlabs / imaginaire

NVIDIA's Deep Imagination Team's PyTorch Library

[wc_vid2vid] Style in the result suddenly changes #47

Closed lizuoyue closed 3 years ago

lizuoyue commented 3 years ago

Thanks for sharing the code of this amazing work. I have been using it to train a model on our own dataset, but I am now facing some problems.

I created a small training set of around 50 samples and tested on the training set, just to learn how to run the code and to make the model overfit. Training starts from the provided pre-trained checkpoint. The generated video is overall good, except for the first two frames.

[Attached images: stuttgart_00_000000_000000_leftImg8bit through stuttgart_00_000000_000003_leftImg8bit]

These are the first 4 frames of a generated video. The 3rd and 4th frames fit the ground truth very well, but the 1st and 2nd frames look as if they were generated by the original checkpoint (model).

Here are my questions:

(1) Does anyone have an idea why this happens? I modified the code slightly, and I am not sure whether that is the reason. What I changed is line 61 of https://github.com/NVlabs/imaginaire/blob/master/imaginaire/generators/wc_vid2vid.py: originally, Python raised an error because self.gen_cfg.single_image_model has no attribute checkpoint. The config file https://github.com/NVlabs/imaginaire/blob/master/configs/projects/wc_vid2vid/cityscapes/seg_ampO1.yaml indeed does not define a checkpoint attribute under single_image_model either, so I simply set load_weights = False.
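For reference, my edit amounts to roughly the following (a sketch of my change only; the surrounding code at line 61 may differ from what is shown, and the getattr alternative is just a suggestion):

```python
# Original behaviour (roughly): the code read self.gen_cfg.single_image_model.checkpoint,
# which raises an AttributeError because seg_ampO1.yaml defines no `checkpoint` key there.
# My workaround: never load separate single-image-model weights.
load_weights = False

# A more defensive alternative would be to load only when the key actually exists:
# load_weights = getattr(self.gen_cfg.single_image_model, 'checkpoint', None) is not None
```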

(2) Also, it seems that the final generated output does not use (copy) the original colors from the guidance image at all; instead, the network generates the color of every pixel itself. Am I wrong?

(3) Are depth images required for training/testing? My dataset does not have depth images, but the code runs without errors.

I look forward to your reply. Thank you very much, and happy new year.

arunmallya commented 3 years ago

Hi,

Let me explain how exactly the method is supposed to work (see the sketch after this list):

i) The network uses 3 conditioning maps: segmentation, the optical-flow-warped previous output, and guidance maps. (Note that the https://github.com/NVlabs/imaginaire/blob/master/configs/projects/wc_vid2vid/cityscapes/seg_ampO1.yaml config does not use depth maps.) At time T, the optical flow is predicted from the frames generated at times T-1 and T-2.

ii) For the first two frames, we do not have optical flow predictions, and for the first frame, we do not have guidance maps. We therefore use a pretrained model that predicts output images from segmentation maps only for the first two frames (this is referred to as single_image_model). For every following frame, we use all 3 conditioning maps.

iii) In our final checkpoint, the single_image_model is packaged into the provided checkpoint, while during training we loaded it from a different location.
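In pseudocode, the per-frame behaviour is roughly the following (a simplified sketch; the function and argument names are placeholders, not the actual attributes in wc_vid2vid.py):

```python
def generate_video(segmentation_maps, guidance_maps, single_image_model,
                   video_model, flow_network, warp):
    """Illustrative per-frame logic of wc_vid2vid inference (names are placeholders)."""
    outputs = []
    for t, seg_map in enumerate(segmentation_maps):
        if t < 2:
            # No flow-warped previous output yet (and no guidance map at t = 0):
            # the pretrained single-image model generates from segmentation alone.
            frame = single_image_model(seg_map)
        else:
            # Predict flow from the two previously generated frames, warp the last
            # output, and condition the full model on all three maps.
            flow = flow_network(outputs[t - 1], outputs[t - 2])
            warped_prev = warp(outputs[t - 1], flow)
            frame = video_model(seg_map, warped_prev, guidance_maps[t])
        outputs.append(frame)
    return outputs
```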

As for your questions:

(1) I'm not sure whether you finetuned the single_image_model. If you did not, it is still the one provided with the checkpoint, and I believe that is the case here: the first two images you showed look like they are in the "cityscapes" style.

(2) The model conditioned on the 3 maps probably just overfits to your dataset after finetuning. At that point, it can choose to completely ignore the guidance maps.

(3) The config used does not need depth maps.
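If you want to verify whether the single-image branch was actually updated by your finetuning, a quick diff between the two checkpoints would look roughly like this (a sketch; it assumes the checkpoints store the generator weights under a 'net_G' key and that the single-image network's parameter names contain 'single_image_model', which you may need to adjust to your files):

```python
import torch

# Hypothetical paths; substitute your own checkpoint files.
pretrained = torch.load('pretrained.pt', map_location='cpu')
finetuned = torch.load('finetuned.pt', map_location='cpu')

# Assumption: generator weights live under 'net_G' and the single-image network's
# parameters carry a 'single_image_model' prefix. Adjust the keys to your checkpoints.
for key, w_old in pretrained['net_G'].items():
    if 'single_image_model' in key:
        w_new = finetuned['net_G'][key]
        if not torch.equal(w_old, w_new):
            print(f'{key} changed during finetuning')
            break
else:
    print('single_image_model weights appear unchanged')
```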

If you are loading our pretrained checkpoint and finetuning it, single_image_model will not be None, and the error you received should not occur; otherwise the same error would show up in our eval scripts as well. So I suspect that one of your edits set single_image_model to None after the pretrained checkpoint was loaded.
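A quick way to confirm this (a sketch; net_G stands for your loaded generator object, and the attribute name single_image_model is an assumption that may differ in wc_vid2vid.py) is to check right after the checkpoint is loaded:

```python
# After loading the pretrained checkpoint, the single-image branch should exist.
# If this assertion fires, an edit likely left it unset (e.g. via load_weights = False).
assert getattr(net_G, 'single_image_model', None) is not None, \
    'single_image_model is None after loading the checkpoint'
```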