STVIR / seg2vid

Video Generation from Single Semantic Label Map

Problems when running the code #1

Open tomrunia opened 5 years ago

tomrunia commented 5 years ago

Hi, thanks so much for releasing the code! Unfortunately, the code seems untested, as there are several typos that raise runtime errors (e.g. uitls and datset throw exceptions). Could you provide instructions on how to run the code on Cityscapes and KTH? It is unclear how the paths should be set in order to run on Cityscapes. For example, the data loader throws the error below; it seems a path argument is simply missing from the function call. Some assistance on how to run the code would be very helpful. Thank you very much!

  File "train_refine_multigpu.py", line 173, in <module>
    a = flowgen(opt)
  File "train_refine_multigpu.py", line 43, in __init__
    train_Dataset = get_training_set(opt)
  File "/workplace/code/seg2vid/src/dataset.py", line 12, in get_training_set
    size=opt.input_size, split='train', split_num=1, num_frames=opt.num_frames
TypeError: __init__() missing 1 required positional argument: 'mask_data_path'
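
For reference, a guess at what the fix might look like (the class and attribute names below are assumptions on my part, not the repo's actual API): the constructor called in src/dataset.py apparently also needs a path to the segmentation masks, e.g.

def get_training_set(opt):
    # Hypothetical call site: pass the missing mask path alongside the other options.
    return CityscapesDataset(
        data_path=opt.data_path,            # directory with the RGB frames
        mask_data_path=opt.mask_data_path,  # directory with the semantic label maps
        size=opt.input_size,
        split='train',
        split_num=1,
        num_frames=opt.num_frames,
    )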
tomrunia commented 5 years ago

Furthermore, I have the following questions about the relationship between the paper and code.

  1. Occlusion Masks. The paper reads:

We define a pixel value in the occlusion map to be zero when there is no correspondence between frames. All optical flows and occlusion maps are jointly predicted by our image-to-flow module.

This is implemented with the code below:

import torch
import torch.nn as nn

# upconv is a helper block defined elsewhere in the repo (upsampling + convolution).
class get_occlusion_mask(nn.Module):
    def __init__(self):
        super(get_occlusion_mask, self).__init__()
        self.main = nn.Sequential(
            upconv(64, 16, 5, 1, 2),
            nn.Conv2d(16, 2, 5, 1, 2),
        )

    def forward(self, x):
        # Sigmoid squashes the 2-channel output to [0, 1] per pixel.
        return torch.sigmoid(self.main(x))

# ...

# One mask per predicted frame: chunks along the batch axis are re-stacked on a new time axis.
masks = torch.cat(self.get_mask(flow_deco4).unsqueeze(2).chunk(opt.num_predicted_frames, 0), 2)

I was wondering whether you could give some additional motivation for this part of the paper. How does the combination of a Conv2d and a sigmoid represent the correspondence between frames?

  2. Flow Decoder. In Section 4.2 the paper reads:

For the flow encoder, we use three blocks each consisting of 3D convolutional layers intercepted with bilinear upsampling layer that progressively recovers the input resolution in both spatial and temporal dimensions.

Note: "encoder" should be "decoder" here I think?

This decoder seems to be implemented here: https://github.com/STVIR/seg2vid/blob/junting/src/models/multiframe_w_mask_genmask.py#L164 -- however, could you comment on the use of 2D versus 3D convolutions? It seems like there are hardly any 3D convolutions in the code. If I am reading it correctly, you stack the temporal dimension into the image channels and then process the frames with 2D convolutions on the multi-channel images. Only the final gateconv3d blocks use 3D convolutions; the upconv blocks use 2D convolutions with upsampling.

Thanks for helping me understand your CVPR paper and have fun at the conference next week!

junting commented 5 years ago


Dear Tom, thanks for your comments! Sorry for the errors and typos. We have uploaded a new commit with all bugs fixed, and we have also updated the README to explain how the dataset paths should be set.

junting commented 5 years ago


Regarding the first question, the occlusion mask is learned by optimizing a loss function over the sequence of frames, so that the network finds correspondences between them; as stated in the paper, pixels with no correspondence are driven toward a value of zero.
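
To make that concrete, here is a minimal illustrative sketch (not the actual code in this repository, and simplified to a single-channel mask): the sigmoid mask gates a warping-based reconstruction term, so the network is free to predict values near zero at pixels where warping cannot find a correspondence, which matches the paper's definition of the occlusion map.

import torch
import torch.nn.functional as F

def warp(frame, flow):
    # Backward-warp `frame` (N, C, H, W) with `flow` (N, 2, H, W), flow in pixel units.
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=frame.device),
        torch.linspace(-1, 1, w, device=frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    norm_flow = torch.stack(
        (flow[:, 0] / ((w - 1) / 2.0), flow[:, 1] / ((h - 1) / 2.0)), dim=-1
    )
    return F.grid_sample(frame, base + norm_flow, align_corners=True)

def masked_reconstruction_loss(target, source, flow, mask):
    # `mask` in [0, 1] from the sigmoid head: ~1 where the warped pixel is valid,
    # ~0 where there is no correspondence (occlusion, out-of-frame motion).
    # In practice a regularizer pushing the mask toward 1 is also needed to avoid
    # the trivial all-zero solution.
    warped = warp(source, flow)
    return (mask * (warped - target).abs()).mean()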

About the second question, you are right: "encoder" should read "decoder" there. In the decoder, we combine both 2D and 3D convolutional layers.
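
For readers comparing paper and code, here is a toy sketch (illustrative only, not the repo's upconv/gateconv3d modules) of how 2D and 3D convolutions can be mixed in such a decoder: per-frame 2D upsampling and convolution with frames stacked along the batch axis, followed by a 3D convolution that mixes information across time after reshaping to (N, C, T, H, W).

import torch
import torch.nn as nn

class MixedDecoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch, num_frames):
        super().__init__()
        self.num_frames = num_frames
        # Per-frame spatial upsampling + 2D convolution.
        self.up2d = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Temporal mixing with a 3D convolution.
        self.conv3d = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (N * T, C, H, W), frames of each sample assumed contiguous in the batch.
        x = self.up2d(x)
        nt, c, h, w = x.shape
        n = nt // self.num_frames
        x = x.view(n, self.num_frames, c, h, w).permute(0, 2, 1, 3, 4)  # (N, C, T, H, W)
        x = torch.relu(self.conv3d(x))
        x = x.permute(0, 2, 1, 3, 4).reshape(nt, c, h, w)
        return x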