facebookresearch / pytorchvideo

A deep learning library for video understanding research.
https://pytorchvideo.org/
Apache License 2.0

RuntimeError: stack expects each tensor to be equal size, but got [3, 61, 864, 1152] at entry 0 and [3, 60, 864, 1152] at entry 1 #61

Open mrclasalvia opened 3 years ago

mrclasalvia commented 3 years ago

RuntimeError: stack expects each tensor to be equal size, but got [3, 61, 864, 1152] at entry 0 and [3, 60, 864, 1152] at entry 1

Hi everyone! I am getting the aforementioned error when using a custom dataset; does anyone know why? I assumed it was due to the videos having different frame rates (so the time dimension T would differ from one clip sample to the next), so I preprocessed them all to the same frame rate. However, it still does not work.

Thanks.

nicklaslund commented 3 years ago

It is exactly what you write: the elements in the batch do not have the same size in the temporal dimension, and torch.stack() expects tensors of equal dimensions. So the problem is most likely in your pre-processing step when sampling frames.
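
If it helps, here is a minimal sketch of forcing every clip to a fixed temporal size in the transform before batching, so torch.stack() always sees equal shapes. It assumes the pytorchvideo.transforms API (ApplyTransformToKey, UniformTemporalSubsample); the frame count of 16 and the 224x224 spatial size are just example values:

    import torch
    from pytorchvideo.transforms import ApplyTransformToKey, UniformTemporalSubsample
    from torchvision.transforms import Compose

    # Subsample every clip to exactly 16 frames before it reaches the default
    # collate_fn, so torch.stack sees tensors of equal size.
    transform = ApplyTransformToKey(
        key="video",
        transform=Compose([UniformTemporalSubsample(16)]),
    )

    # Two clips with different frame counts (61 and 60) collate fine after the subsample.
    clip_a = {"video": torch.randn(3, 61, 224, 224)}
    clip_b = {"video": torch.randn(3, 60, 224, 224)}
    batch = torch.stack([transform(clip_a)["video"], transform(clip_b)["video"]])
    print(batch.shape)  # torch.Size([2, 3, 16, 224, 224])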

mrclasalvia commented 3 years ago

Thank you. I made sure that all the videos now have the same frame rate, and I also resized the height and width of the frames to 224x224. However, it still tells me that I am loading [3, 61, 864, 1152]. What is it loading? Is it right to use "Kinetics" to load a custom dataset, or should I use something else?

Thank you again for the advice and help

kalyanvasudev commented 3 years ago

Hi @mrclasalvia, could you please provide more context about your code here? Could you also share your whole stack trace and the code snippet you used for batch generation (including your dataloader instantiation and data transforms)? Thanks!

trigal commented 3 years ago

Mmm, just curious... which model are you using? Disclaimer: not sure if this is related to my case... :( I have a similar issue with SlowFast that I can't figure out, but it does not seem to be an issue with the data loader (I think), because the shapes of the inputs just before y_hat = self.model(x) are:

batch[self.batch_key] ->
batch[self.batch_key][0].shape = torch.Size([16, 3, 5, 224, 224])
batch[self.batch_key][1].shape = torch.Size([16, 3, 15, 224, 224])

where the "original" is the second one and the first is created with PackPathway with alpha set to 3.

I think (obviously) I am doing something wrong, but I don't understand where either... For SlowFast, I can't find a working example in this repo for the training phase with a "custom" dataset (I adapted the Charades dataloader in my case, but I don't think that is the issue).

Some more details: within the forward of MultiPathWayWithFuse,

self.multipathway_blocks[pathway_idx](x[0]).shape : torch.Size([16, 64, 5, 56, 56])
self.multipathway_blocks[pathway_idx](x[1]).shape : torch.Size([16, 64, 15, 56, 56])

and in the forward of slowfast.py

x_s.shape : torch.Size([16, 64, 5, 56, 56])
x_f.shape : torch.Size([16, 8, 15, 56, 56])

and the offending line is then

x_s_fuse = torch.cat([x_s, fuse], 1)
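
To make the mismatch concrete, here is a rough numeric illustration of where I think it goes wrong. The fusion conv parameters below (temporal kernel 7, stride 4, padding 3, channel fusion ratio 2) are assumptions based on the create_slowfast defaults, not values read from my config:

    import torch
    import torch.nn as nn

    # Assumed fusion conv: temporal kernel 7, stride 4, padding 3, and
    # 8 -> 16 channels (fusion ratio 2). With a fast pathway of T=15,
    # the fused tensor comes out with T=4, not the slow pathway's T=5.
    fuse_conv = nn.Conv3d(8, 16, kernel_size=(7, 1, 1), stride=(4, 1, 1), padding=(3, 0, 0))

    x_f = torch.randn(16, 8, 15, 56, 56)   # fast pathway
    x_s = torch.randn(16, 64, 5, 56, 56)   # slow pathway
    fuse = fuse_conv(x_f)
    print(fuse.shape)  # torch.Size([16, 16, 4, 56, 56])

    # torch.cat([x_s, fuse], 1) then fails because dim 2 (T) is 5 vs 4.
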
kalyanvasudev commented 3 years ago

@trigal, thanks for the additional details! Are you training a new SlowFast model from scratch? If so, could you please provide your model instantiation too?

@haooooooqi, could you also please take a look at this issue? Thanks!

trigal commented 3 years ago

@trigal, thanks for the additional details! Are you training a new SlowFast model from scratch? If so, could you please provide your model instantiation too?

@kalyanvasudev yes, I'm training from scratch, and I used the PyTorch Lightning model that comes with the video_classification_example, changing a few small things to adapt the code to my needs.

Basically, I changed the beginning of VideoClassificationLightningModule in this way:

        #############
        # PTV Model #
        #############

        # Here we construct the PyTorchVideo model. For this example we're using a
        # ResNet that works with Kinetics (e.g. 400 num_classes). For your application,
        # this could be changed to any other PyTorchVideo model (e.g. for SlowFast use
        # create_slowfast).
        if self.args.arch == "video_resnet":
            self.batch_key = "video"
            if self.args.selectnet == 'RESNET3D':
                self.model = pytorchvideo.models.resnet.create_resnet(input_channel=3, model_num_class=7)
            elif self.args.selectnet == 'X3D':
                # self.model = pytorchvideo.models.x3d.create_x3d(model_num_class=7,
                #                                                 input_clip_length=self.args.clip_duration,
                #                                                 input_crop_size=224)
                self.model = pytorchvideo.models.x3d.create_x3d(model_num_class=7, input_clip_length=6)
            elif self.args.selectnet == 'SLOWFAST':
                self.model = pytorchvideo.models.slowfast.create_slowfast(model_num_class=7)
            else:
                exit(-33)

I have a forked repo in my university organization; you can find all the code here: https://github.com/invett/pytorchvideo/blob/master/tutorials/video_classification_example/train.py. Please forgive me if I did something nonsensical ;-)

trigal commented 3 years ago

any idea?

haooooooqi commented 3 years ago

Hi, thanks for playing with PTV! Regarding the dim issue in the SlowFast model, you might need to try a different T size. The (T1, T2) pair of (15, 5) you used ends up with rounding issues. Could you try a (T1, T2) pair where T1 divided by slowfast_fusion_conv_stride[0] gives exactly T2?
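
For example, a PackPathway-style transform (adapted from the SlowFast hub tutorial) with alpha equal to slowfast_fusion_conv_stride[0] (which I believe defaults to 4) produces a compatible pair such as (T1, T2) = (32, 8):

    import torch

    # Pathway-packing transform adapted from the PyTorchVideo SlowFast tutorial.
    # `alpha` should match slowfast_fusion_conv_stride[0] (assumed default: 4),
    # so that T_fast / alpha == T_slow and the fusion cat lines up.
    class PackPathway(torch.nn.Module):
        def __init__(self, alpha: int = 4):
            super().__init__()
            self.alpha = alpha

        def forward(self, frames: torch.Tensor):
            # frames: (C, T, H, W); the fast pathway keeps all T frames.
            fast_pathway = frames
            # The slow pathway takes every alpha-th frame.
            slow_pathway = torch.index_select(
                frames,
                1,
                torch.linspace(0, frames.shape[1] - 1, frames.shape[1] // self.alpha).long(),
            )
            return [slow_pathway, fast_pathway]

    clip = torch.randn(3, 32, 224, 224)
    slow, fast = PackPathway(alpha=4)(clip)
    print(slow.shape, fast.shape)  # (3, 8, 224, 224) and (3, 32, 224, 224)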

zhengyu-yang commented 3 years ago

@haooooooqi I used torch.hub.load("facebookresearch/pytorchvideo", model='slowfast_r50', pretrained=True) to load the model and get a similar error when the T values of the two pathways are (16, 64):

RuntimeError: Sizes of tensors must match except in dimension 2. Got 33 and 9 (The offending index is 0)
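
For reference, a smoke test against the hub slowfast_r50 with the frame counts from the official tutorial (8 slow / 32 fast frames, 256x256 crop); these values are taken from the tutorial rather than from my (16, 64) setup:

    import torch

    # Load the pretrained model as above.
    model = torch.hub.load("facebookresearch/pytorchvideo", model="slowfast_r50", pretrained=True)
    model = model.eval()

    # The tutorial feeds a list [slow, fast] with T = 8 and T = 32 (ratio 4).
    slow = torch.randn(1, 3, 8, 256, 256)
    fast = torch.randn(1, 3, 32, 256, 256)

    with torch.no_grad():
        preds = model([slow, fast])
    print(preds.shape)  # torch.Size([1, 400]) for Kinetics-400
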
jpainam commented 3 years ago

Facing the same issue. Did anyone find a solution for these differing video lengths? Thanks.

iamharsha1999 commented 1 year ago

RuntimeError: stack expects each tensor to be equal size, but got [3, 61, 864, 1152] at entry 0 and [3, 60, 864, 1152] at entry 1

Hi everyone! I am getting the aforementioned error when using a custom dataset; does anyone know why? I assumed it was due to the videos having different frame rates (so the time dimension T would differ from one clip sample to the next), so I preprocessed them all to the same frame rate. However, it still does not work.

Thanks.

I faced the exact same issue. I was using 'random' clip sampling. I'm not exactly sure why this threw the error, but what helped me circumvent it was replacing the random clip sampler with the uniform clip sampler.
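
Roughly what the swap looked like (the path and clip duration are placeholders, and I'm assuming the labeled_video_dataset / make_clip_sampler API):

    import torch
    from pytorchvideo.data import labeled_video_dataset, make_clip_sampler

    DATA_PATH = "my_dataset/train"   # placeholder path
    CLIP_DURATION = 2.0              # seconds, placeholder value

    # "uniform" instead of "random" clip sampling.
    dataset = labeled_video_dataset(
        data_path=DATA_PATH,
        clip_sampler=make_clip_sampler("uniform", CLIP_DURATION),
        decode_audio=False,
    )

    loader = torch.utils.data.DataLoader(dataset, batch_size=8)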