howto100m model input:output mismatch

Tejas-Haritsa-vk commented 11 months ago

Hi @thechargedneutron, I was trying to use the model on its own and noticed that the howto100m dataloader (HowTo100M_VC_dataset.py) produces a model input of shape torch.Size([16, 4, 3, 224, 224]) and a label of shape torch.Size([]). I feed them into the model using the lines of code below:

data = torch.rand([16, 4, 3, 224, 224])
video_embeds = model.compute_video(data)
video_embeds = model.compute_video_aggregation(video_embeds, 4)
video_predictions = model.head_ht100m_linear_probe(video_embeds)

loss_fn = torch.nn.CrossEntropyLoss()
loss_fn(video_predictions, torch.tensor(2))

video_predictions.shape = torch.Size([4, 100])

Due to this shape mismatch, I get the following error: ValueError: Expected input batch_size (4) to match target batch_size (0).

Please help me resolve this. I am not sure if I am doing anything wrong here, and have followed the model prediction steps as per trainer_howto100m_classification.py.

thechargedneutron commented 11 months ago

Should it not be model.compute_video_aggregation(video_embeds, 1) instead of 4? You seem to be using only one input so batch size is 1. Can you try with this modification? Also, try torch.tensor([2]) if there is still a mismatch.
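As a rough illustration of the torch.tensor([2]) suggestion, the sketch below uses random logits in place of the model output (the 100 classes are taken from the shape reported above; everything else is made up for the example). It only shows that CrossEntropyLoss expects one target per row of logits, so a single video needs logits of shape [1, 100] and a target of shape [1]:

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()

# One video -> one row of logits over 100 classes (random stand-in for the model output).
logits_one_video = torch.rand(1, 100)
target_one_video = torch.tensor([2])  # shape [1], unlike the 0-dim torch.tensor(2)

loss = loss_fn(logits_one_video, target_one_video)
print(loss.shape)  # scalar loss

# Four rows of logits against a single label: the batch sizes disagree,
# so the loss function rejects the call.
logits_four_rows = torch.rand(4, 100)
try:
    loss_fn(logits_four_rows, target_one_video)
    mismatch_raised = False
except (ValueError, RuntimeError):
    mismatch_raised = True
print(mismatch_raised)
```

The exception type is caught broadly here because it may differ between PyTorch versions.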

Tejas-Haritsa-vk commented 11 months ago

Isn't data of shape [16, 4, 3, 224, 224] (B, T, C, H, W) of batch size 16 and temporal dimension 4?

thechargedneutron commented 11 months ago

No, [16, 4, 3, 224, 224] means we choose 16 clips from 1 video, and each clip has 4 frames of 3x224x224 images. You are currently looking at the dataloader; if you look at the collated batch, the input is [B, 16, 4, 3, 224, 224], where B is the batch size, meaning the number of 'distinct' videos in the batch.

Tejas-Haritsa-vk commented 11 months ago

Okay, that makes more sense now. But how do I get [B, 16, 4, 3, 224, 224]? Right now, I'm using

if __name__ == "__main__":
    kwargs = dict(
        dataset_name="HowTo100M_VC_dataset",
        text_params=None,
        video_params={
            "input_res": 224,
            "num_frames": 4,
            "loading": "strict",
        },
        data_dir="path_to_data_dir",
        meta_dir="path_to_meta_dir",
        tsfms=init_video_transform_dict()['train'],
        reader='cv2_howto100m',
        split='val',
        neg_param=60
    )

    dataset = VideoDataLoader(**kwargs)

and I'm getting [16, 4, 3, 224, 224]

Edit: Also, in your research paper you mention: "We use a batch size of 16 per GPU for short-term contrastive learning and 1 per GPU for long-term video-level contrastive learning. Recall that one video-level batch consists of 16 clips of the same video." Doesn't that mean [1, 16, 4, 3, 224, 224] for long-term? If so, how are the video_embeds and the labels/targets handled and matched in that case?

Tejas-Haritsa-vk commented 11 months ago

P.S. To give more context to my earlier (first) comment: I am getting ValueError: Expected input batch_size (4) to match target batch_size (0). when I run the code with the config provided.

Edit: Also, when I ran with a batch_size of 4, model/video_transformer.py threw an error saying:

File ".../HierVL/model/video_transformer.py", line 310, in forward_features
    b, curr_frames, channels, _, _ = x.shape
ValueError: too many values to unpack (expected 5)

where the shape of x is [4, 16, 4, 3, 224, 224]. Please help me resolve this.

I'm using the below code and data["video"] is from torch.utils.data.DataLoader(dataset, batch_size=4):

with torch.set_grad_enabled(True):
    video_embeds = self.model.module.compute_video(data['video'])
    video_embeds = self.model.module.compute_video_aggregation(video_embeds, self.batch_size)

    video_predictions = self.model.module.head_ht100m_linear_probe(video_embeds)

    video_predictions = self.allgather(video_predictions.contiguous(), self.n_gpu, self.args)
    video_labels = self.allgather(data['label'], self.n_gpu, self.args)

thechargedneutron commented 11 months ago

Sorry for the misunderstanding. The code reshapes [B, 16, 4, 3, 224, 224] into [Bx16, 4, 3, 224, 224] before passing it to the model, which explains your error about too many values to unpack. And since the model takes Bx16, we need to specify the actual batch size so that the model can re-arrange [Bx16, ...] back into [B, 16, ...]. Can you verify if this works?
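The fold and unfold described here can be sketched with plain tensor reshapes. This is only an illustration on random data, with H and W shrunk to 8 to keep it light; it does not call any HierVL code:

```python
import torch

# B videos, nC clips per video, T frames per clip (H and W reduced for the sketch).
B, nC, T, C, H, W = 2, 16, 4, 3, 8, 8

batch = torch.rand(B, nC, T, C, H, W)

# Fold the clip dimension into the batch dimension: [B, nC, ...] -> [B*nC, ...].
flat = batch.reshape(B * nC, T, C, H, W)

# A 5-D tensor unpacks cleanly, unlike the 6-D tensor that triggered
# "too many values to unpack (expected 5)".
b, curr_frames, channels, _, _ = flat.shape
print(b, curr_frames, channels)

# Knowing B is what makes it possible to undo the fold later.
restored = flat.reshape(B, nC, T, C, H, W)
print(torch.equal(restored, batch))
```

Because reshape keeps the row-major order, the first nC rows of flat are the clips of video 0, the next nC rows are the clips of video 1, and so on.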

Tejas-Haritsa-vk commented 11 months ago

Wait, now I'm confused. Let's say we have data of shape [B, nC, T, C, H, W], where:

B = batch_size = 4
nC = no. of clips per video = 16
T = temporal dimension (no. of frames per video clip) = 4
C = channels = 3
H = height = 224
W = width = 224

With the above context, what shape does the data need to have when we feed it into the model to compute video_embeds, i.e. in video_embeds = self.model.module.compute_video(data['video'])? Is it [B, nC, T, C, H, W] or [BxnC, T, C, H, W]? And if it is [BxnC, T, C, H, W], how can I convert [B, nC, T, C, H, W] to [BxnC, T, C, H, W]? Currently I'm using torch.utils.data.DataLoader(dataset, batch_size=4), which produces data of shape [B, nC, T, C, H, W].

thechargedneutron commented 11 months ago

In the implemented code, data['video'] will be [BxnC, T, C, H, W]. If you try [B, nC, T, C, H, W], you will get ValueError: too many values to unpack (expected 5) (as you noted above).

To convert [B, nC, T, C, H, W] to [BxnC, T, C, H, W], you can simply use .reshape() or .view() from PyTorch. Note that we use a custom collate function, defined at

https://github.com/facebookresearch/HierVL/blob/998a8527ed6a3306e031ee73ed81978db6e99861/data_loader/data_loader.py#L33

We use torch.cat instead of the standard torch.stack exactly for this purpose.
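The difference between the two can be sketched with a hypothetical collate function. cat_collate below is written for this example and is not the function from the linked data_loader.py; the dictionary keys "video" and "label" follow the names used earlier in this thread, and H and W are shrunk to 8:

```python
import torch
from torch.utils.data import DataLoader


def cat_collate(samples):
    """Hypothetical collate: join per-video clip tensors along dim 0."""
    videos = torch.cat([s["video"] for s in samples], dim=0)  # [B*nC, T, C, H, W]
    labels = torch.stack([s["label"] for s in samples])       # [B]
    return {"video": videos, "label": labels}


# Four fake dataset items, each with 16 clips of 4 frames and a 0-dim label.
samples = [
    {"video": torch.rand(16, 4, 3, 8, 8), "label": torch.tensor(i)}
    for i in range(4)
]

# torch.stack adds a new leading dimension: [B, nC, T, C, H, W].
stacked = torch.stack([s["video"] for s in samples])
print(stacked.shape)

# torch.cat merges along the existing first dimension: [B*nC, T, C, H, W].
batch = cat_collate(samples)
print(batch["video"].shape, batch["label"].shape)

# The same function can be passed to a DataLoader through collate_fn.
loader = DataLoader(samples, batch_size=4, collate_fn=cat_collate)
loaded = next(iter(loader))
print(loaded["video"].shape)
```

With the default collate, a DataLoader stacks the items, which is why batch_size=4 produced a 6-D tensor earlier in the thread.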

PS: Why do we do this? Standard video architectures take input in the form [B, T, C, H, W], since they do not model long videos. To stay consistent with that, we absorb nC into B "temporarily". Once we get the features, we convert them back before the aggregation step. Does this make sense?
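A toy end-to-end version of this flow is sketched below, to show where the labels line up with the predictions. The clip encoder, the mean pooling over clips, and the linear head are all stand-ins chosen for the example and are assumptions, not the HierVL modules (compute_video, compute_video_aggregation, and head_ht100m_linear_probe may work differently):

```python
import torch
from torch import nn

B, nC, T, C, H, W = 2, 16, 4, 3, 8, 8  # H and W reduced for the sketch
embed_dim, num_classes = 32, 100

# Stand-in for a clip-level video encoder that takes [N, T, C, H, W].
clip_encoder = nn.Sequential(
    nn.Flatten(start_dim=1),
    nn.Linear(T * C * H * W, embed_dim),
)
# Stand-in for a linear-probe classification head.
head = nn.Linear(embed_dim, num_classes)

clips = torch.rand(B * nC, T, C, H, W)        # nC absorbed into the batch dim
labels = torch.randint(0, num_classes, (B,))  # one label per video

clip_embeds = clip_encoder(clips)             # [B*nC, embed_dim]

# Restore the video dimension, then aggregate the clips of each video.
# Mean pooling is an assumption made for this sketch.
video_embeds = clip_embeds.reshape(B, nC, embed_dim).mean(dim=1)  # [B, embed_dim]

logits = head(video_embeds)                   # [B, num_classes]
loss = nn.CrossEntropyLoss()(logits, labels)
print(logits.shape, labels.shape, loss.shape)
```

After aggregation there is one row of logits per video, so the target tensor needs exactly B entries.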

Tejas-Haritsa-vk commented 11 months ago

Ah, that makes complete sense now. Thank you so much. I will be closing this ticket now.

P.S

Can you verify if this works?

Yes, now that I follow the BxnC layout mentioned above, it is working. Thanks again for such quick responses.

facebookresearch / HierVL

howto100m model input:output mismatch #6