Closed Tejas-Haritsa-vk closed 11 months ago
Should it not be model.compute_video_aggregation(video_embeds, 1) instead of 4? You seem to be using only one input, so the batch size is 1. Can you try this modification? Also, try torch.tensor([2]) if there is still a mismatch.
Isn't data of shape [16, 4, 3, 224, 224] (B, T, C, H, W) of batch size 16 and temporal dimension 4?
No, [16, 4, 3, 224, 224] means we choose 16 clips from 1 video, and each clip has 4 frames of 3x224x224 images. You are currently looking at the dataloader. If you look at the collated batch, the input is [B, 16, 4, 3, 224, 224], where B is the batch size, meaning the number of 'distinct' videos in the batch.
Okay, that makes more sense now. But how do I get [B, 16, 4, 3, 224, 224]? Right now I'm using
if __name__ == "__main__":
    kwargs = dict(
        dataset_name="HowTo100M_VC_dataset",
        text_params=None,
        video_params={
            "input_res": 224,
            "num_frames": 4,
            "loading": "strict"
        },
        data_dir="path_to_data_dir",
        meta_dir="path_to_meta_dir",
        tsfms=init_video_transform_dict()['train'],
        reader='cv2_howto100m',
        split='val',
        neg_param=60
    )
    dataset = VideoDataLoader(**kwargs)
and I'm getting [16, 4, 3, 224, 224]
Edit: Also, in your research paper you mention: "We use a batch size of 16 per GPU for short-term contrastive learning and 1 per GPU for long-term video-level contrastive learning. Recall that one video-level batch consists of 16 clips of the same video." So doesn't that mean [1, 16, 4, 3, 224, 224] for long-term? If so, how are the video_embeds and the labels/targets matched in that case?
P.S. To give more context to my earlier (first) comment: I am getting ValueError: Expected input batch_size (4) to match target batch_size (0). when I run the code with the config provided.
Edit: Also, when I ran with a batch_size of 4, model/video_transformer.py threw this error:
File ".../HierVL/model/video_transformer.py", line 310, in forward_features
b, curr_frames, channels, _, _ = x.shape
ValueError: too many values to unpack (expected 5)
where the shape of x is [4, 16, 4, 3, 224, 224]. Please help me resolve this.
I'm using the code below, and data["video"] comes from torch.utils.data.DataLoader(dataset, batch_size=4):
with torch.set_grad_enabled(True):
    video_embeds = self.model.module.compute_video(data['video'])
    video_embeds = self.model.module.compute_video_aggregation(video_embeds, self.batch_size)
    video_predictions = self.model.module.head_ht100m_linear_probe(video_embeds)
    video_predictions = self.allgather(video_predictions.contiguous(), self.n_gpu, self.args)
    video_labels = self.allgather(data['label'], self.n_gpu, self.args)
Sorry for the misunderstanding. The input of shape [B, 16, 4, 3, 224, 224] is reshaped into [Bx16, 4, 3, 224, 224] before it is passed to the model, which explains your error about too many values to unpack. And since the model takes Bx16, we need to specify the actual batch size so that the model can rearrange [Bx16, ...] back into [B, 16, ...]. Can you verify if this works?
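The round trip described above can be sketched in plain PyTorch. This is only an illustration with random data and a reduced spatial size, not the repository's own code; the variable names are made up for the example.

```python
import torch

# Small H and W keep the example light; the thread uses 224x224.
B, nC, T, C, H, W = 2, 16, 4, 3, 8, 8

videos = torch.randn(B, nC, T, C, H, W)

# Fold the clip dimension into the batch dimension:
# [B, nC, T, C, H, W] -> [B*nC, T, C, H, W]
flat = videos.reshape(B * nC, T, C, H, W)

# The flat tensor no longer records how many videos it holds, so B has
# to be supplied again to regroup the clips per video:
# [B*nC, T, C, H, W] -> [B, nC, T, C, H, W]
restored = flat.reshape(B, nC, T, C, H, W)

print(flat.shape, restored.shape)
```

Clips of the same video stay contiguous in the flat tensor, which is why a plain reshape with the correct B recovers the per-video grouping.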
Wait, now I'm confused. Let's say we have data of shape [B, nC, T, C, H, W]
where,
B = batch_size = 4
nC = no. of clips per video = 16
T = temporal dimension (no. of frames per video clip) = 4
C = channels = 3
H = height = 224
W = width = 224
With the above context, what shape does the data need to have when we feed it into the model to compute video_embeds, where video_embeds = self.model.module.compute_video(data['video'])?
Is it [B, nC, T, C, H, W] or [BxnC, T, C, H, W]?
And if it is [BxnC, T, C, H, W], how can I convert [B, nC, T, C, H, W] to [BxnC, T, C, H, W]? Currently I'm using torch.utils.data.DataLoader(dataset, batch_size=4), which produces data of shape [B, nC, T, C, H, W].
In the implemented code, data['video'] will be [BxnC, T, C, H, W]. If you try [B, nC, T, C, H, W], you will get ValueError: too many values to unpack (expected 5) (as you noted above).
To convert [B, nC, T, C, H, W] to [BxnC, T, C, H, W], you can simply use .reshape() or .view() from the PyTorch library. Note that we use a custom collate function, in which we use torch.cat instead of the standard torch.stack exactly for this purpose.
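As a rough sketch of that idea: the function below is hypothetical (collate_clips is not the repository's collate function), and it assumes each dataset item is a dict with a 'video' tensor of shape [nC, T, C, H, W] and a 0-dim 'label' tensor, as reported in this thread.

```python
import torch

def collate_clips(samples):
    """Hypothetical collate function for samples holding the clips of one video each."""
    # torch.cat joins along the existing clip dimension: [B*nC, T, C, H, W]
    videos = torch.cat([s["video"] for s in samples], dim=0)
    # 0-dim labels are stacked into one label per video: [B]
    labels = torch.stack([s["label"] for s in samples], dim=0)
    return {"video": videos, "label": labels}

# Four fake videos, 16 clips each, with a reduced spatial size.
samples = [
    {"video": torch.randn(16, 4, 3, 8, 8), "label": torch.tensor(i)}
    for i in range(4)
]
batch = collate_clips(samples)

# Default-style collation would stack instead, giving [B, nC, T, C, H, W].
stacked = torch.stack([s["video"] for s in samples], dim=0)

print(batch["video"].shape, batch["label"].shape, stacked.shape)
```

Such a function could be passed as collate_fn to torch.utils.data.DataLoader. Concatenating is equivalent to stacking and then reshaping to [BxnC, T, C, H, W].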
PS: Why do we do this?
Standard video architectures take input in the form [B, T, C, H, W], since they do not model long videos. To maintain consistency, we absorb nC into B "temporarily". Once we get the features, we convert them back before the aggregation step. Does this make sense?
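To illustrate how this lines up predictions with labels, here is a minimal sketch under loud assumptions: clip_encoder is a stand-in linear layer rather than the actual video transformer, mean pooling stands in for the real aggregation step, and the 100 classes are taken from the [4, 100] prediction shape reported in this thread.

```python
import torch
from torch import nn

B, nC, T, C, H, W = 4, 16, 4, 3, 8, 8   # reduced H, W for a light example
D, num_classes = 32, 100

# Stand-in clip encoder (assumption): any module mapping [N, T, C, H, W] -> [N, D].
clip_encoder = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(T * C * H * W, D))
head = nn.Linear(D, num_classes)

clips = torch.randn(B * nC, T, C, H, W)       # nC absorbed into the batch dimension
labels = torch.randint(0, num_classes, (B,))  # one label per video

clip_feats = clip_encoder(clips)              # [B*nC, D]
clip_feats = clip_feats.reshape(B, nC, D)     # regroup the clips per video
video_feats = clip_feats.mean(dim=1)          # [B, D], mean pooling as a placeholder
logits = head(video_feats)                    # [B, num_classes]

# Both tensors now have B entries along the batch dimension.
loss = nn.functional.cross_entropy(logits, labels)
print(logits.shape, labels.shape)
```

After aggregation there is one embedding per video, so the predictions have B rows and need B labels, one per video.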
Ah, Makes complete sense now. Thank you so much. Will be closing this ticket now.
P.S. Regarding "Can you verify if this works?": yes, now that I follow the BxnC layout mentioned above, it is working. Thanks again for such quick responses.
Hi @thechargedneutron, I was trying to use the model separately and noticed that the HowTo100M dataloader (HowTo100M_VC_dataset.py) produces a model input of shape torch.Size([16, 4, 3, 224, 224]) and a label of shape torch.Size([]). When these are fed into the model, video_predictions.shape = torch.Size([4, 100]), and because of the shape mismatch I get the following error: ValueError: Expected input batch_size (4) to match target batch_size (0).
Please help me resolve this. I am not sure whether I am doing anything wrong here; I have followed the model prediction steps in trainer_howto100m_classification.py.