Closed. Jimmycreative closed this issue 1 year ago.
Hey Jimmy,
Dear Dave,
I have three more questions.
For the local clips, we divide each video into 4 local 16-frame clips, but is the total length used from each training video 64 frames? If not, how do we decide which consecutive 64 frames of each training video to use? I'm asking because I'm trying to use videos with more total frames.
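To make my question concrete, here is my current guess of the sampling: pick a random consecutive 64-frame window and split it into four back-to-back 16-frame clips. This is only a sketch of my assumption, not the repo's actual code:

```python
import random

def sample_local_clip_frames(total_frames, num_clips=4, clip_len=16):
    """Pick a random consecutive window of num_clips * clip_len frames,
    then split it into back-to-back clips of clip_len frames each.
    (My assumption of the sampling scheme, not TCLR's actual code.)"""
    window = num_clips * clip_len  # 64 frames by default
    if total_frames < window:
        raise ValueError("video too short for the requested window")
    start = random.randint(0, total_frames - window)
    return [list(range(start + c * clip_len, start + (c + 1) * clip_len))
            for c in range(num_clips)]

clips = sample_local_clip_frames(total_frames=200)
```

Is this roughly what happens for longer videos, or is the window chosen differently?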
In terms of the downstream tasks, is the training data supposed to be passed through TCLR first to get the feature representations? I couldn't find the code for this in the complete_retrieval.py file; it directly uses the original video input. Not sure if I have missed something.
For the pre-training weights you mentioned in the README (R3D-18 with UCF101 pre-training): are these for TCLR or for the downstream models?
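For the retrieval question above, my mental model is roughly the following: freeze the pre-trained encoder, extract features for query and gallery clips, and retrieve by cosine similarity. This is a sketch under my own assumptions (the names `encoder`, `retrieve` are mine, not the repo's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def retrieve(encoder, query_clips, gallery_clips, k=1):
    """Nearest-neighbour retrieval with a frozen encoder (hypothetical sketch;
    `encoder` stands in for the loaded TCLR backbone)."""
    with torch.no_grad():
        q = F.normalize(encoder(query_clips), dim=1)    # (Nq, D) features
        g = F.normalize(encoder(gallery_clips), dim=1)  # (Ng, D) features
    sims = q @ g.t()                    # cosine-similarity matrix
    return sims.topk(k, dim=1).indices  # nearest gallery index per query

# Toy check with an identity "encoder": each query retrieves itself.
gallery = torch.eye(4)
idx = retrieve(nn.Identity(), gallery, gallery)
```

Is this the right picture, and if so, where in complete_retrieval.py does the feature extraction happen?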
Thanks
Jimmy, I am answering your queries in order:
Hope this answers your questions. Ishan
Dear Dave,
In terms of the second question above, the parameter m_file_name loads the TCLR model at line 132 of complete_retrieval.py. My question is: at lines 62-66, in the function build_r3d_encoder_ret, is this the feature-extractor part of the model (TCLR) without the MLP? Also, what is the difference between the parameters kin_pretrained and self_pretrained?
When I tried to run train.py in the linear_eval folder, it ran successfully, but I encountered messages about missing frames and failed clips. I wonder if you have had similar problems. I used the same UCF101 dataset you mentioned here.
Training Epoch 0, Batch 0, Loss: 4.72534
Clip ../data/UCF-101/BaseballPitch/v_BaseballPitch_g10_c01.avi is missing 1 frames
Clip ../data/UCF-101/Basketball/v_Basketball_g18_c03.avi is missing 1 frames
Training Epoch 0, Batch 24, Loss: 8.15910
Training Epoch 0, Batch 48, Loss: 8.84915
Training Epoch 0, Batch 72, Loss: 9.08288
[263 265 267 269 271 273 275 277 279 281 283 285 287 289 291 293]
Clip ../data/UCF-101/GolfSwing/v_GolfSwing_g22_c02.avi Failed
Training Epoch 0, Batch 96, Loss: 9.43096
Clip ../data/UCF-101/HorseRiding/v_HorseRiding_g14_c02.avi Failed
Training Epoch 0, Batch 120, Loss: 9.73041
Epoch 3 started
train at epoch 3
Learning rate is: 0.01
Epoch 3 failed
------------------------------------------------------------
Traceback (most recent call last):
File "train.py", line 244, in train_classifier
model, train_loss = train_epoch(run_id, learning_rate2, epoch, train_dataloader, model, criterion, optimizer, writer, use_cuda)
File "train.py", line 44, in train_epoch
for i, (inputs, label, vid_path) in enumerate(data_loader):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 435, in __iter__
return self._get_iterator()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1072, in __init__
self._reset(loader, first_iter=True)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1105, in _reset
self._try_put_index()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1339, in _try_put_index
index = self._next_index()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 618, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py", line 254, in __iter__
for idx in self.sampler:
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py", line 132, in __iter__
yield from torch.randperm(n, generator=generator).tolist()
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
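If I read the traceback right, the RandomSampler's shuffle generator ended up on a CUDA device (e.g. because the default tensor type was set to cuda somewhere), while torch.randperm expects a CPU generator. One workaround I found is to pass an explicit CPU generator to the DataLoader; a minimal sketch, not the repo's code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# The RandomSampler shuffles with torch.randperm on a CPU generator, so hand
# the DataLoader one explicitly instead of letting it inherit a CUDA default.
dataset = TensorDataset(torch.arange(10))
g = torch.Generator(device='cpu')
loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=g)
num_batches = len(list(loader))
```

Does that match the intended fix, or was the root cause something else in the config?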
*Custom dataset issue, communicated over email. Closing it.
Hi, I'm Jimmy. I'm currently learning video understanding in computer vision. Would it be possible to ask you some questions here? I would really appreciate your help; I have tried searching for answers online but could not solve these by myself. :(
At this line, I'm quite confused: are we supposed to input video data here? Why do we need this? I understand that 5 is the batch size, 3 is the number of input channels, 16 is the number of frames input to the network, and the width and height are 112. Am I right?
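To check my understanding of the shape, I tried a minimal sketch where a single Conv3d stands in for the real R3D encoder (the layout (batch, channels, frames, height, width) is my reading, please correct me if it's wrong):

```python
import torch
import torch.nn as nn

# Dummy clip batch: (batch, channels, frames, height, width), my assumed layout.
x = torch.randn(5, 3, 16, 112, 112)
conv = nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1)
out = conv(x)
print(tuple(out.shape))  # (5, 64, 8, 56, 56)
```

So is the tensor at that line just such a dummy input for a shape sanity check rather than real video data?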
For lines 15 and 17, why do we need to expand the layer? Why not just use the original ResNet-18 as the backbone?
Why do we need to define sparse_clip, dense_clip0, dense_clip1, dense_clip2, dense_clip3, etc.? I'm not sure of the purpose of this, and there is no definition of sparse clip or dense clip in the paper.
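My current guess, written out as a sketch so you can tell me where I'm wrong (the exact strides and segment boundaries here are my assumptions, not taken from the paper or code):

```python
def sparse_clip_indices(total_frames, clip_len=16):
    """Sparse (global) clip: frames strided evenly across the whole video."""
    stride = max(1, total_frames // clip_len)
    return [i * stride for i in range(clip_len)]

def dense_clip_indices(total_frames, clip_idx, num_clips=4, clip_len=16):
    """Dense (local) clip: consecutive frames from the clip_idx-th segment."""
    start = clip_idx * (total_frames // num_clips)
    return list(range(start, start + clip_len))
```

Is that roughly the distinction, i.e. one global strided clip plus four local consecutive clips?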
What is the difference between inplanes and planes? Does inplanes mean the channel size? What about planes?
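As far as I can tell from the standard ResNet naming (so this is my assumption, not documentation from this repo), inplanes is the number of channels entering a block and planes the number it produces:

```python
import torch
import torch.nn as nn

class BasicBlockSketch(nn.Module):
    """inplanes = channels coming into the block; planes = channels it emits.
    A simplified stand-in for a ResNet block, for illustration only."""
    def __init__(self, inplanes, planes, stride=1):
        super().__init__()
        self.conv = nn.Conv3d(inplanes, planes, kernel_size=3,
                              stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm3d(planes)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

block = BasicBlockSketch(inplanes=64, planes=128)
y = block(torch.randn(1, 64, 8, 28, 28))  # 64 channels in, 128 out
```

Is that the right reading of the two parameters here?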
Also, if I want to experiment with this model on a different dataset, are there any big modifications I need to be aware of?
Sorry to bother you. Thank you so much.