DAVEISHAN / TCLR

Official code repo for TCLR: Temporal Contrastive Learning for Video Representation [CVIU-2022]

Some questions from a beginner #4

Closed Jimmycreative closed 1 year ago

Jimmycreative commented 1 year ago

Hi, I'm Jimmy. I'm currently learning the video area of computer vision. I wonder if it is possible to ask you some questions here? I would really appreciate your help, because I have tried searching for these questions online but could not solve them by myself. :(

  1. At this line, I'm a bit confused: are we supposed to input video data here? Why do we need this? I know 5 is the batch size, 3 is the number of input channels, 16 is the number of frames input to the network, and the width and height are 112. Am I right?

  2. For lines 15 and 17, why do we need to expand the layer? Why not just use the original ResNet-18 as the backbone?

  3. Why do we need to define sparse_clip, dense_clip0, dense_clip1, dense_clip2, dense_clip3, ...? I'm not quite sure what the purpose of this is, and there is no definition of a sparse clip or a dense clip in the paper.

  4. What is the difference between inplanes and planes? Does inplanes mean the channel size? What about planes?

  5. Also, if I want to experiment with this model on a different dataset, are there any big modifications I need to be aware of?

Sorry to bother you. Thank you so much.

DAVEISHAN commented 1 year ago

Hey Jimmy,

  1. The purpose of the main function is to debug the forward pass of the model with random input. Ultimately, we pass the real video data from the dataloader.
  2. For the Global-Local temporal contrastive loss, we need to preserve a temporal resolution of 4 in the features so that we can contrast them with the temporally-aligned local clips. The original backbone only provides a temporal resolution of 2 at the final layer (see the sketch after this list).
  3. Sparse-clip = Global clip, Dense clip = Local clip
  4. Attaching reference from where I got the code: https://github.com/pytorch/vision/blob/master/torchvision/models/video/resnet.py
  5. Give it a try with the default parameters, and let me know by email if you require further help.
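
To make points 1 and 2 concrete, here is a minimal sketch (assuming only the stock torchvision backbone, not this repo's modified model.py) that pushes a random clip-shaped tensor through vanilla R3D-18 and inspects the temporal size of the final feature map:

    # Minimal sketch with torchvision's stock R3D-18 (not TCLR's expanded backbone):
    # feed a random dummy clip, the same kind of tensor the main function uses for
    # debugging, and check the temporal resolution of the last convolutional feature map.
    import torch
    from torchvision.models.video import r3d_18

    model = r3d_18().eval()                # vanilla backbone, no pretraining
    x = torch.randn(5, 3, 16, 112, 112)    # (batch, channels, frames, height, width)

    with torch.no_grad():
        feat = model.stem(x)
        for layer in (model.layer1, model.layer2, model.layer3, model.layer4):
            feat = layer(feat)

    # Vanilla R3D-18 halves the 16-frame input three times (16 -> 8 -> 4 -> 2),
    # so feat.shape is (5, 512, 2, 7, 7). TCLR's expanded layers instead keep a
    # temporal resolution of 4 here, so the global features can be contrasted
    # with the 4 temporally-aligned local clips.
    print(feat.shape)
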
Jimmycreative commented 1 year ago

Dear Dave,

I have another three questions.

  1. For the local clips, we divide each video into 4 local 16-frame clips, but is the total length of each training video 64 frames? If not, how do we decide which consecutive 64 frames of each training video to use? I'm asking this because I'm trying to use video samples with more total frames per video.

  2. In terms of the downstream tasks, is the training data supposed to be passed through TCLR first to get the feature representations? I couldn't find the code for this in the complete_retrieval.py file; it directly uses the original video input. I'm not sure whether I have missed something.

  3. Regarding the pre-trained weights you mentioned in the README (R3D-18 with UCF101 pre-training): are these for TCLR or for the downstream models?

Thanks

DAVEISHAN commented 1 year ago

Jimmy, I am answering your queries in order:

  1. We assume that the video is longer than 64 frames. First, we select a random 64-frame window within the total video length (L#82). The same window is sampled as a global (i.e. sparse) clip with skip=4 and as 4 consecutive local (i.e. dense) clips; see the sketch after this list. If you have longer videos, you can try a 128-frame window with params.sr_ratio=8 and set sr_sparse=8 in dl_tclr.
  2. Yes, we first extract features from the original videos using the TCLR-pretrained model and use them for retrieval. The features are extracted in lines 176-224 of complete_retrieval.py.
  3. Let me put it this way: first, we perform TCLR SSL training, which takes the model from scratch --> we store the SSL checkpoint --> we load the SSL checkpoint for the downstream task. The model I have put up is the SSL checkpoint without any downstream-specific finetuning.
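
For reference, here is a minimal sketch of that sampling scheme (an illustration under the assumptions above, not the repo's dl_tclr code; frame indices are assumed to run from 0 to num_frames-1):

    # Illustrative sketch of the sampling in point 1 (not the repo's dataloader):
    # pick a random 64-frame window, then take one sparse/global clip with skip=4
    # and four consecutive dense/local 16-frame clips from the same window.
    import random

    def sample_clip_indices(num_frames, window=64, clip_len=16, sr_sparse=4):
        start = random.randint(0, num_frames - window)       # random 64-frame window
        window_idx = list(range(start, start + window))
        sparse_clip = window_idx[::sr_sparse]                 # 16 frames, skip=4
        dense_clips = [window_idx[i * clip_len:(i + 1) * clip_len]
                       for i in range(window // clip_len)]    # 4 x 16 consecutive frames
        return sparse_clip, dense_clips

    sparse, dense = sample_clip_indices(num_frames=300)
    print(len(sparse), [len(c) for c in dense])               # 16 [16, 16, 16, 16]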

Hope it answers your questions. Ishan

Jimmycreative commented 1 year ago

Dear Dave,

  1. In terms of the second question above, the parameter m_file_name loads the TCLR model at line 132 of complete_retrieval.py. My question is: at lines 62-66, in the function build_r3d_encoder_ret, is this the feature-extractor part of the model (TCLR) without the MLP? Also, what is the difference between the parameters kin_pretrained and self_pretrained?

  2. When I tried to run train.py in the linear_eval folder, it ran successfully, but I encountered messages about missing frames and failed clips. I wonder if you have had similar problems. I used the same UCF101 dataset you mentioned here.

    Training Epoch 0, Batch 0, Loss: 4.72534 
    Clip ../data/UCF-101/BaseballPitch/v_BaseballPitch_g10_c01.avi is missing 1 frames
    Clip ../data/UCF-101/Basketball/v_Basketball_g18_c03.avi is missing 1 frames
    Training Epoch 0, Batch 24, Loss: 8.15910
    Training Epoch 0, Batch 48, Loss: 8.84915
    Training Epoch 0, Batch 72, Loss: 9.08288
    [263 265 267 269 271 273 275 277 279 281 283 285 287 289 291 293]
    Clip ../data/UCF-101/GolfSwing/v_GolfSwing_g22_c02.avi Failed
    Training Epoch 0, Batch 96, Loss: 9.43096
    Clip ../data/UCF-101/HorseRiding/v_HorseRiding_g14_c02.avi Failed
    Training Epoch 0, Batch 120, Loss: 9.73041
  3. There is also a problem I encountered when shuffle=True in the dataloader. The solution I used is https://github.com/dbolya/yolact/issues/664#issuecomment-878241658. I wonder if you have had this problem before (a sketch of the workaround is included after the traceback below). BTW, all the code ran on a Google Colab GPU; I don't have a GPU on my local machine.
    Epoch 3 started
    train at epoch 3
    Learning rate is: 0.01
    Epoch  3  failed
    ------------------------------------------------------------
    Traceback (most recent call last):
    File "train.py", line 244, in train_classifier
      model, train_loss = train_epoch(run_id, learning_rate2,  epoch, train_dataloader, model, criterion, optimizer, writer, use_cuda)
    File "train.py", line 44, in train_epoch
      for i, (inputs, label, vid_path) in enumerate(data_loader):
    File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 435, in __iter__
      return self._get_iterator()
    File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
      return _MultiProcessingDataLoaderIter(self)
    File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1072, in __init__
      self._reset(loader, first_iter=True)
    File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1105, in _reset
      self._try_put_index()
    File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1339, in _try_put_index
      index = self._next_index()
    File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 618, in _next_index
      return next(self._sampler_iter)  # may raise StopIteration
    File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py", line 254, in __iter__
      for idx in self.sampler:
    File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py", line 132, in __iter__
      yield from torch.randperm(n, generator=generator).tolist()
    RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
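
A minimal sketch of the kind of workaround suggested in that yolact comment (hypothetical placeholder data, not code from this repo; the relevant part is the explicit CUDA generator handed to the DataLoader):

    # Hypothetical sketch of the failure mode and workaround (not from this repo).
    # When the default tensor type is CUDA, RandomSampler calls torch.randperm
    # with a CPU generator and raises the error above; passing an explicit CUDA
    # generator to the DataLoader is one way around it.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    torch.set_default_tensor_type('torch.cuda.FloatTensor')   # the likely trigger

    dummy = TensorDataset(torch.randn(8, 3), torch.zeros(8))  # placeholder dataset
    loader = DataLoader(dummy, batch_size=2, shuffle=True,
                        generator=torch.Generator(device='cuda'))  # the workaround

    for clips, labels in loader:
        pass  # iterates without the 'cuda'/'cpu' generator mismatch
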
DAVEISHAN commented 1 year ago

Custom dataset issue, communicated over email. Closing it.