Closed AnnCumtb closed 3 months ago
I am wondering how the dataset is organized. I checked the egoprocel.json file, but I don't understand what the hdl_actions numbers mean. I would also like to know whether the I3D feature extraction uses optical-flow features. If possible, could you send me a copy of your dataset? That would make it much easier to learn from. Thanks a lot.
Thanks for your question. hdl_actions are the unique identifiers of the action segments (action IDs). I just added the mapping file dset_jsons/egoprocel-id2action.txt which gives the mapping from action ID to action name.
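A minimal sketch of how that mapping file could be read, assuming it stores one `action_id action_name` pair per line (the exact file format is an assumption, so adjust the parsing if it differs):

```python
# Minimal sketch: load dset_jsons/egoprocel-id2action.txt into a dict.
# Assumes each line looks like "<action_id> <action name>".
def load_id2action(path="dset_jsons/egoprocel-id2action.txt"):
    id2action = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            action_id, action_name = line.split(maxsplit=1)
            id2action[action_id] = action_name
    return id2action

# Example: translate the hdl_actions IDs found in egoprocel.json
# id2action = load_id2action()
# print(id2action.get("12", "unknown"))   # hypothetical ID
```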
We used resnet for feature extraction. If you would like to do I3D feature extraction on egoprocel, I would suggest using the following pages to 1) download egoprocel, and 2) extract features:
Thanks for the reply! I have already used I3D to extract features on egoprocel; however, the code fails when it reaches forward() in model_singleprong.py:

```python
def forward(self, videos):
    """
    This forward function takes in inputs and returns those inputs as embedded outputs;
    this function handles different-length inputs.
    """
    # (excerpt: `sequence`, `device`, `outputs`, and `dropouts` come from the surrounding code)
    base_output = sequence.to(device)
    # pad the start with k zero feature maps of shape (input_dimension, 14, 14)
    zero_arrays = torch.zeros((self.k,) + (self.input_dimension, 14, 14),
                              dtype=base_output.dtype, device=base_output.device)
    base_output = torch.cat((zero_arrays, base_output), dim=0)
    # temporal stacking
    this_video = torch.Tensor().long().to(device)
    for t in range(self.k, base_output.shape[0]):
        # stack the previous k frames, embed them, then pool and project
        stack = base_output[t - self.k:t, :].permute(1, 0, 2, 3)
        conv5 = self.conv_embedding(stack)
        spatio = nn.MaxPool3d(kernel_size=conv5.shape[1:])(conv5)
        embedding = self.linear_embedding(spatio.squeeze())
        this_video = torch.cat((this_video, embedding[None, :]), 0)
    outputs.append(this_video)
    if self.dropping:
        dout = self.dropout(this_video).mean(dim=0)
        dropouts.append(dout)
```
Obviously zero_arrays is 4-dimensional while my base_output is only 2-dimensional, so the concatenation fails, and stack = base_output[t - self.k:t, :].permute(1, 0, 2, 3) also expects a 4-dimensional tensor. I can't get past this problem. What does the (14, 14) mean? Looking forward to your reply!
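For reference, a minimal reproduction of the mismatch being described, with the shapes assumed from the snippet (T x 2048 pooled I3D features vs. the 4-D zero padding):

```python
import torch

T, C, k = 100, 2048, 5                     # assumed sizes for illustration
base_output = torch.randn(T, C)            # pooled I3D features: 2-D (T, 2048)
zero_arrays = torch.zeros((k, C, 14, 14))  # what the model builds: 4-D (k, C, 14, 14)

try:
    torch.cat((zero_arrays, base_output), dim=0)
except RuntimeError as e:
    # fails: tensors must have the same number of dimensions to be concatenated
    print(e)
```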
If you are using I3D, I am guessing your video features have shape T x 2048 (T being the length of the video)? If that is the case, try the "TemporalStacking" model (--tstack); it is designed for learning from features that have already been average-pooled.

The (14, 14) in the default model is the spatial size of the per-frame convolutional feature maps it expects (inputs of shape T x C x 14 x 14), which is why 2-D pooled features cannot be concatenated with the zero padding there.

If your extracted I3D features have some other dimensionality, please look into building your own architecture for learning. Thanks!
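As a quick sanity check along these lines (the feature path below is hypothetical):

```python
import numpy as np

feats = np.load("features/video_0001.npy")  # hypothetical path to one video's features
print(feats.shape)
# (T, 2048)      -> average-pooled features: try the "TemporalStacking" model (--tstack)
# (T, C, 14, 14) -> per-frame conv feature maps: the default model above applies
# anything else  -> a custom architecture is likely needed
```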
Thanks a lot!!! I worked out the problem and trained the model on my dataset successfully. But I have another question: can the model drop background frames during evaluation? Also, when I try to align a single-cycle video with a long video that contains many cycles, something odd happens: it works at first, but then one frame (say 40) aligns to frame 240 while the next frame (41) aligns to frame 580, jumping into the next cycle. I am already using DTW to find the alignment path. Can your model handle this problem?
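For context, a bare-bones sketch of the DTW path computation mentioned above, assuming a precomputed frame-distance matrix `dist` between the two embedded videos (all names here are illustrative, not from this repository):

```python
import numpy as np

def dtw_path(dist):
    """dist: (T1, T2) pairwise frame-distance matrix between the two videos."""
    T1, T2 = dist.shape
    acc = np.full((T1 + 1, T2 + 1), np.inf)   # accumulated cost, padded with inf
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # backtrack from (T1, T2) to recover the monotone alignment path
    i, j, path = T1, T2, []
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return path[::-1]
```

Because the recursion only allows monotone (1, 1), (1, 0), (0, 1) steps, the recovered path is a contiguous alignment rather than a set of independent per-frame nearest neighbours.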
Great questions. Cycle periods may introduce a couple of complexities here. I will try to answer below:
First, background detection is possible. Depending on your use case, you can take the drop-context vector after the sigmoid and use an AUC-based approach to find the best threshold for detecting background. Another option is to train an SVM to detect background if you have labels in some validation set (similar to how PC is calculated). Which approach works best depends on your data and labels.
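A minimal sketch of the AUC-based thresholding described above (`drop_prob` and `is_background` are assumed inputs, not names from this repository):

```python
import numpy as np
from sklearn.metrics import roc_curve

def pick_background_threshold(drop_prob, is_background):
    """drop_prob: per-frame drop-context values after sigmoid, shape (N,).
    is_background: binary background labels from a validation set, shape (N,)."""
    fpr, tpr, thresholds = roc_curve(is_background, drop_prob)
    j = tpr - fpr                            # Youden's J statistic
    return float(thresholds[np.argmax(j)])   # threshold that best separates background

# At evaluation time, frames with drop_prob >= threshold would be treated as background.
```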
Second, it seems that the "nearest neighbor" of 40 is 240 and the "nearest neighbor" of 41 is 580 (the next cycle). GTCC is designed to maximize multi-neighbor cycle-consistency, so it does not matter to the model whether 240 and 580 are in different cycles; what matters is that after "Gaussian splicing" there is good cycle-consistency amongst the K Gaussians (with weighting from the GMM) for all frames. There are many details I don't have about your use case, but I would suggest plotting the similarity heatmap between the two videos; this should show you the regions of high neighbor likelihood between frames. You could also apply some heuristic to ensure that nearest neighbors don't jump between cycles, but that is up to you.
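A minimal sketch of the suggested similarity heatmap, assuming `emb1` (T1 x D) and `emb2` (T2 x D) are frame embeddings produced by the trained model (names are illustrative):

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def plot_similarity_heatmap(emb1, emb2):
    """emb1: (T1, D), emb2: (T2, D) frame embeddings of the two videos."""
    sim = F.normalize(emb1, dim=1) @ F.normalize(emb2, dim=1).T  # cosine similarity (T1, T2)
    plt.imshow(sim.detach().cpu().numpy(), aspect="auto", cmap="viridis")
    plt.xlabel("frames of video 2")
    plt.ylabel("frames of video 1")
    plt.colorbar(label="cosine similarity")
    plt.show()
    return sim
```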