TuanTNG / SimOn

SimOn: A Simple Framework for Online Temporal Action Localization
18 stars 3 forks source link

Various questions #8

Open LewsTherin511 opened 11 months ago

LewsTherin511 commented 11 months ago

Hi there, I've been reading the paper and checking your code, and I tried to apply it to another custom dataset for OnTAL. I have a few questions regarding the code:

  1. in the constructor of TRNTHUMOSDataLayer, you have
target = target_all[session]["anno"]
# pad feature and target
if len(video_feature) % self.stride != 0:
    tmp_data = np.zeros((self.stride - len(video_feature) % self.stride, args.dim_feature))
    video_feature = np.concatenate((video_feature, tmp_data), axis=0)
    # extend gt annotation
    tmp_data_anno = np.full((self.stride - len(video_feature) % self.stride, target.shape[-1]), -1)
    target = np.concatenate((target, tmp_data_anno), axis=0)

So, you create tmp_data and concatenate it to video_feature, so that video_feature.shape[0] is a multiple of self.stride. However, immediately below you're doing the same for target:

tmp_data_anno = np.full((self.stride - len(video_feature) % self.stride, target.shape[-1]), -1)

but at this point, len(video_feature) % self.stride = 0, because you just concatenated it with tmp_data. Shouldn't you save the original len of video_feature, and use that one instead?

original_len = len(video_feature)
if original_len % self.stride != 0:
    tmp_data = np.zeros((self.stride - original_len % self.stride, args.dim_feature))
    video_feature = np.concatenate((video_feature, tmp_data), axis=0)
    # extend gt annotation
    tmp_data_anno = np.full((self.stride - original_len, target.shape[-1]), -1)
    target = np.concatenate((target, tmp_data_anno), axis=0)
  1. In the same function, you are using:
num_windows = int((current_video_num_feature_frames + self.stride - self.num_fframes_in_window) / self.stride)

can't you just do:

num_windows =video_features.shape[0] / self.stride

can't you just do:

num_windows =video_features.shape[0] / self.stride
  1. in the function utils.frame_level_map_n_cap, you do:
        this_cls_prob = all_probs[i, :]
        this_cls_gt = all_labels[i, :]
        w = np.sum(this_cls_gt == 0) / np.sum(this_cls_gt == 1)

        indices = np.argsort(-this_cls_prob)
        tp, psum, cpsum = 0, 0.0, 0.0
        for k, idx in enumerate(indices):
            if this_cls_gt[idx] == 1:
                tp += 1
                wtp = w * tp
                fp = (k + 1) - tp
                psum += tp / (tp + fp)
                cpsum += wtp / (wtp + fp)

I don't understand why you consider the entry a TP only by checking if the corresponding GT value is equal to 1, it's like you're expecting all the elements in this_cls_prob to be positives.

I thought this_cls_prob was an array of probabilities that that specific snippet belongs to a class, so that you could have something like (assume using a threshold of 0.7):

-------------------------------------
      idx         |  0    |  1    |  2  |  3  | ...
-------------------------------------
pred_prob  | 0.9  | 0.8 | 0.7 | 0.4 | ...
frame_gt     |  1    |  0   |  1    |  1    | ...
-------------------------------------
res               |  TP |  FP |  TP | FN  | ...
-------------------------------------

And in this case, I would expect you should use something like:

if this_cls_prob > threshold and this_cls_gt[idx] == 1:
    # this is a TP
if this_cls_prob < threshold and this_cls_gt[idx] == 0:
    # this is a TN

and so on, but maybe I just misinterpreted the meaning of the arrays?