dingfengshi / TriDet

[CVPR2023] Code for the paper, TriDet: Temporal Action Detection with Relative Boundary Modeling
MIT License

Get prediction on new videos #19

Closed SimoLoca closed 1 year ago

SimoLoca commented 1 year ago

Hi, thanks for your excellent work! I have a question regarding model inference, especially when deploying models. When we pass a video (features) to the model, it outputs a series of temporal segments along with their respective labels and associated "confidence" scores. However, in some cases, we may not know in advance how many actions are present in the video. In such situations, how do we select the correct segments? Do we simply take the top-k segments, where k is fixed beforehand?

dingfengshi commented 1 year ago

Hi, at test time we keep the segments whose classification score is above a specific threshold, compute each boundary point from the corresponding offsets predicted by the Trident-head, and then perform Soft-NMS to remove redundant predictions.
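The decoding step described above can be sketched roughly as follows. This is a simplified illustration, not TriDet's actual code; the array names, shapes, and the default threshold are all assumptions:

```python
import numpy as np

def decode_segments(points, cls_scores, offsets, score_thresh=0.1):
    """Sketch of the described post-processing (hypothetical names):
    keep timesteps whose per-class score passes the threshold, then
    turn each kept point's predicted (left, right) offsets into a
    temporal segment.

    points:     (T,)   temporal locations of the feature sequence
    cls_scores: (T, C) per-class confidence scores
    offsets:    (T, 2) predicted distances to the start/end boundary
    """
    # flatten (timestep, class) pairs and filter by confidence
    t_idx, c_idx = np.nonzero(cls_scores > score_thresh)
    scores = cls_scores[t_idx, c_idx]
    # boundary = point location shifted by the predicted offsets
    starts = points[t_idx] - offsets[t_idx, 0]
    ends = points[t_idx] + offsets[t_idx, 1]
    segments = np.stack([starts, ends], axis=1)  # (K, 2)
    return segments, c_idx, scores
```

The surviving `(segment, label, score)` triples would then be passed to Soft-NMS as described above.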

SimoLoca commented 1 year ago

So, if I understand correctly, you have described the inference process here, where at the end we postprocess the results with Soft-NMS. But again, the outputs of this process are temporal segments with their respective labels and associated "confidence" scores.

So, for example, if we take a video the model has never seen, for which we have no ground truth, and the model outputs N instances of the same action with different scores, how can we know whether these represent one and the same action (so N-1 of them are wrong) or X (< N) different actions?

I hope I was clear in the explanation, thanks!

dingfengshi commented 1 year ago

Hi, we cannot directly know whether two action predictions belong to the same action instance. What we can do is adjust the confidence threshold (removing the N-X background predictions) and/or the parameters of Soft-NMS (removing the N-1 overlapping predictions), so the number of remaining predictions varies from video to video.
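To make the two knobs mentioned above concrete, here is a minimal 1D Gaussian Soft-NMS sketch (not TriDet's exact implementation; `sigma` and `score_thresh` are the hypothetical tuning parameters). Raising `sigma` or lowering `score_thresh` keeps more overlapping predictions; the opposite suppresses duplicates more aggressively:

```python
import numpy as np

def soft_nms_1d(segments, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS over 1D temporal segments (illustrative sketch).
    Instead of discarding overlapping segments outright, decay their
    scores according to their overlap with the current top segment,
    then drop any prediction whose score falls below `score_thresh`.
    """
    segs = np.asarray(segments, dtype=float).copy()
    scs = np.asarray(scores, dtype=float).copy()
    keep_segs, keep_scores = [], []
    while scs.size > 0:
        i = np.argmax(scs)          # pick the highest-scoring segment
        keep_segs.append(segs[i])
        keep_scores.append(scs[i])
        top = segs[i]
        segs = np.delete(segs, i, axis=0)
        scs = np.delete(scs, i)
        if scs.size == 0:
            break
        # temporal IoU between the kept segment and the remaining ones
        inter = np.maximum(
            0.0,
            np.minimum(top[1], segs[:, 1]) - np.maximum(top[0], segs[:, 0]),
        )
        union = (top[1] - top[0]) + (segs[:, 1] - segs[:, 0]) - inter
        iou = inter / np.maximum(union, 1e-8)
        scs = scs * np.exp(-(iou ** 2) / sigma)  # Gaussian score decay
        mask = scs > score_thresh                # prune low-score leftovers
        segs, scs = segs[mask], scs[mask]
    return np.array(keep_segs), np.array(keep_scores)
```

With a stricter `score_thresh`, heavily overlapping duplicates of the same instance are decayed below the cutoff and removed, while well-separated detections keep their original scores.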

SimoLoca commented 1 year ago

Ok, I understand, thank you!