HYPJUDY / Decouple-SSAD

Decoupling Localization and Classification in Single Shot Temporal Action Detection
https://arxiv.org/abs/1904.07442
MIT License

About THUMOS features? #22

Closed leemengxing closed 4 years ago

leemengxing commented 4 years ago

How are the features extracted from different videos sampled consistently in the time dimension? Can you share some details about the sliding window? Thanks.

HYPJUDY commented 4 years ago

Hi, there are two ways to sample videos. (1) Sample a fixed number of frames for each video, so that the extracted frame count is the same for all videos. (2) Sample with a fixed step, so that the frames are more consistent in the time dimension, but the number of frames varies from video to video. For example, suppose videos A and B have 256 and 512 frames respectively. (1) If we output 2 frames in total, we get the 1st and 128th frames from A and the 1st and 256th frames from B. (2) If the sample step is 128, we get the 1st and 128th frames from A and the 1st, 128th, 256th, and 384th frames from B.
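The two strategies can be sketched as follows (a minimal illustration with hypothetical helper names and 0-based frame indices, not the repo's exact code; the indexing convention differs slightly from the 1-based example above):

```python
import numpy as np

def sample_fixed_count(num_frames, count):
    """Strategy (1): pick `count` evenly spaced frame indices (0-based),
    so every video yields the same number of frames."""
    return np.linspace(0, num_frames - 1, count).round().astype(int).tolist()

def sample_fixed_step(num_frames, step):
    """Strategy (2): pick every `step`-th frame (0-based), so each sampled
    frame covers the same temporal span but the count varies per video."""
    return list(range(0, num_frames, step))

# Video A (256 frames) and video B (512 frames):
print(sample_fixed_count(256, 2))   # 2 frames from A
print(sample_fixed_count(512, 2))   # 2 frames from B
print(sample_fixed_step(512, 128))  # 4 frames from B with step 128
```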

leemengxing commented 4 years ago

Hi, I ran into the problem that Tianwei Lin's BSN uses different feature extraction for the ActivityNet data and the THUMOS14 data; it is resolved now. Regarding your answer: (1) corresponds to relatively short videos, such as feature extraction on ActivityNet, where a fixed number of frames is extracted per video. (2) corresponds to relatively long videos, such as THUMOS14, where a video is split into sliding windows for feature extraction, so one video corresponds to multiple sliding windows, but the time dimension of each window is consistent. Is this understanding correct?

HYPJUDY commented 4 years ago

Hi, the choice of extraction strategy does not depend only on the video length. I think it is better to sample with a fixed step so that each frame represents a similar temporal length. But if the video is too long, or the program cannot handle inputs of different lengths, it is simpler to use a fixed number of frames to represent a video. Decouple-SSAD samples videos with a fixed step and uses sliding windows as units. BSN represents each video by a fixed number of frames and uses each video as a unit. I think both can work with proper processing.

leemengxing commented 4 years ago

@HYPJUDY If a label overlaps less than 0.9 with both the previous window and the next window, will the label be discarded? How should this problem be handled?

HYPJUDY commented 4 years ago

Sorry, I don't understand your question. What does the score (label 0.9) mean, and what does the discarding operation refer to?

leemengxing commented 4 years ago

Hi, HYPJUDY. Thank you for your reply. For example, according to gen_data_info.py, one video will generate a first window [0, 512] and a second window [128, 640]. Suppose an action clip [45, 595]. Its overlap ratio with the first window is (512 - 45) / (595 - 45) = 0.85, and with the second window it is (595 - 128) / (595 - 45) = 0.85. This clip cannot match either window, so is it discarded? This question has puzzled me for a long time. Thanks for your reply. @HYPJUDY

HYPJUDY commented 4 years ago

In your example, if config.overlap_ratio_threshold is smaller than 0.85, both sliding windows should be kept (according to the code here). The clip can match multiple windows. The ground-truth action segments for the first and second windows are [45, 512] and [128, 595] respectively. The model can learn to predict both, and the final result obtained by combining the predictions of the two windows is still [45, 595].
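The matching logic discussed here can be sketched like this (illustrative helpers mirroring the computation in the thread, not the actual repo code; `threshold` stands in for config.overlap_ratio_threshold):

```python
def overlap_ratio(window, clip):
    """Fraction of the action clip covered by the window."""
    w_start, w_end = window
    c_start, c_end = clip
    inter = max(0, min(w_end, c_end) - max(w_start, c_start))
    return inter / (c_end - c_start)

def clip_to_window(window, clip, threshold=0.75):
    """If the clip overlaps the window enough, return the clip truncated
    to the window (the per-window ground truth); otherwise return None."""
    if overlap_ratio(window, clip) < threshold:
        return None
    return (max(window[0], clip[0]), min(window[1], clip[1]))

# Clip [45, 595] against windows [0, 512] and [128, 640]:
print(clip_to_window((0, 512), (45, 595)))    # kept, truncated to (45, 512)
print(clip_to_window((128, 640), (45, 595)))  # kept, truncated to (128, 595)
```

With a threshold below 0.85 both windows keep a truncated copy of the ground truth; merging the two per-window predictions recovers the full segment [45, 595].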

leemengxing commented 4 years ago

Yes, we can change config.overlap_ratio_threshold to 0.85, but that is not very flexible. Personally, I prefer the resize method, which only needs one prediction.