HYPJUDY / Decouple-SSAD

Decoupling Localization and Classification in Single Shot Temporal Action Detection
https://arxiv.org/abs/1904.07442
MIT License
96 stars 19 forks

Some problems about window_info #9

Closed dagongji10 closed 5 years ago

dagongji10 commented 5 years ago

In gen_data_info.py: (1) At line 63, what does len_df = frame_count - 9 mean? How did you determine the parameter 9? (2) At lines 73-74,

if is_train and n_window == 0:
        windows_start = [0]

when frame_num < window_size, it will still put window_start = 0 into window_info, but then feature extraction cannot find the last few frames (window_size - frame_num). Maybe I should change this for my dataset?

In config.py: (1) At line 55, self.overlap_ratio_threshold = 0.9 filters out windows that do not overlap the action instances well. If I make this parameter smaller, what will happen? Will it reduce accuracy or speed?

HYPJUDY commented 5 years ago

This is adopted from the original SSAD code, where they want to align features. They use 9 because part of their features are extracted by C3D, which splits videos into non-overlapping 16-frame clips, and 9 is roughly the middle of a 16-frame clip. Although I didn't use C3D features, I simply kept this line and forgot to change it. Someone who tried len_df = frame_count told me it had little influence on performance.
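To make the offset concrete, here is a minimal sketch of SSAD-style sliding-window generation with that alignment offset. The function name and the window_size/stride values are illustrative, not the repo's actual defaults:

```python
def gen_window_starts(frame_count, window_size=512, stride=128, offset=9):
    """Generate sliding-window start frames over a video.

    offset=9 mirrors the original SSAD code: it aligns windows with the
    center of C3D's non-overlapping 16-frame clips (9 is about mid-clip).
    """
    len_df = frame_count - offset
    n_window = max((len_df - window_size) // stride + 1, 0)
    return [i * stride for i in range(n_window)]

print(gen_window_starts(1000))  # -> [0, 128, 256, 384]
```

With offset=0 (i.e. len_df = frame_count) the window count changes only when frame_count sits within 9 frames of a stride boundary, which is why the performance impact is small.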

Yes, I simply repeat the last frame for short videos, and since only a few videos are shorter than window_size, the performance is not affected.
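The repeat-last-frame padding can be sketched like this (a minimal illustration, not the repo's exact code; frames here are stood in for by file paths):

```python
def pad_short_video(frames, window_size):
    """Repeat the last frame until the clip has window_size frames.

    `frames` is a list of per-frame items (e.g. frame file paths).
    Videos already long enough are returned unchanged.
    """
    if len(frames) >= window_size:
        return frames
    return frames + [frames[-1]] * (window_size - len(frames))

clip = ["frame_%03d.jpg" % i for i in range(50)]
padded = pad_short_video(clip, 64)
print(len(padded))  # -> 64
```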

I adopt the same value (0.9) for self.overlap_ratio_threshold as the original SSAD code. I think a value that is too small (the coverage ratio of action instances in the selected windows becomes too low) or too large (not enough training data remains) will worsen the results. I didn't fine-tune this parameter, but you can try to get a better result. The speed depends on the amount of data, so you can count the number of selected windows to estimate it.
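The filtering logic amounts to keeping a training window only if it covers enough of some ground-truth instance. A minimal sketch, assuming intervals are (start, end) frame pairs (function names are illustrative):

```python
def coverage_ratio(window, instance):
    """Fraction of the action instance that lies inside the window."""
    ws, we = window
    s, e = instance
    inter = max(0.0, min(we, e) - max(ws, s))
    return inter / (e - s)

def keep_window(window, instances, threshold=0.9):
    # Keep the window if it covers >= threshold of at least one instance.
    return any(coverage_ratio(window, inst) >= threshold for inst in instances)

print(keep_window((0, 64), [(10, 30)]))   # -> True  (instance fully inside)
print(keep_window((0, 64), [(50, 120)]))  # -> False (only 14/70 covered)
```

Lowering the threshold keeps more windows, so training sees more (but noisier) positives and each epoch takes longer; raising it does the opposite.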

dagongji10 commented 5 years ago

@HYPJUDY I changed the code to len_df = frame_count - 5, because in my dataset the actions are short (some only slightly longer than window_size) and feature extraction needs optical_flow_frames=5.

I trained Decouple-SSAD on the NTU RGB+D dataset with window_size=64. Thanks for your help, the results look good.

But I still have a problem: feature extraction with the pretrained TSN model is really slow, because it needs optical flow and dense_flow is slow. If I could set the parameter optical_flow_frames (5 in TSN) to another value, maybe I wouldn't need to compute every frame's optical flow. Have you tried C3D or other models for feature extraction? Is there another way to extract action features without TSN?

HYPJUDY commented 5 years ago

Glad to help, and good to hear about your successful try on another dataset : ) I didn't try other feature extraction methods, because I read in papers that 3D extraction and two-stream extraction have similar performance. But I didn't compare their speed; maybe you can have a try. By the way, you can speed up optical flow extraction with more GPUs and more processes per GPU. Or you can sample videos with a bigger step (a larger interval between two frames), though the performance is expected to degrade to some extent. Luckily, you only need to do feature extraction once, and the following experiments will be very fast.
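The bigger-step sampling idea is just strided frame selection before running optical flow; a minimal sketch (the function name is illustrative):

```python
def sample_frame_indices(frame_count, step=2):
    """Pick every `step`-th frame index, cutting optical-flow work
    roughly by a factor of `step` at some cost in temporal resolution."""
    return list(range(0, frame_count, step))

print(sample_frame_indices(10, step=2))  # -> [0, 2, 4, 6, 8]
```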