Closed: huangyjhust closed this issue 3 years ago
I extracted features on THUMOS14 with a custom feature extractor and found that 30 fps with a sample rate of 4 produces exactly the same length as the I3D features provided by the author.
@Finspire13 @huangyjhust
Actually, the author also uses ten-crop features, which can achieve 2 mAP higher than center-crop features; please refer to https://github.com/Finspire13/CMCS-Temporal-Action-Localization/issues/4#issuecomment-536362341. What's more, the author never states this in the paper.
So I think the paper could be a fraud: the author did not use common settings and never reported that in the paper.
That accusation might be overkill. I could reproduce the dimensions of the released features by tuning the feature extractor's hyper-parameters, i.e. stride and fps. Some details are simply missing from the paper, so I am asking to check whether I'm doing it right.
Maybe it is overkill, but I think the author should make sure all the results can be reproduced and report all the settings used in their experiments.
@mitming @huangyjhust Thanks for running the code. Here are some clarifications.
For THUMOS, the FPS is kept the same as the original video (mostly 30 FPS). I think 25 FPS would be better, though I didn't try it. For ActivityNet, the FPS is set to 25 FPS.
For the sample rate (stride), please refer to the 'base_sample_rate' and 'sample_rate' parameters in the config files. Note that 'base_sample_rate' is the one used at feature extraction and 'sample_rate' is the one used for model input.
@mitming TenCrop is used for feature augmentation, which gives 2 mAP higher than no augmentation. Note that the ablation results in the paper are consistently based on the ten-cropped features (including the baseline), which justifies the proposed model.
All results in our paper can be reproduced (that's why this repo is here). And our trained models are also provided.
Hope this could help.
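To make the FPS and stride settings above concrete, here is a minimal sketch of how a feature index maps to a timestamp. The function names are hypothetical and not taken from the repo; it assumes one feature is emitted every 'base_sample_rate' frames and that feature i starts at frame i * base_sample_rate.

```python
def features_per_second(fps, base_sample_rate=4):
    """How many feature vectors are produced per second of video."""
    return fps / base_sample_rate


def feature_start_time(index, fps=30.0, base_sample_rate=4):
    """Start time in seconds of the chunk behind feature `index`.

    Assumes feature `index` starts at frame index * base_sample_rate.
    """
    return index * base_sample_rate / fps


# THUMOS setting from this thread: ~30 FPS, base_sample_rate 4.
print(features_per_second(30.0))       # 7.5
# The same stride at 25 FPS yields fewer features per second.
print(features_per_second(25.0))       # 6.25
print(feature_start_time(15, 30.0))    # 2.0
```

This is why features extracted at 25 FPS and at 30 FPS have different lengths for the same video, even with the same stride.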
@huangyjhust Does the sample rate (4) here mean the stride of the chunk window? Is it right that a chunk consists of 16 consecutive frames?
Further clarification,
There are two sampling-rate hyper-parameters in the config files: 'base_sample_rate' is the one used at feature extraction (4 for I3D on THUMOS), and 'sample_rate' is the one used for model input (16 for I3D on THUMOS). This difference makes it possible to augment the data temporally during training (please refer to this code).
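One way the gap between the two rates could be used for temporal augmentation is sketched below. This is an assumption for illustration, not the repo's actual code: features extracted every 4 frames are subsampled to one every 16 frames, and a random starting offset picks one of the 16 / 4 = 4 possible phases. The function name 'subsample_features' and the 'offset' argument are hypothetical.

```python
import random


def subsample_features(features, base_sample_rate=4, sample_rate=16, offset=None):
    """Keep one feature every `sample_rate` frames from a dense sequence.

    `features` is a sequence extracted every `base_sample_rate` frames.
    A random `offset` (hypothetical augmentation) selects the phase.
    """
    step = sample_rate // base_sample_rate  # 4 for I3D on THUMOS
    if offset is None:
        offset = random.randrange(step)     # one of `step` temporal shifts
    return features[offset::step]


dense = list(range(10))                     # stand-in for 10 feature vectors
print(subsample_features(dense, offset=0))  # [0, 4, 8]
print(subsample_features(dense, offset=1))  # [1, 5, 9]
```

At test time a fixed offset (for example 0) would keep the input deterministic.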
@huge123 Yes, 'base_sample_rate' is 4 for I3D on THUMOS. An I3D chunk consists of 16 frames.
For details, please refer to the 'base_sample_rate', 'sample_rate', 'base_snippet_size' parameters in the config files.
@Finspire13 Thanks for your patient reply. Actually, I just want to be clear about how the features are extracted, rather than how they are processed. In simple terms, is it right that the first feature vector comes from the first 16 frames, the second from frames 4-20 (a stride of 4), and so on?
@huge123 Yes it is.
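The sliding-window scheme confirmed above can be sketched as follows. It uses 0-indexed, half-open frame ranges, and it assumes a trailing window with fewer than 16 frames is dropped; the actual extractor may pad instead. The function name 'chunk_ranges' is hypothetical.

```python
def chunk_ranges(num_frames, chunk_size=16, stride=4):
    """Return 0-indexed, half-open [start, end) frame ranges, one per feature.

    Assumes incomplete trailing chunks are dropped (not padded).
    """
    if num_frames < chunk_size:
        return []
    last_start = num_frames - chunk_size
    return [(s, s + chunk_size) for s in range(0, last_start + 1, stride)]


ranges = chunk_ranges(100)
print(ranges[:2])   # [(0, 16), (4, 20)]
print(len(ranges))  # 22, i.e. (100 - 16) // 4 + 1
```

Comparing the resulting count against the length of the released features is a quick check that a custom extractor uses the same FPS and stride.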
I found that the extracted I3D features are based on 30 fps, with one feature every 4 frames. Is that correct? Did you not re-sample the videos to 25 fps, as is common practice?