Pilhyeon / Learning-Action-Completeness-from-Points

Official Pytorch Implementation of 'Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization' (ICCV-21 Oral)
MIT License

about thumos14 label #8

Open menghuaa opened 2 years ago

menghuaa commented 2 years ago

Hello, in THUMOS'14, CliffDiving is a subclass of Diving, and the CliffDiving action instances in the annotation file also belong to Diving. Why don't you use this prior knowledge, i.e., remove the CliffDiving instances from the Diving class during training and add a Diving prediction for each predicted CliffDiving instance during post-processing? I think an action instance belonging to two categories may make the training difficult to converge.
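Below is a minimal sketch of the post-processing step described above (it is not part of this repository); the detection tuple format and the helper name `expand_cliffdiving` are assumptions made for illustration.

```python
# Hypothetical post-processing: duplicate every CliffDiving detection as a
# Diving detection, since CliffDiving is a subclass of Diving in THUMOS'14.
# The (video_id, start, end, score, label) format is assumed for illustration.
def expand_cliffdiving(detections):
    expanded = list(detections)
    for video_id, start, end, score, label in detections:
        if label == "CliffDiving":
            expanded.append((video_id, start, end, score, "Diving"))
    return expanded

# Example usage with a single (made-up) detection.
dets = [("video_test_0000004", 12.3, 15.8, 0.91, "CliffDiving")]
print(expand_cliffdiving(dets))
```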

Pilhyeon commented 2 years ago

Thanks for your suggestion!

In fact, I have noticed some papers on fully-supervised temporal action localization that use such a label engineering technique.

However, to my knowledge, existing weakly-supervised approaches do not use it.

Therefore, we did not adopt it, in order to keep the comparison with previous works fair, although it may bring some performance gains.

menghuaa commented 2 years ago

Thanks for your reply. For the point annotations of THUMOS'14, SF-Net provides four annotation files. Are these four files manually annotated? As for the point annotations uniformly sampled from the ground truth mentioned in your paper, were they generated by yourselves or provided by another paper?

Pilhyeon commented 2 years ago

As I have stated in the paper, we used the automatically generated point-level labels that are provided by Moltisanti et al. (CVPR'19).

The point-level labels can be found on their project page, specifically the 'train_df_ts_in_gt.csv' file.
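For reference, a minimal sketch of loading that file is shown below; the column names ('video', 'ts', 'label') are assumptions for illustration and may differ from the released CSV schema.

```python
import pandas as pd

# Load the point-level labels released by Moltisanti et al. (CVPR'19).
# The column names 'video', 'ts' (annotated timestamp), and 'label' are
# assumptions for illustration; check the released CSV for the actual schema.
df = pd.read_csv("train_df_ts_in_gt.csv")

# Group the annotated points (timestamp, class) by video.
points_per_video = {
    video: list(zip(group["ts"], group["label"]))
    for video, group in df.groupby("video")
}
print(len(points_per_video))
```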

menghuaa commented 2 years ago

In the paper, you perform experiments comparing different point-label distributions: Manual, Uniform, and Gaussian. Where did you get the Manual and Uniform labels?

Pilhyeon commented 2 years ago

The Manual labels are provided by SF-Net, while the Uniform-distributed labels are generated from the ground-truth intervals in the dataset construction stage, before training starts.
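As a rough illustration of that construction step, the sketch below simulates a point label inside a ground-truth interval for the Uniform and Gaussian settings; the function names and the standard-deviation choice are assumptions, not the exact procedure in this repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_point(start, end):
    # Sample a point anywhere inside the ground-truth interval.
    return float(rng.uniform(start, end))

def gaussian_point(start, end, std_ratio=0.25):
    # Sample a point around the interval center and clip it into the interval.
    center = (start + end) / 2.0
    std = (end - start) * std_ratio
    return float(np.clip(rng.normal(center, std), start, end))

# Example: one ground-truth interval given in seconds.
print(uniform_point(12.3, 15.8), gaussian_point(12.3, 15.8))
```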

menghuaa commented 2 years ago

I see. SF-Net provides four single-frame text files. Are these four files manually annotated? Do you use one of the txt files?

Pilhyeon commented 2 years ago

All four files contain manual annotations from different annotators. For selection, we followed the SF-Net official code, which randomly chooses an annotator id for each video in the dataset construction stage.
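A minimal sketch of that per-video selection is given below; apart from THUMOS2.txt, which is mentioned later in this thread, the file names are placeholders for illustration.

```python
import random

random.seed(0)

# One manual annotation file per annotator; only THUMOS2.txt is named in this
# thread, the other file names are placeholders for illustration.
annotator_files = ["THUMOS1.txt", "THUMOS2.txt", "THUMOS3.txt", "THUMOS4.txt"]

def assign_annotators(video_ids):
    # Randomly pick one annotator file for each video, in the spirit of the
    # SF-Net-style selection described above.
    return {vid: random.choice(annotator_files) for vid in video_ids}

print(assign_annotators(["video_validation_0000051", "video_validation_0000052"]))
```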

menghuaa commented 2 years ago

Thanks for your reply. Have you noticed that in the annotation file THUMOS2.txt the videos of the CliffDiving class are not also labeled with the parent class Diving? In the other annotation files, the CliffDiving videos do still belong to the parent class Diving.

Pilhyeon commented 2 years ago

I am not sure whether there are any papers that reduce the CliffDiving class to the Diving class. An example of the opposite case is the WTAL-C codebase, which is widely used as a baseline for many other works. You may check how others handle it by navigating their code links here.

menghuaa commented 2 years ago

Hi, I find that the split_test.txt you provide lacks three videos, for example video_test_0000270. I would like to know the reason.

Pilhyeon commented 2 years ago

I followed the implementation of STPN, where it is mentioned that the test split of THUMOS'14 is the same as that of SSN.

In the SSN paper, the authors mentioned that "2 falsely annotated videos (“270”,“1496”) in the test set are excluded in evaluation" and they used only 210 testing videos for evaluation.
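In code, the exclusion amounts to filtering those two ids out of the test list before evaluation; a minimal sketch, assuming the video ids follow the video_test_XXXXXXX naming used above:

```python
# The two falsely annotated test videos excluded by SSN (and STPN).
EXCLUDED = {"video_test_0000270", "video_test_0001496"}

def filter_test_split(video_ids):
    # Keep only the videos that are used for evaluation.
    return [vid for vid in video_ids if vid not in EXCLUDED]
```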

menghuaa commented 2 years ago

Thank you very much.

daidaiershidi commented 1 year ago

Hello, I am sorry to bother you. I am a beginner and would like to ask why some fully-supervised methods, such as ActionFormer, use feature lengths that are inconsistent with the feature lengths you provide. Is it because I3D uses different sampling rates when extracting features?

Pilhyeon commented 1 year ago

The feature lengths depend on the sampling rate and the total number of frames. ActionFormer adopts a smaller stride of 4 (vs. 16 for ours) with a video FPS of 30 (vs. 25 for ours).
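As a quick back-of-the-envelope check, the sketch below estimates the number of features from the frame rate and stride; the 16-frame snippet length and the exact boundary handling are assumptions for illustration.

```python
# Estimate how many I3D features a video yields for a given FPS and stride.
def approx_num_features(duration_sec, fps, stride, snippet_len=16):
    num_frames = int(duration_sec * fps)
    return max(0, (num_frames - snippet_len) // stride + 1)

# The same 100-second video yields many more features with ActionFormer's
# setting (FPS 30, stride 4) than with this repository's (FPS 25, stride 16).
print(approx_num_features(100, fps=30, stride=4))   # 747
print(approx_num_features(100, fps=25, stride=16))  # 156
```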

wj0323i commented 7 months ago

Hello, I would like to ask: the point label is at the frame level, while the video is divided into 16-frame segments. So how is the point-level classification loss applied, given that one is a frame and the other is a segment? Looking forward to your reply.

Pilhyeon commented 7 months ago

The segment within which the labeled point (frame) falls is utilized as a positive sample for the point-level loss.
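A minimal sketch of that mapping is shown below; the variable names and the use of plain cross-entropy (rather than the exact loss form in the paper) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

SNIPPET_LEN = 16  # frames per segment, as discussed above

def point_level_loss(segment_logits, point_frames, point_labels):
    # segment_logits: (T, C) class scores for the T segments of one video
    # point_frames:   annotated frame indices (one per labeled point)
    # point_labels:   class indices (one per labeled point)
    seg_idx = torch.tensor([f // SNIPPET_LEN for f in point_frames])
    seg_idx = seg_idx.clamp(max=segment_logits.size(0) - 1)
    targets = torch.tensor(point_labels)
    # The segment containing each labeled frame is the positive sample.
    return F.cross_entropy(segment_logits[seg_idx], targets)

# Example: 10 segments, 20 classes, two labeled points.
logits = torch.randn(10, 20)
print(point_level_loss(logits, point_frames=[5, 100], point_labels=[3, 7]))
```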

wj0323i commented 7 months ago

Thank you for your reply, and I wish you a happy life!