dingfengshi / TriDet

[CVPR2023] Code for the paper, TriDet: Temporal Action Detection with Relative Boundary Modeling
MIT License

How to run inference on a video for real detection? #7

Closed OpenSorceYCW closed 1 year ago

OpenSorceYCW commented 1 year ago

Thanks for your excellent work. If I want to merge video feature extraction and video action detection for online detection, what should I do specifically, based on your code? Can you give me some advice? I am looking forward to your reply, thank you!

dingfengshi commented 1 year ago

The Trident head may not be usable online, because predicting the end boundary requires seeing several moments after a given moment. A framework that performs action segmentation may be more suitable for you, because it does not need to regress the distance to the boundary. You could first try replacing the detection head with a segmentation head and change the ground-truth format to match the segmentation task.

OpenSorceYCW commented 1 year ago

Sorry, I may not have expressed my thoughts clearly. I want to merge video feature extraction and temporal action detection, so that I can build an end-to-end pipeline whose input is a video and whose output is the temporal action detection results. Looking forward to your reply, thank you!

dingfengshi commented 1 year ago

> Sorry, I may not have expressed my thoughts clearly. I want to merge video feature extraction and temporal action detection, so that I can build an end-to-end pipeline whose input is a video and whose output is the temporal action detection results.

Maybe you can try to choose a backbone and rewrite the input pipeline. TSP has implemented a test pipeline for video input; maybe you can modify that code.
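A minimal sketch of such an end-to-end pipeline, assuming hypothetical `extract_features` and `run_tridet` callables (none of these names come from the TriDet repository; the toy stand-ins only illustrate the data flow from frames to detections):

```python
# Hypothetical glue code: the real feature extractor (e.g. a TSP backbone)
# and the TriDet model would be passed in as callables.
def detect_actions(frames, extract_features, run_tridet, fps=30.0):
    feats = extract_features(frames)  # clip-level features, e.g. one per 16 frames
    video_input = {"feats": feats, "fps": fps, "duration": len(frames) / fps}
    return run_tridet(video_input)    # list of {"segment", "label", "score"} dicts

# Toy stand-ins that only show the shapes involved:
def dummy_extract(frames):
    return [[0.0] * 4 for _ in range(len(frames) // 16)]

def dummy_detect(inp):
    return [{"segment": (0.0, inp["duration"]), "label": 0, "score": 0.9}]

dets = detect_actions([None] * 300, dummy_extract, dummy_detect, fps=30.0)
```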

OpenSorceYCW commented 1 year ago

I will try it later. By the way, in the evaluator (https://github.com/dingfengshi/TriDet/blob/master/libs/utils/train_utils.py#L411), I see that AP is computed using all predicted results regardless of score. Is that reasonable? If I want to run inference, what should I do? Can you release the inference code? Looking forward to your reply!

OpenSorceYCW commented 1 year ago

> Maybe you can try to choose a backbone and rewrite the input pipeline. TSP has implemented a test pipeline for video input; maybe you can modify that code.

I cannot find the test pipeline you mentioned above. Can you give me a specific link?

dingfengshi commented 1 year ago

> I will try it later. By the way, in the evaluator (https://github.com/dingfengshi/TriDet/blob/master/libs/utils/train_utils.py#L411), I see that AP is computed using all predicted results regardless of score. Is that reasonable?

That's OK: the evaluation code sorts the predictions and chooses the top-k instances. We have not implemented inference code; you can simply modify eval.py, and see here for the TSP pipeline, but you must convert the input video to 30 fps first, see here.
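A minimal sketch of the sort-and-keep-top-k filtering described above (the dict keys here are illustrative, not the repository's exact output format):

```python
# Sort predicted segments by confidence and keep the k best, mirroring
# what the evaluation code does before computing AP.
def top_k_detections(detections, k):
    return sorted(detections, key=lambda d: d["score"], reverse=True)[:k]

preds = [
    {"segment": (1.0, 3.5), "label": 2, "score": 0.42},
    {"segment": (0.5, 2.0), "label": 0, "score": 0.91},
    {"segment": (4.0, 6.0), "label": 1, "score": 0.17},
]
best = top_k_detections(preds, k=2)
```

Low-scoring predictions therefore barely affect AP even though they are all passed in: they sort to the bottom and are cut by the top-k limit.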

OpenSorceYCW commented 1 year ago

> That's OK: the evaluation code sorts the predictions and chooses the top-k instances. We have not implemented inference code; you can simply modify eval.py.

Thank you for your reply! I want to use the TriDet network to classify untrimmed videos that each contain one action. What modifications should I make to the base code, especially to convert TriDet's detection output into a video-level classification result? Can you give me your advice? Thank you again!

dingfengshi commented 1 year ago

> Thank you for your reply! I want to use the TriDet network to classify untrimmed videos that each contain one action. What modifications should I make to the base code, especially to convert TriDet's detection output into a video-level classification result?

Maybe you can keep the instant with the top-1 classification score and its corresponding offsets for each video.
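A sketch of that post-processing, assuming a hypothetical `video_level_class` helper (not part of the repository) that collapses per-video detections to a single label by keeping only the highest-scoring instance:

```python
# Hypothetical post-processing: reduce a video's detections to one
# video-level label (plus the winning segment, if you need it).
def video_level_class(detections):
    if not detections:
        return None, None
    best = max(detections, key=lambda d: d["score"])
    return best["label"], best["segment"]

label, segment = video_level_class([
    {"segment": (2.0, 8.5), "label": 3, "score": 0.76},
    {"segment": (0.0, 1.0), "label": 1, "score": 0.12},
])
```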

OpenSorceYCW commented 1 year ago

Corresponding offsets? In which process do you mean: train, eval, or infer? I only use the top-1 classification score in my inference process; is that reasonable? Can you describe it in detail? Thanks again!

dingfengshi commented 1 year ago

> Corresponding offsets? In which process do you mean: train, eval, or infer? I only use the top-1 classification score in my inference process; is that reasonable?

In eval or infer, the code selects the instants with the top-k classification scores from the FPN and then takes their corresponding offsets (the distances from the chosen instant to the start and end boundaries) decoded by the Trident head (see here). The choice of k depends on your dataset; you can try different values of k to achieve better performance. If no video has more than one action, you can choose a low value. I cannot tell you directly which k you should use.
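The decoding step above can be sketched as follows; this is a simplification (the real Trident head also scales the regressed offsets by each FPN level's stride), with a hypothetical `decode_segment` name:

```python
# An instant t plus its two regressed offsets (distances to the action's
# start and end) are decoded into a segment [t - d_start, t + d_end].
def decode_segment(instant_t, start_offset, end_offset):
    start = instant_t - start_offset
    end = instant_t + end_offset
    return (start, end)

seg = decode_segment(instant_t=10.0, start_offset=2.5, end_offset=4.0)
```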

OpenSorceYCW commented 1 year ago

Thanks, I will research it later. Besides, if I want to use the TriDet model on my custom dataset, where each video has one action, how many samples are needed to train the network? And if videos have multiple actions, how should I balance the sample sizes? I am looking forward to your experience and suggestions!

dingfengshi commented 1 year ago

> Thanks, I will research it later. Besides, if I want to use the TriDet model on my custom dataset, where each video has one action, how many samples are needed to train the network? And if videos have multiple actions, how should I balance the sample sizes?

That's a complicated question, because the setting varies among datasets and also depends on the feature-extraction ability of the backbone (if you can train a recognition model on your dataset, I guess that would be better). You could refer to the dataset setting of THUMOS14, which is a small dataset; however, small datasets tend to have high variance in performance across experiments. If you want to tune the network more conveniently, refer instead to the dataset setup of HACS, a much larger dataset for the TAD task.