HYPJUDY / Decouple-SSAD

Decoupling Localization and Classification in Single Shot Temporal Action Detection
https://arxiv.org/abs/1904.07442
MIT License

Performance on ActivityNet? #2

Closed jiujing23333 closed 5 years ago

jiujing23333 commented 5 years ago

Hi, Yupan. Thanks for your impressive work. Have you tested the model on ActivityNet v1.3?

rahman-mdatiqur commented 5 years ago

Yes, I would also like to know how well it performs on a challenging temporally untrimmed dataset like ActivityNet.

HYPJUDY commented 5 years ago

Hi, thanks for your interest! I tried SSAD on ActivityNet but couldn't get it to work well. I think that if SSAD can be made to work, Decouple-SSAD should improve on it, since Decouple-SSAD strengthens SSAD's network architecture.

My results were much lower than typical values (mine < 0.2 vs. others > 0.3 mAP@0.5, if I remember correctly), so I suspect I didn't extract the right features for ActivityNet. I didn't have time to run more experiments back then, so there is still room for improvement.

Here is the main difference between the two datasets that I considered, for your reference. Most action instances in THUMOS14 are shorter than 512 frames, so Decouple-SSAD adopts the same preprocessing as SSAD: sliding a window of length 512 over each video. Three anchor layers with 5 scale ratios can then cover most of the action instances.

The window length Lw is set to 512, which means approximately 20 seconds of video with 25 fps. This choice is based on the fact that 99.3% action instances in the training set have smaller length than 20 seconds. (Quoted from SSAD)
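To make the quoted numbers concrete, here is a minimal sketch of the window arithmetic and of sliding a 512-frame window over a video. The function names, the stride of 128 frames, and the way the last window is aligned to the end of the video are assumptions for illustration only; they are not taken from the SSAD or Decouple-SSAD code.

```python
from typing import List, Tuple


def window_seconds(window_frames: int = 512, fps: float = 25.0) -> float:
    """Duration in seconds covered by one window at the given frame rate."""
    return window_frames / fps


def sliding_windows(num_frames: int, window: int = 512,
                    stride: int = 128) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive.

    The stride value is a hypothetical choice. Videos shorter than one
    window yield a single range that the caller would need to pad.
    """
    if num_frames <= window:
        return [(0, window)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window aligned to the end so trailing frames are covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    print(window_seconds())          # 512 frames at 25 fps
    print(sliding_windows(1000))     # windows over a 1000-frame video
```

With these assumptions, a long ActivityNet video simply produces more windows, but an action instance longer than 512 frames can never fit inside a single window, which is the scale problem described here.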

However, ActivityNet has a much wider range of video lengths and action instance lengths (instances can be very short or very long). So I adopted a network architecture and scale ratios similar to those of Prop-SSAD for ActivityNet.

(Image omitted. Image credit: Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017)

Maybe you can explore further how to handle the variable scales of video length and action instance length in one-stage temporal action detection methods. You are encouraged to run experiments on other datasets : ) Please let me know if you succeed!

rahman-mdatiqur commented 5 years ago

Dear @HYPJUDY thanks a lot for your really useful insights into the issue. I highly appreciate it.

HYPJUDY commented 5 years ago

You are welcome! We can discuss more if you encounter other problems since I struggled with ActivityNet for a time and will be very pleased to know if someone can solve it!

jiujing23333 commented 5 years ago

Thanks for your insightful discussion, I'll try it.

HYPJUDY commented 4 years ago

Hi all, for ActivityNet, maybe you can try setting much smaller weights for the classification loss (Lcls) and the localization loss (Lloc), and a much bigger weight for the overlap loss (Lov). That is, in equation (9) of Decouple-SSAD (where the localization term is written as $L_{reg}$),

${L} = \alpha \cdot{L}_{cls} + \beta \cdot{L}_{reg} + \gamma \cdot{L}_{ov}$

I set alpha, beta and gamma to 1, 10 and 10 respectively. These values suit THUMOS14 but may not suit ActivityNet. Another single-stage paper sets these parameters to 1, 2 and 75, which is very different from my setting and works on ActivityNet. This is probably not the root cause or the real solution, but it is worth trying.
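As a rough illustration of how the two weight settings change the balance of the loss, here is a minimal sketch of the weighted sum in equation (9). The class and function names are hypothetical, the input loss values are made-up numbers, and it assumes the other paper's 1, 2 and 75 map to alpha, beta and gamma in the same order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    alpha: float  # weight of the classification loss L_cls
    beta: float   # weight of the localization loss L_reg
    gamma: float  # weight of the overlap loss L_ov


# Values mentioned in this thread; the names are only labels for the sketch.
THUMOS14_WEIGHTS = LossWeights(alpha=1.0, beta=10.0, gamma=10.0)
OTHER_PAPER_WEIGHTS = LossWeights(alpha=1.0, beta=2.0, gamma=75.0)  # order assumed


def total_loss(l_cls: float, l_reg: float, l_ov: float,
               w: LossWeights) -> float:
    """L = alpha * L_cls + beta * L_reg + gamma * L_ov."""
    return w.alpha * l_cls + w.beta * l_reg + w.gamma * l_ov


if __name__ == "__main__":
    # Made-up per-term losses, only to compare the two weightings.
    l_cls, l_reg, l_ov = 1.0, 0.5, 0.25
    print(total_loss(l_cls, l_reg, l_ov, THUMOS14_WEIGHTS))
    print(total_loss(l_cls, l_reg, l_ov, OTHER_PAPER_WEIGHTS))
```

In a training script the three terms would be tensors rather than floats, but the weighting is the same, so exposing alpha, beta and gamma as configuration makes it easy to try the second setting on ActivityNet.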

rahman-mdatiqur commented 4 years ago

Thanks @HYPJUDY . Really appreciate the pointer!