ColumbiaDVMM / CDC

CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos

Some questions about the paper #1

Closed: bityangke closed this issue 7 years ago

bityangke commented 7 years ago

Hi Zheng, I am Ke Yang from NUDT. I sent you an e-mail, but the mail server told me the delivery failed, so I am posting it here instead. I am very sorry to bother you again, but I would like to ask about some details in your CVPR 2017 paper, "CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos":

  1. For the per-frame labeling results in Table 1, is the AP calculated over only the ~410,000 action frames, or over all ~1,300,000 frames including the background frames?
  2. In Section 3.4 "Optimization" you wrote "4 training epochs (within half a day) on THUMOS’14 with 48,780 training windows." But with non-overlapping windows, and after excluding segments that contain only background frames, there are only 20,000+ segments left.
  3. Also, could you please send me the list of the 20 per-class AP results for per-frame labeling?

Thank you very much in advance!

Wish you a good day!

Ke Yang, NUDT
2017/03/28

zhengshou commented 7 years ago

Hi Ke,

  1. In the frame-level classification task, for each action class we want to output a ranked list of all frames from all test videos, no matter whether a frame is background or action (a minimal sketch of this computation is given after this list).

  2. I am not sure whether you have included the windows from UCF101, which is also part of the training data for THUMOS.

  3. Here you go: 0.353557126593323 0.642114296162291 0.189760542773722 0.733280357644924 0.698720198345160 0.338009204462311 0.192396853482007 0.718121969400894 0.297034795495151 0.547921125656887 0.594920564381471 0.163848854637057 0.383343482364106 0.604982580023873 0.694457766845914 0.497568304164384 0.284239934631868 0.461509067905906 0.227501374780542 0.262718506391104
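For readers following point 1, here is a minimal sketch of such a ranked-list, frame-level AP for one class. It is not code from the CDC release; `scores` and `labels` are hypothetical per-frame inputs that you would need to assemble yourself.

```python
import numpy as np

def frame_ap(scores, labels):
    """Frame-level AP for one action class.

    scores: predicted scores for this class, one per test frame (background and action alike).
    labels: 1 if the frame belongs to this class, else 0.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    order = np.argsort(-scores)                 # rank all frames, highest score first
    labels = labels[order]
    tp = np.cumsum(labels)                      # true positives seen at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    # Non-interpolated AP: mean precision at the ranks of the positive frames.
    return float((precision * labels).sum() / max(labels.sum(), 1))
```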

Best, Zheng

shanshuo commented 6 years ago

Hi @zhengshou,

For Ke's second question, I have a similar puzzle. During training, do you use all of the background frames from the validation set, or only part of them? I counted the number of frames for each class and found that background far outnumbers the other classes: https://docs.google.com/spreadsheets/d/1B0ToBFPy_5GHefxXSu684mEtFcDpcgYuqhLfSVw0Inc/edit?usp=sharing
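In case it helps to reproduce that count, here is a minimal sketch of the kind of per-class frame tally behind the spreadsheet above. The annotation format is assumed (one integer label per frame, 0 for background); it is not the original counting script.

```python
from collections import Counter

def count_frames_per_class(frame_labels_per_video):
    """frame_labels_per_video: assumed dict {video_id: list of per-frame int labels},
    with 0 standing for background."""
    counts = Counter()
    for labels in frame_labels_per_video.values():
        counts.update(labels)
    return counts  # counts[0] is the number of background frames
```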

When I use the training set and validation set together, the loss does not converge well (see the attached train_loss plot). If I train on the validation set only, the loss does not converge either (train_loss2). But when I use the training set only, the loss curve looks much more reasonable (train_loss1). Is it because the validation set has too many background frames? What do you think the reason might be? Thanks a lot.

zhengshou commented 6 years ago

@shanshuo As we mentioned in the paper, "To prevent including too many background frames for training, we only keep windows that have at least one frame belonging to actions"

I used both train set and val set for training.
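A minimal sketch of that window-filtering rule, under the assumption that each training video comes with one integer label per frame (0 = background) and that training windows are taken with a fixed length and stride; the window length and stride below are placeholders, not the exact values from the paper's training setup.

```python
def keep_training_windows(labels, window_len=32, stride=32):
    """labels: assumed per-frame integer labels for one video (0 = background).
    Returns (start, end) index pairs of the windows kept for training."""
    windows = []
    for start in range(0, len(labels) - window_len + 1, stride):
        window = labels[start:start + window_len]
        if any(l != 0 for l in window):         # keep only windows with >= 1 action frame
            windows.append((start, start + window_len))
    return windows
```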