facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

X3D for detection MAP mismatch #371

Open msarmiento3 opened 3 years ago

msarmiento3 commented 3 years ago

First of all, thank you very much for this amazing repository. I am trying to reproduce the results of X3D_M on AVA v2.2 following what the paper says:

Since our paper focuses on efficiency, by default, we do not increase the spatial resolution of res5 by 2× [15]. Region-of-interest (RoI) features [21] are extracted at the last feature map of res5 by extending a 2D proposal at a frame into a 3D RoI by replicating it along the temporal axis, similar as done in previous work [24, 40, 66], followed by application of frame-wise RoIAlign [27] and temporal global average pooling. The RoI features are then max-pooled and fed to a per-class, sigmoid classifier for prediction.

The network weights are initialized from the Kinetics models and we use step-wise learning rate decay, that is reduced by 10× when validation error saturates. We train for 14k iterations (68 epochs for ∼211k data), with linear warm-up [23] for the first 1k iterations and use a weight decay of 10⁻⁷, as in [15].
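
For concreteness, the head described in the first quoted paragraph can be sketched in a few lines of PyTorch. This is a minimal illustration only, assuming a res5 feature map of shape (N, 192, T, H, W) for X3D_M and boxes given as (batch_index, x1, y1, x2, y2) in input-image coordinates; the class name and defaults are mine, and the ordering (temporal pool before RoIAlign) follows the ResNetRoIHead printout that appears later in this thread, not necessarily the repository's exact API:

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_align

    class X3DRoIHeadSketch(nn.Module):
        """Sketch: temporal avg pool -> RoIAlign -> spatial max pool -> per-class sigmoid."""

        def __init__(self, dim_in=192, num_classes=80, resolution=7, spatial_scale=1 / 32):
            super().__init__()
            self.resolution = resolution
            self.spatial_scale = spatial_scale  # 1 / total spatial stride of res5
            self.dropout = nn.Dropout(0.5)
            self.projection = nn.Linear(dim_in, num_classes)

        def forward(self, feats, boxes):
            # feats: (N, C, T, H, W) res5 features; boxes: (K, 5) = (batch_idx, x1, y1, x2, y2)
            feats = feats.mean(dim=2)  # temporal global average pooling -> (N, C, H, W)
            rois = roi_align(feats, boxes, output_size=self.resolution,
                             spatial_scale=self.spatial_scale, aligned=True)  # (K, C, 7, 7)
            rois = rois.amax(dim=(2, 3))  # max-pool the RoI features -> (K, C)
            return torch.sigmoid(self.projection(self.dropout(rois)))  # per-class sigmoid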

However, I am not able to achieve the reported mAP of 23.2. I am using the K400 pretrain for initialisation, and my training is stuck at 21.5. I did not expect such a large difference in mAP to come from the pretraining change. Is there anything else, not reported in the paper, that you use for the detection training? Or do you know what the expected mAP is when using the K400 pretrain?

Thank you again!

yurinishikawa commented 3 years ago

@msarmiento3 Hi. I am also interested in reproducing the results of X3D on AVA v2.2. My question: as pointed out in #350, the detection head is not implemented in the current codebase. Did you implement the detection head yourself to perform the training? Please excuse me for not answering your question.

yurinishikawa commented 3 years ago

@msarmiento3 @feichtenhofer Hi. I reviewed the code carefully and implemented the detection head for X3D using ResNetRoIHead in head_helper.py. Concretely, my detection head for X3D_M now looks like this:

    (head): ResNetRoIHead(
      (s0_tpool): AvgPool3d(kernel_size=[16, 1, 1], stride=1, padding=0)
      (s0_roi): ROIAlign(output_size=[7, 7], spatial_scale=0.0625, sampling_ratio=0, aligned=True)
      (s0_spool): MaxPool2d(kernel_size=[7, 7], stride=1, padding=0, dilation=1, ceil_mode=False)
      (dropout): Dropout(p=0.5, inplace=False)
      (projection): Linear(in_features=192, out_features=80, bias=True)
      (act): Sigmoid()
    )

However, the mAP is stuck at around 15.7 after 30 epochs on AVA v2.2, which is much lower than the numbers you mentioned above. Do you notice any difference compared to your implementation? For what it's worth, I achieved approx. 20.2 mAP for SLOW_8x8_R50_SHORT and 24.5 for SLOWFAST_32x2_R50_SHORT, so I think my dataset preparation is fine. I'm using the sigmoid function for HEAD_ACT and binary cross-entropy as the loss function.

Thanks for your help in advance!
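
For reference, that setup (per-class sigmoid for HEAD_ACT plus bce as the loss) is plain multi-label binary cross-entropy over the 80 AVA classes. A tiny self-contained sketch with made-up shapes, not code from the repository:

    import torch
    import torch.nn as nn

    criterion = nn.BCELoss()                       # expects probabilities, i.e. post-sigmoid scores
    preds = torch.sigmoid(torch.randn(4, 80))      # 4 RoIs x 80 AVA classes
    labels = torch.randint(0, 2, (4, 80)).float()  # multi-hot targets: a box can carry several actions
    loss = criterion(preds, labels)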

msarmiento3 commented 3 years ago

Actually, there could be many things wrong; it is hard to know without seeing your X3D_M.yaml. But at first sight I see that your SPATIAL_SCALE_FACTOR is wrong: I guess you are using the default for SlowFast, which is 16, but for X3D it should be 32.
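
The connection to the head printed above: ROIAlign's spatial_scale is the reciprocal of SPATIAL_SCALE_FACTOR, i.e. of the backbone's total spatial stride at res5. A quick illustrative check:

    # spatial_scale = 1 / SPATIAL_SCALE_FACTOR maps image-space box coordinates onto the feature map
    print(1 / 16)  # 0.0625  -- the value in the ROIAlign printout above (SlowFast default)
    print(1 / 32)  # 0.03125 -- what X3D needs, since its res5 is not upsampled by 2x (see the paper quote)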

yurinishikawa commented 3 years ago

@msarmiento3 Thank you so much for your quick reply. I'll try with SPATIAL_SCALE_FACTOR=32.
Here is my X3D_M.yaml. I'd appreciate it if you could point out anything that looks wrong.

TRAIN:
  ENABLE: True
  DATASET: ava 
  BATCH_SIZE: 64
  EVAL_PERIOD: 1
  CHECKPOINT_PERIOD: 1
  AUTO_RESUME: True
DATA:
  NUM_FRAMES: 16
  SAMPLING_RATE: 5
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224 
  TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 1
  #TEST_CROP_SIZE: 256 # use if TEST.NUM_SPATIAL_CROPS: 3
  INPUT_CHANNEL_NUM: [3] 
  DECODING_BACKEND: torchvision
DETECTION:
  ENABLE: True
  ALIGNED: True
AVA:
  DETECTION_SCORE_THRESH: 0.9 
  TRAIN_PREDICT_BOX_LISTS: [
    "ava_train_v2.2.csv",
    "person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv",
  ]
  TEST_PREDICT_BOX_LISTS: ["person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
  FRAME_DIR: "/path/to/data/AVA/frames/"
  FRAME_LIST_DIR: "/path/to/data/AVA/frame_lists/"
  ANNOTATION_DIR: "/path/to/data/AVA/annotations/"
X3D:
  WIDTH_FACTOR: 2.0 
  DEPTH_FACTOR: 2.2 
  BOTTLENECK_FACTOR: 2.25
  DIM_C5: 2048
  DIM_C1: 12
RESNET:
  ZERO_INIT_FINAL_BN: True
  TRANS_FUNC: x3d_transform
  STRIDE_1X1: False
BN:
  USE_PRECISE_STATS: False #True on Kinetics
  NUM_BATCHES_PRECISE: 200 
  WEIGHT_DECAY: 0.0 
SOLVER:
  BASE_LR: 0.1 # 16 machine
  BASE_LR_SCALE_NUM_SHARDS: True
  LR_POLICY: steps_with_relative_lrs
  STEPS: [0, 10, 15, 20] 
  LRS: [1, 0.1, 0.01, 0.001]
  MAX_EPOCH: 20
  WEIGHT_DECAY: 1e-7
  WARMUP_EPOCHS: 5.0
  WARMUP_START_LR: 0.000125
  OPTIMIZING_METHOD: sgd
MODEL:
  NUM_CLASSES: 80
  ARCH: x3d
  MODEL_NAME: X3D
  LOSS_FUNC: bce # cross_entropy
  DROPOUT_RATE: 0.5
  HEAD_ACT: sigmoid
TEST:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 8
  # CHECKPOINT_FILE_PATH: 'x3d_s.pyth' # 73.50% top1 30-view accuracy to download from the model zoo (optional).
  NUM_SPATIAL_CROPS: 1
  #NUM_SPATIAL_CROPS: 3
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .

yurinishikawa commented 3 years ago

Thanks for your help the other day. I ran a 20-epoch training with SPATIAL_SCALE_FACTOR=32 and got 18.0 mAP for X3D_M. Although that is lower than what you mentioned, I saw significant improvements on dynamic actions (such as 'walk', 'eat', and 'dance'). I also got 21.4 mAP for X3D_L.

msarmiento3 commented 3 years ago

Let it train; don't decrease the learning rate until the validation mAP saturates. That LR schedule is for SlowFast, and in my experience X3D needs more epochs than SlowFast to reach the target mAP.
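
In case it helps, steps_with_relative_lrs picks the learning rate by epoch bucket, so stretching STEPS (and MAX_EPOCH) is what delays the decay. A small sketch of that logic; the stretched values here are hypothetical, just to illustrate the point:

    def relative_lr(epoch, base_lr=0.1, steps=(0, 10, 15, 20), lrs=(1, 0.1, 0.01, 0.001)):
        """steps_with_relative_lrs: scale base_lr by lrs[i] once epoch >= steps[i]."""
        i = max(j for j, s in enumerate(steps) if epoch >= s)
        return base_lr * lrs[i]

    # The schedule above decays at epochs 10/15/20; a stretched, hypothetical schedule
    # holds the LR at base_lr until validation mAP actually saturates.
    print(relative_lr(12))                         # 0.01 with the original STEPS
    print(relative_lr(12, steps=(0, 20, 30, 40)))  # 0.1 with stretched STEPS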

yurinishikawa commented 3 years ago

@msarmiento3 I'll try again with more training epochs. Thanks for your help.

ZainZhao commented 3 years ago

Thank you for this discussion. I wonder whether running the X3D models requires converting the videos to frames first, or whether that is only needed for AVA and not for Kinetics?

Thank you

Gi-gigi commented 3 years ago

> @msarmiento3 @feichtenhofer Hi. I reviewed the code carefully and implemented the detection head for X3D by using ResNetRoIHead in head_helper.py. […] I'm using the sigmoid function for HEAD_ACT and binary cross-entropy as the loss function. […]

I saw you mentioned the mAP for SLOWFAST_32x2_R50_SHORT; I am currently running its test, but I hit the following problem:

    File "/media/UBUNTU/anaconda/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
        data = fetcher.fetch(index)
    File "/media/UBUNTU/anaconda/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
        return self.collate_fn(data)
    File "/media/UBUNTU/VideoRecongnization/SlowFast5_test_mydatas/SlowFast-master/slowfast/datasets/loader.py", line 62, in detection_collate
        inputs, video_idx = default_collate(inputs), default_collate(video_idx)
    File "/media/UBUNTU/anaconda/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 84, in default_collate
        return [default_collate(samples) for samples in transposed]
    File "/media/UBUNTU/anaconda/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 84, in <listcomp>
        return [default_collate(samples) for samples in transposed]
    File "/media/UBUNTU/anaconda/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 56, in default_collate
        return torch.stack(batch, 0, out=out)
    RuntimeError: stack expects each tensor to be equal size, but got [3, 8, 224, 280] at entry 0 and [3, 8, 224, 398] at entry 1

I don't know how to solve it. Would you mind sharing the SLOWFAST_32x2_R50_SHORT.yaml you used for the test? Can you give me some advice on this issue? This is my SLOWFAST_32x2_R50_SHORT.yaml:

TRAIN:
  ENABLE: False
  DATASET: ava
  BATCH_SIZE: 2 #64
  EVAL_PERIOD: 5
  CHECKPOINT_PERIOD: 1
  AUTO_RESUME: True
  CHECKPOINT_FILE_PATH: . #path to pretrain model
  CHECKPOINT_TYPE: caffe2
DATA:
  NUM_FRAMES: 32
  SAMPLING_RATE: 2
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3, 3]
  PATH_TO_DATA_DIR: '/media/UBUNTU/VideoRecongnization/ava_test_videos/ava'
DETECTION:
  ENABLE: True
  ALIGNED: False
AVA:
  FRAME_DIR: '/media/gigigi/UBUNTU/VideoRecongnization/ava_test_videos/ava/frames'
  FRAME_LIST_DIR: '/media/gigigi/UBUNTU/VideoRecongnization/ava_test_videos/ava/frame_lists'
  ANNOTATION_DIR: '/media/gigigi/UBUNTU/VideoRecongnization/ava_test_videos/ava/annotations'
  DETECTION_SCORE_THRESH: 0.8
  TEST_PREDICT_BOX_LISTS: ["person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
SLOWFAST:
  ALPHA: 4
  BETA_INV: 8
  FUSION_CONV_CHANNEL_RATIO: 2
  FUSION_KERNEL_SZ: 7
RESNET:
  ZERO_INIT_FINAL_BN: True
  WIDTH_PER_GROUP: 64
  NUM_GROUPS: 1
  DEPTH: 50
  TRANS_FUNC: bottleneck_transform
  STRIDE_1X1: False
  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]
  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]
  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]
NONLOCAL:
  LOCATION: [[[], []], [[], []], [[], []], [[], []]]
  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]
  INSTANTIATION: dot_product
  POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]
BN:
  USE_PRECISE_STATS: False
  NUM_BATCHES_PRECISE: 200
SOLVER:
  BASE_LR: 0.1
  LR_POLICY: steps_with_relative_lrs
  STEPS: [0, 10, 15, 20]
  LRS: [1, 0.1, 0.01, 0.001]
  MAX_EPOCH: 20
  MOMENTUM: 0.9
  WEIGHT_DECAY: 1e-7
  WARMUP_EPOCHS: 5.0
  WARMUP_START_LR: 0.000125
  OPTIMIZING_METHOD: sgd
MODEL:
  NUM_CLASSES: 80
  ARCH: slowfast
  MODEL_NAME: SlowFast
  LOSS_FUNC: bce
  DROPOUT_RATE: 0.5
  HEAD_ACT: sigmoid
TEST:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 8
  CHECKPOINT_FILE_PATH: '/media/UBUNTU/VideoRecongnization/SlowFast5_test_mydatas/SlowFast-master/configs/AVA/c2/SLOWFAST_32x2_R101_50_50_test.pkl' #path to pretrain model
DATA_LOADER:
  NUM_WORKERS: 2
  PIN_MEMORY: True
NUM_GPUS: 1
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .

There is no problem with the file paths. Thanks for your help in advance!
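
Not an answer from the maintainers, but the traceback itself is just torch.stack refusing tensors whose widths differ (the AVA test clips are scaled to a fixed height with varying width). A workaround often reported for this codebase is to keep one clip per GPU, i.e. TEST.BATCH_SIZE equal to NUM_GPUS. Alternatively, here is a hedged sketch of a collate that pads clips to a common size before stacking; it is an illustration, not the repository's fix:

    import torch
    import torch.nn.functional as F

    def pad_and_stack(clips):
        """Pad (C, T, H, W) clips to the batch-wide max H and W, then stack.

        Sketch only: it mainly shows why torch.stack failed on
        [3, 8, 224, 280] vs [3, 8, 224, 398] in the traceback above.
        """
        max_h = max(c.shape[2] for c in clips)
        max_w = max(c.shape[3] for c in clips)
        # F.pad pads the last dim first: (w_left, w_right, h_top, h_bottom)
        padded = [F.pad(c, (0, max_w - c.shape[3], 0, max_h - c.shape[2])) for c in clips]
        return torch.stack(padded, dim=0)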

BonGum commented 10 months ago

@yurinishikawa Thank you so much, my dear friend!