facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

Reproduce the result of MViT on AVA #595

Open tranxuantuyen opened 2 years ago

tranxuantuyen commented 2 years ago

Hi, thanks for the code base

I'm trying to reproduce the results on the AVA dataset with the MViT model. I noticed that while the code is available, the config for fine-tuning is not provided. I built the config file below from the implementation details reported in the paper, but I only get around 20 mAP. Am I going about reproducing the results the right way?

Any suggestions or discussions are welcome, thank you

TRAIN:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 128
  EVAL_PERIOD: 1
  CHECKPOINT_PERIOD: 10
  AUTO_RESUME: True
  CHECKPOINT_FILE_PATH: checkpoints/MViTv2_S_16x4_k400_f302660347.pyth
  CHECKPOINT_TYPE: pytorch
  CHECKPOINT_EPOCH_RESET: True
DATA:
  NUM_FRAMES: 16
  SAMPLING_RATE: 4
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
DETECTION:
  ENABLE: True
  ALIGNED: True
AVA:
  BGR: False
  DETECTION_SCORE_THRESH: 0.8
  TEST_PREDICT_BOX_LISTS: ["person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
  FRAME_DIR: "Workspace/dataset/ava/frames/"
  FRAME_LIST_DIR: "Workspace/dataset/ava/frame_lists/"
  ANNOTATION_DIR: "Workspace/dataset/ava/annotations/"
  TRAIN_PREDICT_BOX_LISTS: [
    "ava_train_v2.2.csv",
    "person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv"
  ]
MVIT:
  ZERO_DECAY_POS_CLS: False
  USE_ABS_POS: False
  SEP_POS_EMBED: True
  REL_POS_SPATIAL: True
  REL_POS_TEMPORAL: True
  DEPTH: 16
  NUM_HEADS: 1
  EMBED_DIM: 96
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.4
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: False
  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 2, 2], [15, 1, 1, 1]]
  DROPOUT_RATE: 0.0
  DIM_MUL_IN_ATT: True
  RESIDUAL_POOLING: True
AUG:
  NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: False
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 1.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
BN:
  USE_PRECISE_STATS: False
  NUM_BATCHES_PRECISE: 200
SOLVER:
  ZERO_WD_1D_PARAM: True
  CLIP_GRAD_L2NORM: 1.0
  BASE_LR_SCALE_NUM_SHARDS: True
  BASE_LR: 0.6
  COSINE_END_LR: 0.06
  WARMUP_START_LR: 0.06
  WARMUP_EPOCHS: 5.0
  LR_POLICY: cosine
  MAX_EPOCH: 30
  MOMENTUM: 0.9
  WEIGHT_DECAY: 1e-8
  OPTIMIZING_METHOD: sgd
  COSINE_AFTER_WARMUP: True
MODEL:
  NUM_CLASSES: 80
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: bce
  DROPOUT_RATE: 0.5
  HEAD_ACT: sigmoid
TEST:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 8
  NUM_SPATIAL_CROPS: 1
DATA_LOADER:
  NUM_WORKERS: 32
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: Workspace/project/Ava_Transforme/log_out
TENSORBOARD:
  ENABLE: True
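
For reference, I launch this with python tools/run_net.py --cfg pointing at the file above. Since the config is hand-built, I also merge it against the repo defaults first, which immediately flags any mistyped or unsupported key; a minimal sketch of that check (the config path below is just a placeholder for wherever the YAML above is saved):

from slowfast.config.defaults import get_cfg

# Merge the hand-built YAML above into the repo defaults. The merge raises on
# keys that do not exist in slowfast/config/defaults.py, so a typo'd option
# shows up here instead of at launch time.
cfg = get_cfg()
cfg.merge_from_file("configs/AVA/MVITv2_S_16x4.yaml")  # placeholder path to the config above

# Spot-check a few of the values that matter most for AVA fine-tuning.
print(cfg.DETECTION.ENABLE, cfg.MODEL.NUM_CLASSES,
      cfg.TRAIN.CHECKPOINT_FILE_PATH, cfg.SOLVER.BASE_LR)
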
Balakishan77 commented 2 years ago

Hi @tranxuantuyen,

I am also interested in using MViT models on the AVA dataset.

I have noticed that there are no AVA pre-trained models available in SlowFast. If you trained for 30 epochs from scratch, that would explain the low mAP.

Happy to discuss more on this.
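
One way to confirm whether the run actually started from the K400 checkpoint in your config (rather than from scratch) is to compare that checkpoint against one of the checkpoints your fine-tuning run saved: backbone tensors that match by name and shape would have been loaded, while anything unmatched beyond the swapped classification/detection head would have started from random init. A minimal sketch, assuming the usual PySlowFast checkpoint layout with weights under a "model_state" key and epoch files named checkpoint_epoch_000NN.pyth:

import torch

def load_state(path):
    # PySlowFast checkpoints usually wrap the weights in a "model_state" entry;
    # fall back to the raw object if this particular file does not.
    ckpt = torch.load(path, map_location="cpu")
    return ckpt.get("model_state", ckpt) if isinstance(ckpt, dict) else ckpt

k400 = load_state("checkpoints/MViTv2_S_16x4_k400_f302660347.pyth")
ava = load_state("Workspace/project/Ava_Transforme/log_out/checkpoints/checkpoint_epoch_00010.pyth")

matched = [k for k, v in k400.items() if k in ava and ava[k].shape == v.shape]
print(f"{len(matched)} / {len(k400)} K400 tensors match the fine-tuned model by name and shape")

# Mismatches around the head are expected (400-way K400 classifier vs. 80-way AVA
# detection head); a long list of unmatched backbone tensors is the red flag.
for name in sorted(set(k400).symmetric_difference(ava)):
    print("unmatched:", name)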

e4s2022 commented 2 years ago

@Balakishan77 @tranxuantuyen hi,

I am also trying to reproduce MViT these days. I first tried to pre-train MViT from scratch, but found the reconstruction results were poor, especially the colour.

So I switched to the pre-trained model provided by the authors and ran a sanity check on the K400 dataset. The model I used is MViT-B, i.e., k400_VIT_B_16x4_MAE_PT, which can be found here. Unfortunately, the reconstruction results are still poor; here are some examples:

[three example reconstructions omitted]

The input videos are chosen from the validation set of the original K400. However, the reconstruction results shown in the paper look good. [figure from the paper omitted]

Could you show some video reconstruction results during the pre-training stage? Many thanks.
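
One thing I still need to rule out on my side: if the pre-training objective uses per-patch normalized pixel targets (MAE norm_pix_loss-style), the raw predictions look washed out and colour-shifted unless they are de-normalized with the mean/std of the corresponding ground-truth patches before visualization. This is the step I am adding to my visualization code, a minimal sketch assuming pred and target are flattened patch tensors of shape [N, L, patch_dim] as in the MAE reference code:

import torch

def unnormalize_pred_patches(pred, target, eps=1e-6):
    # Reverse per-patch normalization: the training target was
    # (pixels - mean) / sqrt(var + eps), computed per patch over patch_dim.
    # pred, target: [N, L, patch_dim]; target holds the original pixels.
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    return pred * (var + eps).sqrt() + mean

Visualizing pred directly (or forgetting to also undo the dataset mean/std normalization of the input frames) can give dull, colour-shifted reconstructions like the ones above even when the model itself is fine.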

innat commented 1 year ago

Similar dummy outputs are reported in https://github.com/facebookresearch/SlowFast/issues/668