facebookresearch / video-long-term-feature-banks

Long-Term Feature Banks for Detailed Video Understanding
Apache License 2.0

Unable to reproduce the results on Charades using 4 GPUs #22

Closed avijit9 closed 5 years ago

avijit9 commented 5 years ago

Hi,

I am trying to replicate the Resnet-50-baseline experiment on the Charades dataset. I'm using the following config -

DATASET: charades
DATADIR: /ssd_scratch/cvit/avijit/datasets/charades/Charades_v1_rgb

NUM_GPUS: 4
LOG_PERIOD: 10

MODEL:
  NUM_CLASSES: 157
  MODEL_NAME: resnet_video
  BN_MOMENTUM: 0.9
  BN_EPSILON: 1.0000001e-5
  ALLOW_INPLACE_SUM: True
  ALLOW_INPLACE_RELU: True
  ALLOW_INPLACE_RESHAPE: True
  MEMONGER: True

  BN_INIT_GAMMA: 0.0
  DEPTH: 50
  VIDEO_ARC_CHOICE: 2

  MULTI_LABEL: True
  USE_AFFINE: True

RESNETS:
  NUM_GROUPS: 1  # ResNet: 1x; ResNeXt: 32x
  WIDTH_PER_GROUP: 64  # ResNet: 64d; ResNeXt: 4d
  TRANS_FUNC: bottleneck_transformation_3d # bottleneck_transformation, basic_transformation

TRAIN:
  DATA_TYPE: train
  BATCH_SIZE:  8 #16
  EVAL_PERIOD: 4000
  JITTER_SCALES: [256, 320]

  COMPUTE_PRECISE_BN: False
  CROP_SIZE: 224

  VIDEO_LENGTH: 32
  SAMPLE_RATE: 4
  DROPOUT_RATE: 0.3
  PARAMS_FILE: pretrained_weights/r50_k400_pretrained.pkl
  DATASET_SIZE: 7811
  RESET_START_ITER: True

TEST:
  DATA_TYPE: val
  BATCH_SIZE: 4 #16
  CROP_SIZE: 256
  SCALE: 256

  VIDEO_LENGTH: 32
  SAMPLE_RATE: 4

  DATASET_SIZE: 1814

SOLVER:
  LR_POLICY: 'steps_with_relative_lrs' # 'step', 'steps_with_lrs', 'steps_with_relative_lrs', 'steps_with_decay'
  BASE_LR: 0.01
  #STEP_SIZES: [20000, 4000]
  STEP_SIZES: [20000, 4000, 20000, 4000]
  LRS: [1, 0.1, 0.1, 0.1]
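  # ('steps_with_relative_lrs': if I read the solver correctly, phase i runs for
  # STEP_SIZES[i] iterations at lr = BASE_LR * LRS[i].)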
  MAX_ITER: 48000

  WEIGHT_DECAY: 0.0000125
  WEIGHT_DECAY_BN: 0.0
  MOMENTUM: 0.9
  NESTEROV: True
  SCALE_MOMENTUM: True

CHECKPOINT:
  DIR: '.'
  CHECKPOINT_PERIOD: 4000
  CONVERT_MODEL: True

NONLOCAL:
  USE_ZERO_INIT_CONV: True
  USE_BN: False
  USE_AFFINE: True
  CONV3_NONLOCAL: True
  CONV4_NONLOCAL: True
  USE_SCALE: True

As you can see, I am using 4 GPUs, so I have halved both the batch size and the learning rate. But the highest mAP I get is ~36.0, whereas testing with your pre-trained model gives ~38 mAP. Could you please check my config file and suggest any necessary changes?

chaoyuaw commented 5 years ago

Hi @avijit9, one issue I can see is that the schedule should be STEP_SIZES: [40000, 8000] (following the linear scaling rule, https://arxiv.org/pdf/1706.02677.pdf).
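For concreteness, here is a sketch of how the SOLVER section above might look after that change. With half the batch size, each iteration sees half as many clips, so running each phase for twice as many iterations keeps the schedule the same in epochs; doubling MAX_ITER accordingly is my extrapolation, since the reply only spells out STEP_SIZES:

SOLVER:
  LR_POLICY: 'steps_with_relative_lrs'
  BASE_LR: 0.01    # already halved from the 8-GPU 0.02 to match the halved batch size
  STEP_SIZES: [40000, 8000, 40000, 8000]  # each phase doubled
  LRS: [1, 0.1, 0.1, 0.1]
  MAX_ITER: 96000  # doubled to match (assumption)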

avijit9 commented 5 years ago

@chaoyuaw Thanks for your prompt reply. I am going to run this experiment now with your suggestion and let you know the outcome.

Btw, thanks for this amazing repo. :)

chaoyuaw commented 5 years ago

Of course :) Let me know how it goes. Closing this now, but please feel free to reopen if you see other issues. Thanks!

avijit9 commented 5 years ago

It worked like a charm! Thanks again for your help.

avijit9 commented 5 years ago

I am a bit confused. What is the difference between "I3D" and "3D CNN" in Table 4 of the LFB paper? Both use R50-I3D-NL. Does "3D CNN" refer to your implementation, while the other comes from the paper you cited?

chaoyuaw commented 5 years ago

Glad to hear that it worked!

Yes, the term "3D CNN" describes the "meta-architecture": it refers to the design in Figure 3(a) and can use different "backbones".

In Table 4, "3D CNN with R50-I3D-NL" is a "3D CNN" design (Figure 3(a)) using "R50-I3D-NL" as backbone.

You're right that the only difference between "3D CNN" and "I3D-NL" in Table 4 is the implementation details of the backbone (see the Appendix for details of ours); nothing is fundamentally different. Sorry for the confusion!

avijit9 commented 5 years ago

Thanks a lot :)