facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Can't reproduce box AP result on X101-FPN (COCO object detection) #663

Closed: flysofast closed this issue 4 years ago

flysofast commented 4 years ago

I'm trying to reproduce the object detection result of the pretrained X101-FPN model from the model zoo on the COCO 2017 validation dataset. Below is the code I used:

from detectron2.utils.logger import setup_logger
setup_logger()

# import some common libraries
import numpy as np
import cv2
import random

# import some common detectron2 utilities
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
import matplotlib.pyplot as plt
import os
os.environ["PYTHONBREAKPOINT"]="pdb.set_trace"  # debugging convenience; not needed for evaluation

from detectron2.data.datasets import register_coco_instances
register_coco_instances("COCO2017_val", {}, "./datasets/COCO2017/annotations/instances_val2017.json", "./datasets/COCO2017/images/val2017")

cfg = get_cfg()
cfg.merge_from_file("./detectron2_repo/configs/COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
cfg.MODEL.WEIGHTS = "detectron2://COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x/139173657/model_final_68b088.pkl"
cfg.DATASETS.TEST = ("COCO2017_val", )
predictor = DefaultPredictor(cfg)

from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.data import build_detection_test_loader
evaluator = COCOEvaluator("COCO2017_val", cfg, False, output_dir="./output/")
val_loader = build_detection_test_loader(cfg, "COCO2017_val")

result = inference_on_dataset(predictor.model, val_loader, evaluator)
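
As a quick sanity check that the registered split really is the full val2017 set, the number of registered images can be inspected (a small sketch; it only assumes the register_coco_instances call above succeeded, and val2017 should contain 5000 images):

from detectron2.data import DatasetCatalog

# The registered split should list all 5000 images of COCO val2017;
# a different count would point to a dataset/registration problem.
dataset_dicts = DatasetCatalog.get("COCO2017_val")
print(len(dataset_dicts))  # expected: 5000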

And here is the result that I got:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.396
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.570
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.439
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.226
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.429
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.521
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.316
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.460
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.467
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.263
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.615
[01/09 18:39:57 d2.evaluation.coco_evaluation]: Evaluation results for bbox: 
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 39.623 | 56.987 | 43.868 | 22.578 | 42.944 | 52.106 |
[01/09 18:39:57 d2.evaluation.coco_evaluation]: Per-category bbox AP: 
| category      | AP     | category     | AP     | category       | AP     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 53.056 | bicycle      | 31.181 | car            | 42.991 |
| motorcycle    | 41.836 | airplane     | 63.354 | bus            | 64.272 |
| train         | 60.654 | truck        | 31.527 | boat           | 26.364 |
| traffic light | 25.807 | fire hydrant | 62.720 | stop sign      | 65.393 |
| parking meter | 44.912 | bench        | 24.497 | bird           | 36.350 |
| cat           | 66.329 | dog          | 60.659 | horse          | 57.241 |
| sheep         | 49.132 | cow          | 52.811 | elephant       | 61.127 |
| bear          | 69.952 | zebra        | 64.423 | giraffe        | 64.265 |
| backpack      | 14.692 | umbrella     | 36.562 | handbag        | 13.641 |
| tie           | 33.222 | suitcase     | 39.605 | frisbee        | 63.036 |
| skis          | 23.400 | snowboard    | 35.303 | sports ball    | 46.650 |
| kite          | 38.213 | baseball bat | 29.936 | baseball glove | 37.628 |
| skateboard    | 53.447 | surfboard    | 38.341 | tennis racket  | 48.211 |
| bottle        | 36.554 | wine glass   | 34.597 | cup            | 40.142 |
| fork          | 36.118 | knife        | 18.677 | spoon          | 18.632 |
| bowl          | 37.866 | banana       | 18.211 | apple          | 20.014 |
| sandwich      | 29.821 | orange       | 25.546 | broccoli       | 20.206 |
| carrot        | 18.979 | hot dog      | 28.861 | pizza          | 49.147 |
| donut         | 38.760 | cake         | 32.874 | chair          | 25.827 |
| couch         | 38.911 | potted plant | 24.897 | bed            | 38.132 |
| dining table  | 25.054 | toilet       | 55.674 | tv             | 54.790 |
| laptop        | 58.173 | mouse        | 60.294 | remote         | 31.397 |
| keyboard      | 49.381 | cell phone   | 35.194 | microwave      | 54.765 |
| oven          | 30.566 | toaster      | 30.495 | sink           | 34.497 |
| refrigerator  | 53.195 | book         | 10.768 | clock          | 47.565 |
| vase          | 34.072 | scissors     | 26.699 | teddy bear     | 43.400 |
| hair drier    | 5.248  | toothbrush   | 23.115 |                |        |

If I read this result correctly, the model only got 39.6 box AP, rather than the 43.0 reported on the MODEL ZOO page. The 2017 validation dataset was downloaded from the COCO homepage. I couldn't find any documentation on the SCORE_THRESH_TEST config, so I left it at the default (0.5).

My environment setup:

sys.platform              linux
Python                    3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Numpy                     1.17.4
Detectron2 Compiler       GCC 5.4
Detectron2 CUDA Compiler  10.1
DETECTRON2_ENV_MODULE     <not set>
PyTorch                   1.3.1
PyTorch Debug Build       False
torchvision               0.4.2
CUDA available            True
GPU 0                     GeForce GTX 1080
CUDA_HOME                 /usr/local/cuda
NVCC                      Cuda compilation tools, release 10.1, V10.1.243
Pillow                    6.2.1
cv2                       4.1.2
------------------------  --------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

Please tell me if I'm missing something :) Thank you.

ppwwyyxx commented 4 years ago

You modified the config with cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5, so you certainly won't reproduce the same number.
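
For reference, SCORE_THRESH_TEST discards detections whose confidence falls below the threshold before they reach the evaluator, so raising it to 0.5 removes the low-confidence detections that COCO AP needs to trace the full precision/recall curve. Below is a minimal sketch of the same evaluation with the threshold left at the base-config default (0.05 in a stock detectron2 checkout); it reuses the COCO2017_val registration and paths from the report above:

from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

cfg = get_cfg()
cfg.merge_from_file("./detectron2_repo/configs/COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.MODEL.WEIGHTS = "detectron2://COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x/139173657/model_final_68b088.pkl"
cfg.DATASETS.TEST = ("COCO2017_val",)
# No SCORE_THRESH_TEST override: the config default (0.05) keeps low-confidence
# detections so the evaluator can compute AP over the full precision/recall curve.
predictor = DefaultPredictor(cfg)

evaluator = COCOEvaluator("COCO2017_val", cfg, False, output_dir="./output/")
val_loader = build_detection_test_loader(cfg, "COCO2017_val")
result = inference_on_dataset(predictor.model, val_loader, evaluator)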