Hi @HuYanchen-hub, thanks for your interest.
Did you set task=instance when running the evaluation script? The numbers you shared seem to correspond to task=panoptic. We mention this in the instructions here.
I ran the evaluation on an A100 myself now and obtained the following results for the DiNAT-L backbone:
#### DiNAT-L OneFormer
```
[07/05 05:47:28 d2.evaluation.testing]: copypaste: Task: segm
[07/05 05:47:28 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[07/05 05:47:28 d2.evaluation.testing]: copypaste: 49.2071,73.8117,53.6113,29.4197,53.7316,70.9744
```
You might experience a variance of 0.1-0.2 units when running evaluations on different machines (I remember noticing something like that while experimenting).
Thanks for your reply. When I set task=instance, I got the correct result. But when I use the Swin-L backbone to evaluate the semantic segmentation task with task=semantic, the mIoU I get is lower than with task=panoptic. Are the mIoU results you reported the higher of the two, or is the difference due to variance across machines?
```
copypaste: Task: sem_seg
[07/05 21:06:35 d2.evaluation.testing]: copypaste: mIoU,fwIoU,mACC,pACC
[07/05 21:06:35 d2.evaluation.testing]: copypaste: 67.2288,72.4984,78.5884,82.9312
```
And when I evaluate the DiNAT-L backbone with task=panoptic, my PQ_st result also differs from yours by 0.1.
```
Task: panoptic_seg
[07/05 21:14:12 d2.evaluation.testing]: copypaste: PQ,SQ,RQ,PQ_th,SQ_th,RQ_th,PQ_st,SQ_st,RQ_st
[07/05 21:14:12 d2.evaluation.testing]: copypaste: 57.9436,83.7602,68.4097,64.3089,84.9244,75.2713,48.3356,82.0030,58.0525
```
OneFormer is very good work, and we want to support this algorithm in the open-source object detection toolbox mmdetection, so we need to understand more of the experimental details. Thank you for your help.
Hi @HuYanchen-hub, thanks for working on adding support for OneFormer to mmdetection!
For each metric, we report the score obtained with the corresponding task setting. So, we report mIoU with task=semantic.
I believe the difference you notice is within the variance range both for mIoU and PQ_st.
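A minimal sketch of that metric-to-task correspondence, as described in this thread (the task names are assumed to match the values accepted by MODEL.TEST.TASK):

```python
# Which task setting yields each headline metric, per the maintainer's note.
# Task names are assumed; verify against the repo's MODEL.TEST.TASK values.
METRIC_TO_TASK = {
    "PQ": "panoptic",    # panoptic quality is reported from task=panoptic
    "AP": "instance",    # instance mask AP is reported from task=instance
    "mIoU": "semantic",  # mean IoU is reported from task=semantic
}
```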
Thanks for your reply!
Thank you very much, got it!
When I ran inference with the COCO pre-trained models you provide, I found that the instance segmentation accuracy on COCO consistently differs by 0.2 AP. The following are my experimental results (bold values are my reproduced numbers; the row above each is the reported result).
| Method | Backbone | PQ | PQ_th | PQ_st | AP | mIoU | #Params | Config | Checkpoint |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| OneFormer | Swin-L† | 57.9 | 64.4 | 48.0 | 49.0 | 67.4 | 219M | [config](https://github.com/SHI-Labs/OneFormer/blob/main/configs/coco/swin/oneformer_swin_large_bs16_100ep.yaml) | [model](https://shi-labs.com/projects/oneformer/coco/150_16_swin_l_oneformer_coco_100ep.pth) |
| | | 57.9 | 64.4 | 48.0 | **48.8** | 67.4 | | | |
| OneFormer | DiNAT-L† | 58.0 | 64.3 | 48.4 | 49.2 | 68.1 | 223M | [config](https://github.com/SHI-Labs/OneFormer/blob/main/configs/coco/dinat/oneformer_dinat_large_bs16_100ep.yaml) | [model](https://shi-labs.com/projects/oneformer/coco/150_16_dinat_l_oneformer_coco_100ep.pth) |
| | | 58.0 | 64.3 | **48.3** | **49.0** | 68.1 | | | |

The following is my experimental environment.