SHI-Labs / OneFormer

OneFormer: One Transformer to Rule Universal Image Segmentation, arXiv 2022 / CVPR 2023
https://praeclarumjj3.github.io/oneformer
MIT License

Cannot Reproduce Results (Concerning Discrepancies !) #26

Closed · achen46 closed this issue 1 year ago

achen46 commented 1 year ago

Thanks for this work. I would like to ask if you could please share the logs for the Swin-L backbone on ADE20K (640×640). I tried it and got numbers similar to #14, and I wonder what the issue is.

Specifically, you reported PQ 49.8, AP 35.9, and mIoU 57.0, hence I would like to see the logs to understand the issue.

It would also be great if you could please share the logs for the Swin-L backbone on the Cityscapes dataset.

P.S.: Now that this work has been accepted to CVPR, it is crucial to maintain reproducibility.

achen46 commented 1 year ago

The numbers I get after running with the Swin-L backbone and a crop size of 640 × 640:

| mIoU | fwIoU | mACC | pACC |
|---|---|---|---|
| 50.8082 | 72.8344 | 64.6854 | 83.2033 |

| PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st |
|---|---|---|---|---|---|---|---|---|
| 45.6353 | 81.1724 | 54.4800 | 45.3905 | 82.1440 | 54.6043 | 46.1248 | 79.2292 | 54.2314 |

| Task | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|
| segm | 31.3344 | 49.2562 | 32.7532 | 12.5777 | 34.9892 | 49.0925 |
| bbox | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
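(For context, metric blocks in this shape are what detectron2's standard evaluators print. A minimal sketch of the kind of evaluator stack involved is below; the dataset name and output directory are placeholders, not this repo's exact configuration.)

```python
# Sketch of an evaluator stack that produces tables like the ones above.
# Dataset name and output directory are placeholders, not OneFormer's exact setup.
from detectron2.evaluation import (
    DatasetEvaluators,
    SemSegEvaluator,        # mIoU, fwIoU, mACC, pACC
    COCOPanopticEvaluator,  # PQ, SQ, RQ and their thing/stuff splits
    COCOEvaluator,          # AP, AP50, AP75, APs, APm, APl (segm/bbox)
)

dataset_name = "ade20k_panoptic_val"   # placeholder dataset name
output_dir = "./eval_output"

evaluator = DatasetEvaluators([
    SemSegEvaluator(dataset_name, output_dir=output_dir),
    COCOPanopticEvaluator(dataset_name, output_dir=output_dir),
    COCOEvaluator(dataset_name, output_dir=output_dir),
])
```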

In this repo and the paper, I see that you report the following numbers:

| PQ | AP | mIoU |
|---|---|---|
| 49.8 | 35.9 | 57.0 |

My question is: how can one close the gap with the reported numbers? Basically, is the mIoU reported at the end of training the same as the s.s. mIoU listed on GitHub?

I hope the authors take the issue of reproducibility seriously. I would like to ask for the release of the logs for this experiment as well as for Cityscapes with the Swin-L backbone.

P.S.: My environment and all dependencies exactly follow what you recommended.

praeclarumjj3 commented 1 year ago

Hi @achen46, thank you for your interest in our work.

Please share your logs and exact details on your environment (GPU architecture and model, CUDA toolkit version, PyTorch, Torchvision, Detectron2, and NATTEN versions + their compiled CUDA versions) so we can help you. That is the first piece of information any issue on an open-source repository requires. Simply stating that "it does not work despite exactly following the instructions" does not help.
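For reference, detectron2 bundles an environment-collection helper that prints most of these details; a minimal sketch for saving the report to a file (assuming detectron2 is importable) is:

```python
# Minimal sketch: save detectron2's environment report so it can be attached to the issue.
# Assumes detectron2 is installed; NATTEN is reported separately since it is optional.
from detectron2.utils.collect_env import collect_env_info

with open("env_report.txt", "w") as f:
    f.write(collect_env_info())

try:
    import natten
    print("NATTEN:", natten.__version__)
except ImportError:
    print("NATTEN: not installed")
```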

We ran an experiment with a fresh clone of the same code (this GitHub repo) that you are having issues with, and we got the following numbers: PQ: 50.5, AP: 36.2, mIoU (s.s./m.s.): 56.6/57.6 (trained yesterday, on 03/05/2023). These results are better than the numbers reported in our CVPR paper, PQ: 49.8, AP: 35.9, mIoU (s.s./m.s.): 57.0/57.7 (trained 7 months ago, on 08/14/2022), for which we ran the experiment only three times and reported the best number.

You can find the WandB logs for the original and reproduced runs here: WandB logs. We also share the training log with step-wise loss values and our environment setup details to help with your experiments.

achen46 commented 1 year ago

Thanks for providing the logs. Regarding the questions you raised, my environment is as follows:

```
sys.platform            linux
Python                  3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
numpy                   1.22.2
detectron2              0.6 @/opt/conda/lib/python3.8/site-packages/detectron2
Compiler                GCC 9.4
CUDA compiler           not available
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.13.0a0+d0d6b1f @/opt/conda/lib/python3.8/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1,2,3,4,5,6,7     NVIDIA A100-SXM4-40GB (arch=8.0)
Driver version          515.65.01
CUDA_HOME               /usr/local/cuda
TORCH_CUDA_ARCH_LIST    5.2 6.0 6.1 7.0 7.5 8.0 8.6 9.0+PTX
Pillow                  9.0.1
torchvision             0.14.0a0 @/opt/conda/lib/python3.8/site-packages/torchvision
torchvision arch flags  5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 9.0
fvcore                  0.1.5.post20221221
iopath                  0.1.9
cv2                     3.4.11
```

There are discrepancies between my environment and the versions listed in this repository. Also, I did not install NATTEN, since my experiments only concerned Swin (I commented out DiNAT, etc.).
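As an aside, a guarded import would probably be a less invasive way than commenting code out to run Swin-only experiments without NATTEN. A rough sketch (illustrative only, not this repo's actual module layout):

```python
# Sketch: make NATTEN optional so Swin-only runs work without it.
# Function name and structure are illustrative, not the repo's actual layout.
try:
    import natten  # only needed for the DiNAT backbone
    HAS_NATTEN = True
except ImportError:
    natten = None
    HAS_NATTEN = False

def build_dinat_backbone(cfg):
    """Hypothetical helper: fail loudly only when DiNAT is actually requested."""
    if not HAS_NATTEN:
        raise RuntimeError(
            "NATTEN is not installed; install a build matching your PyTorch/CUDA "
            "or switch to a Swin backbone."
        )
    # ... build and return the DiNAT backbone here ...
```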

I would like to ask whether you might be able to reproduce your paper results with newer Detectron2 and PyTorch versions -- especially Detectron2.

achen46 commented 1 year ago

This is concerning because at least two independent users cannot reproduce the results or achieve similar benchmarks. I don't believe a resolution was even reached in #14; it was simply closed due to user inactivity.

From another angle, your paper results should not depend solely on a particular environment (especially one quite a bit older than current Detectron2 and PyTorch versions), and making them reproducible with newer versions is critical. Again, we are not talking about getting exact numbers, but rather numbers close to the reported ones (e.g., 50 mIoU vs. 57 is a huge discrepancy).

praeclarumjj3 commented 1 year ago

Hi @achen46, if you (or the other user who did not follow up) can provide the training log for your results (PQ: 45.6, AP: 31.3, mIoU: 50.8), I will help take a look. It seems something must have gone wrong on your end to get these numbers; hundreds of people have used our code, and your reported case is rare.

Also, you did not follow our repo's instructions, and we are already using the latest existing versions. Firstly, we are using Detectron2, not Detectron. We use Detectron2 v0.6, which is still the latest official release (since Nov 2021). Moreover, Detectron2 v0.6 officially supports only up to PyTorch 1.10.1, so it is only sensible to use the compatible versions of the packages together. There is still an open PR adding support for newer PyTorch versions to Detectron2. I plan to upgrade our repo to PyTorch 2.0 (released two weeks ago) when Detectron2 is ready.
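For anyone following along, a quick runtime check against these version constraints might look like the sketch below; the pins come from this discussion rather than an official requirements file:

```python
# Sketch: check the environment against the versions discussed in this thread
# (detectron2 0.6 officially targets PyTorch <= 1.10.1). The pins are assumptions
# taken from this discussion, not an official requirements file.
import detectron2
import torch
from packaging import version

torch_version = version.parse(torch.__version__.split("+")[0])
assert torch_version <= version.parse("1.10.1"), (
    f"detectron2 0.6 officially supports PyTorch <= 1.10.1, found {torch.__version__}"
)
assert detectron2.__version__.startswith("0.6"), (
    f"expected detectron2 0.6, found {detectron2.__version__}"
)
print("Environment matches the versions discussed above.")
```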

I will close this issue, but feel free to reopen it with your logs if you still have problems.