NazirNayal8 / RbA

Official code for RbA: Segmenting Unknown Regions Rejected by All (ICCV 2023)
https://kuis-ai.github.io/RbA/
MIT License

Discrepancy in Results and Evaluation Compared to Paper #12

Open IrammHamdard opened 4 months ago

IrammHamdard commented 4 months ago

Our goal was to reproduce the experiment from Section 4 of the paper (https://arxiv.org/pdf/2211.14293.pdf) and check whether we could match the evaluation results reported there, or those of the downloadable model on GitHub (Cityscapes Inlier Training, Swin-B (1 dec layer), https://github.com/NazirNayal8/RbA/blob/main/MODEL_ZOO.md). Unfortunately, we obtained significantly worse results than expected. I would like to find out whether I made any mistakes during training or evaluation, or why my results deviate so much from those in the paper.

First, I read the README in the repository (https://github.com/NazirNayal8/RbA/tree/main). The installation went smoothly, although I should note that I did not create a separate Conda environment. In preparing the datasets for training RbA (https://github.com/NazirNayal8/RbA/blob/main/datasets/README.md), I followed the instructions exactly, except that I created empty folders for mapillary_vistas and coco since I only need the Cityscapes dataset. We trained only the Swin-B (1 dec layer) model, using the configuration file listed in the MODEL ZOO (https://github.com/NazirNayal8/RbA/blob/main/MODEL_ZOO.md) and the pretrained weights for initialization.

Training was done on two NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of memory.

python train_net.py \
  --config-file configs/cityscapes/semantic-segmentation/swin/single_decoder_layer/maskformer2_swin_base_IN21k_384_bs16_90k_1dl.yaml \
  --num-gpus 2 \
  OUTPUT_DIR model_logs/swin_b_1dl/

We changed only two lines in the repository: first, we reduced IMS_PER_BATCH from 16 to 8 in Base-Cityscapes-SemanticSegmentation.yaml because of hardware (GPU memory) limitations (see also the note after the evaluation command below); second, we commented out the line bdd100k = BDD100KSeg(hparams=bdd100k_config, mode='val', transforms=transform, image_size=(720, 1280)) in support.py. Evaluation was performed as follows:

python3 evaluate_ood.py --out_path results_test/ --models_folder ckpts --datasets_folder datasets --model_mode all --dataset_mode all --store_anomaly_scores
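A side note on the only training-side change we made (IMS_PER_BATCH 16 -> 8): editing the YAML is not strictly necessary, since train_net.py accepts detectron2-style trailing key/value overrides (that is how OUTPUT_DIR is passed in the training command above), so the same change could presumably also be given as SOLVER.IMS_PER_BATCH 8 on the command line. A minimal sketch of what such an override amounts to, using the standard detectron2/yacs config API that the repo builds on:

from detectron2.config import get_cfg

# Base detectron2 config; the repo's train_net.py adds the Mask2Former-specific keys on top.
cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 8  # same effect as editing Base-Cityscapes-SemanticSegmentation.yaml
print(cfg.SOLVER.IMS_PER_BATCH)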

The evaluation itself runs without any errors. If you download the checkpoint (https://drive.google.com/file/d/13IJs_Kk1PMBVVxCN90HZZuuV1YcWZ0am/view?usp=sharing) linked on GitHub and evaluate it, you get the same results as in the paper, so something must have gone wrong during our training. Do you think it is because I reduced the batch size from 16 to 8? Some pictures of the evaluation:

The first three pictures show the results of the model downloaded from the repository:

(screenshots: evaluation metrics of the downloaded model)

These three pictures show the results of the model we trained:

(screenshots: evaluation metrics of the model we trained)

We can see that our model neither detects all anomalies nor masks them as cleanly as the model downloaded from the repository.

Orange-rightperson commented 4 months ago

I ran into the same problem.

IrammHamdard commented 4 months ago

"I tried training for over 90,000 iterations and also experimented with a smaller learning rate, but I didn't see any improvement in the results."

"I also attempted training with gradient accumulation, but it didn't lead to better results. The false positive rate at 95% recall (FPR95) is too high."

"Here are some results from training with gradient accumulation every second step and using a batch size of 8." image image image image

NazirNayal8 commented 4 months ago

Greetings @IrammHamdard,

Here are some notes regarding the training of the inlier model (without outlier supervision) that I hope will be helpful:

First, we noticed high sensitivity to the batch size. If you use a batch size of 8 with gradient accumulation to mimic 16, then try training for 180K iterations instead of 90K to get an effect similar to a batch size of 16 with 90K iterations.
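As a rough sketch of the corresponding solver settings, using the standard detectron2/yacs keys that the repo's configs extend (treat the exact values as something to tune rather than a verified recipe):

from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_list([
    "SOLVER.IMS_PER_BATCH", "8",    # what fits in GPU memory
    "SOLVER.MAX_ITER", "180000",    # 8 x 180k = 16 x 90k = 1.44M images seen in total
])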

In some of our experiments, and in other works that have used RbA, models trained on inlier data only, despite having high mIoU on Cityscapes, have shown high FPR95 on either Road Anomaly (as in the metrics you shared) or FS LaF. If you look at our MODEL ZOO, you can see that the model we trained with Swin-L reaches 71.79 FPR95 on FS LaF, which is quite high; after applying our proposed outlier supervision, the same model drops to 4.58 FPR95, as also shown in the MODEL ZOO. The conclusion is that an inlier model trained only on Cityscapes is not very reliable for OoD segmentation, since it can show this high-variance behavior on some metrics. The reliable and safe way to obtain good OoD performance is to apply the outlier supervision step, which gives more stable performance and has been shown by many experiments to perform well.
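For anyone reading along, FPR95 is the false positive rate measured at the score threshold where 95% of the OoD pixels are detected. A small numpy sketch of that standard definition (not necessarily identical to the implementation in evaluate_ood.py):

import numpy as np

def fpr_at_95_tpr(scores, labels):
    """scores: higher = more anomalous; labels: 1 for OoD pixels, 0 for inliers."""
    order = np.argsort(-scores)                       # sort by descending anomaly score
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels) / labels.sum()            # recall of OoD pixels at each threshold
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # fraction of inliers flagged at each threshold
    idx = min(np.searchsorted(tpr, 0.95), len(fpr) - 1)  # first threshold reaching 95% recall
    return fpr[idx]

# Toy check: all 3 OoD pixels are found only after 1 of the 2 inliers is flagged -> FPR95 = 0.5
print(fpr_at_95_tpr(np.array([0.9, 0.8, 0.7, 0.4, 0.3]), np.array([1, 1, 0, 1, 0])))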

I hope this helps. If you have any further questions, we would be glad to help.