amazon-science / long-tailed-ood-detection

Official implementation for "Partial and Asymmetric Contrastive Learning for Out-of-Distribution Detection in Long-Tailed Recognition" (ICML'22 Long Presentation)
https://proceedings.mlr.press/v162/wang22aq/wang22aq.pdf
Apache License 2.0

Results Reproducing #3

Open liudakai2 opened 1 year ago

liudakai2 commented 1 year ago

Hi, thanks for your great work. However, I have recently failed to reproduce the results in your paper (e.g., the 77.08 ACC on CIFAR10-LT) using the provided test command and pretrained model. I got:

ACC: 0.7417 (0.9420, 0.7278, 0.5600)

Any suggestions?

htwang14 commented 1 year ago

Thank you for your interest in our work and sorry for the late reply.

Did you use the stage 1 checkpoint? The stage 1 checkpoint is for anomaly detection only (AUROC, AUPR, FPR); the stage 2 checkpoint is for in-distribution classification (ACC).

HDNU commented 1 year ago

Hi, the stage 2 model provided in Google Drive (https://drive.google.com/drive/folders/1Z2VkeY8e6XIyEu995bSJBV628MXBH5ZZ?usp=sharing) raises a KeyError when used in test.py. Could you please help?

zzuy commented 1 year ago

Hello, thanks for your great work. I have the same question about reproducing the 77.08 ACC on CIFAR10-LT.

I used the stage 2 checkpoint provided in Google Drive and ran the test command from the README.md, but I got ACC: 0.7417. How can we achieve the results reported in the paper?

shuaiNJU commented 1 year ago

Hi @zzuy, have you achieved the 77.08 ACC on CIFAR10-LT reported in the paper? Thanks a lot!

htwang14 commented 1 year ago

Dear all,

@FengShuai-bupt @HDNU @zzuy @liudakai2

Sorry for the late reply. I had quite a few things piled up over the last few months, and I finally have some time to tackle this issue. :)

The stage 2 CIFAR10 model I uploaded should be tested with an older version of the test code, which I have provided as the val_cifar_only_id function in test.py. Pull the latest repo and use the following command to get the ~77.08 ACC on CIFAR10-LT reported in the paper. (The uploaded model has 77.63 accuracy; the reported 77.08 is averaged over several runs.)

for dout in texture svhn cifar tin lsun places365
do
python test.py --gpu 0 --ds cifar10 --dout $dout \
    --drp <where_you_store_all_your_datasets> \
    --ckpt_path <where_you_save_the_ckpt> \
    --only_id
done

The old version of the code stored the main branch (i.e., the anomaly detection branch used in stage 1) and the auxiliary branch (i.e., the ID classification branch used in stage 2) separately.

The new version of the code instead updates the BN and FC layers in place during the second stage.
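
For illustration only, here is a minimal, hypothetical sketch of what updating "BN and FC in place" could look like in PyTorch. This is an assumption about the approach, not the repo's actual stage2.py code, and the backbone and hyperparameters are stand-ins:

import torch
import torch.nn as nn
import torchvision

# Stand-in backbone; the repo uses its own ResNet18 implementation.
model = torchvision.models.resnet18(num_classes=10)

def bn_fc_params(model: nn.Module):
    # Yield only BatchNorm and Linear (FC) parameters. Passing only these
    # to the optimizer means stage 2 overwrites them in place, while all
    # other backbone parameters receive no optimizer step.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm2d, nn.Linear)):
            yield from m.parameters()

# lr/weight_decay here are illustrative, not confirmed values.
optimizer = torch.optim.Adam(bn_fc_params(model), lr=5e-4, weight_decay=5e-4)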

So for a new-version model ckpt, the same test script can be used for both stage 1 and stage 2 models.

But for an old-version model ckpt (i.e., the CIFAR10 model I uploaded), slightly different test code is needed for stages 1 and 2. Specifically, we use the forward function for inference in stage 1 and forward_aux_classifier in stage 2.
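
A minimal sketch of how that dispatch might look at test time (assuming forward_aux_classifier takes an image batch and returns class logits; the actual test.py may differ):

import torch

@torch.no_grad()
def predict(model, images, only_id=False):
    # only_id mirrors the --only_id flag used in the commands above.
    model.eval()
    if only_id:
        # Old-format stage 2 ckpt: classify with the auxiliary ID branch.
        logits = model.forward_aux_classifier(images)
    else:
        # Stage 1 ckpt, or new-format ckpt with BN/FC updated in place:
        # the default forward pass serves both stages.
        logits = model(images)
    return logits.argmax(dim=1)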

So the new command I give here is only meant for the CIFAR10 stage 2 model I uploaded to Google Drive. Models obtained with the training script, as well as the CIFAR100 and ImageNet models I uploaded, shouldn't have this issue, since they were trained with the latest code version.

Please let me know if you have any further questions.

Thank you again for your interest in this work!

Haotao

htwang14 commented 1 year ago

Sorry, I no longer have permission to push to this repo. I have made a pull request with all the necessary code updates and will ask my manager to merge it tomorrow. :)

htwang14 commented 1 year ago

@HDNU @FengShuai-bupt @liudakai2 @zzuy

The update has been pushed. Thanks!

shuaiNJU commented 1 year ago

Hi, my understanding is that the same command is used for both OOD detection and ID classification testing:

for dout in texture svhn cifar tin lsun places365
do
python test.py --gpu 0 --ds cifar10 --dout $dout \
    --drp <where_you_store_all_your_datasets> \
    --ckpt_path <where_you_save_the_ckpt>
done

The only difference lies in the ckpt_path used, right?

htwang14 commented 1 year ago

@FengShuai-bupt

For me, the ckpts of the 1st and 2nd stages are saved as cifar10/stage1/latest.pth and cifar10/stage1/stage2/latest.pth, respectively.

To test stage 1 results (i.e., the OOD detection performance), use

for dout in texture svhn cifar tin lsun places365
do
python test.py --gpu 0 --ds cifar10 --dout $dout \
    --drp <where_you_store_all_your_datasets> \
    --ckpt_path cifar10/stage1/latest.pth
done

To test stage 2 results (i.e., the ID classification performance), use

for dout in texture svhn cifar tin lsun places365
do
python test.py --gpu 0 --ds cifar10 --dout $dout \
    --drp <where_you_store_all_your_datasets> \
    --ckpt_path cifar10/stage1/stage2/latest.pth 
done

or

for dout in texture svhn cifar tin lsun places365
do
python test.py --gpu 0 --ds cifar10 --dout $dout \
    --drp <where_you_store_all_your_datasets> \
    --ckpt_path cifar10/stage1/stage2/latest.pth \
    --only_id
done

if and only if you use the CIFAR10 ckpt I uploaded to Google Drive. The --only_id argument is only needed in this special case, for compatibility with the older-version model.

I know this older-version CIFAR10 ckpt is causing some confusion. I may replace it with the new version when I get time. :) But you can also run the training code yourself on CIFAR10 to get a new-version ckpt.

shuaiNJU commented 1 year ago

Thanks! I ran the training code myself on CIFAR10-LT to get a new-version ckpt. However, taking texture as an example, the stage 1 test results (i.e., the OOD detection performance) are:

===texture===
ACC: test_acc:0.7634 (many_acc:0.9290, median_acc:0.7312, low_acc:0.6407)
auroc: 0.9249, aupr: 0.8212, fpr95: 0.2440
ACC@FPR0.0:   acc_at_fpr_level:0.7634 (many_acc:0.9290, median_acc:0.7312, low_acc:0.6407 | head_acc:0.8634, tail_acc:0.6634)
ACC@FPR0.001: acc_at_fpr_level:0.7639 (many_acc:0.9293, median_acc:0.7313, low_acc:0.6416 | head_acc:0.8634, tail_acc:0.6634)
ACC@FPR0.01:  acc_at_fpr_level:0.7691 (many_acc:0.9322, median_acc:0.7355, low_acc:0.6487 | head_acc:0.8634, tail_acc:0.6634)
ACC@FPR0.1:   acc_at_fpr_level:0.8079 (many_acc:0.9471, median_acc:0.7783, low_acc:0.6879 | head_acc:0.8634, tail_acc:0.6634)

The stage 2 test results (i.e., the ID classification performance) are:

===texture===
ACC@FPR0.0:   overall_acc:0.7381 (many_acc:0.9283, median_acc:0.7195, low_acc:0.5727 | head_acc:0.8578, tail_acc:0.6184)
ACC@FPR0.001: overall_acc:0.7384 (many_acc:0.9287, median_acc:0.7194, low_acc:0.5732 | head_acc:0.8578, tail_acc:0.6184)
ACC@FPR0.01:  overall_acc:0.7423 (many_acc:0.9316, median_acc:0.7218, low_acc:0.5788 | head_acc:0.8578, tail_acc:0.6184)
ACC@FPR0.1:   overall_acc:0.7914 (many_acc:0.9470, median_acc:0.7637, low_acc:0.6536 | head_acc:0.8578, tail_acc:0.6184)

I don't understand why the stage 2 ID classification performance is lower than the stage 1 results. Any suggestions?

htwang14 commented 1 year ago

Can you share the commands you used for the stage 2 training and testing? I'll look into it. In the meantime, you can try the old-version model if you just want to run a quick test with the original model.

shuaiNJU commented 1 year ago

Thanks! The stage 2 training:

python stage2.py --gpu 0 --ds cifar10 \
    --drp \
    --pretrained_exp_str

The stage 2 testing:

for dout in texture svhn cifar tin lsun places365
do
python test.py --gpu 0 --ds cifar10 --dout $dout \
    --drp \
    --ckpt_path <Results/LT_OOD_results/PASCL/cifar10-0.01-OOD30000/ResNet18/e200-b256-adam-lr0.001-wd0.0005-cos_Lambda0.5-Lambda20.1-T0.07-sign-k0.4-trial2/finetune_both_e10-b128-adam-lr0.0005-wd0.0005-cos_LA-tau1/> \
    --only_id
done

Moreover, I have run the training code myself on CIFAR10-LT several times to get multiple new-version ckpts, but the stage 2 ID classification performance is still lower than the stage 1 results. I don't know why this happens. Could you give me some advice? Thank you very much!

nhewadehigah commented 1 year ago

Hi, I am still getting a KeyError when loading stage2.pth with the command below. Do you know why?

for dout in texture svhn cifar tin lsun places365
do
python test.py --gpu 0 --ds cifar10 --dout $dout \
    --drp \
    --ckpt_path cifar10/stage1/stage2/latest.pth
done
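
For anyone debugging this, a generic PyTorch snippet (not specific to this repo) to inspect which top-level keys the checkpoint file actually contains; a KeyError at load time usually means the test script expects a different wrapping key:

import torch

# Load on CPU and print the checkpoint structure. Checkpoints are often
# dicts that nest the weights under a key such as 'model' or 'state_dict'.
ckpt = torch.load("cifar10/stage1/stage2/latest.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))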