YyzHarry / SubpopBench

[ICML 2023] Change is Hard: A Closer Look at Subpopulation Shift
https://subpopbench.csail.mit.edu
MIT License

Unable to reproduce results #8

Closed: kiranchari closed this issue 11 months ago

kiranchari commented 11 months ago

I installed the conda environment from the provided environment.yml. I am running Red Hat Linux.

Environment:
    Python: 3.9.7
    PyTorch: 1.13.0+cu117
    Torchvision: 0.14.0+cu117
    CUDA: 11.7
    CUDNN: 8500
    NumPy: 1.19.5
    PIL: 10.0.0
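
(For reference, a standalone snippet like the one below reproduces the same version dump; this is my own helper for comparing environments, not part of subpopbench.)

```python
import sys
import numpy
import PIL
import torch
import torchvision

# Print the same version information as the environment block above,
# so that setups can be compared line by line.
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("Torchvision:", torchvision.__version__)
print("CUDA:", torch.version.cuda)
print("CUDNN:", torch.backends.cudnn.version())
print("NumPy:", numpy.__version__)
print("PIL:", PIL.__version__)
```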

When I run ERM on CelebA without training attributes (CelebA_ERM_attrNo) using hparams_seed=0 and trial seeds {0, 1, 2}, I get the following results, which are quite different from those reported in the paper.

Total records: [93]

-------- Dataset: CelebA, model selection method: test set worst accuracy (oracle)
Algorithm     Avg           Worst         AvgPrec       WorstPrec     AvgF1         WorstF1       Adjusted      Balanced      AUROC         ECE
ERM           94.0 +/- 0.2  67.6 +/- 2.4  85.0 +/- 0.5  71.3 +/- 1.1  88.4 +/- 0.2  80.4 +/- 0.4  87.7 +/- 0.6  93.2 +/- 0.3  98.1 +/- 0.1  4.5 +/- 0.2

-------- Worst-case accuracy, model selection method: test set worst accuracy (oracle)
Algorithm     CelebA        Avg
ERM           67.6 +/- 2.4  67.6

-------- Dataset: CelebA, model selection method: validation set worst accuracy (with attributes)
Algorithm     Avg           Worst         AvgPrec       WorstPrec     AvgF1         WorstF1       Adjusted      Balanced      AUROC         ECE
ERM           94.0 +/- 0.2  67.6 +/- 2.4  85.0 +/- 0.5  71.3 +/- 1.1  88.4 +/- 0.2  80.4 +/- 0.4  87.7 +/- 0.6  93.2 +/- 0.3  98.1 +/- 0.1  4.5 +/- 0.2

-------- Worst-case accuracy, model selection method: validation set worst accuracy (with attributes)
Algorithm     CelebA        Avg
ERM           67.6 +/- 2.4  67.6

-------- Dataset: CelebA, model selection method: validation set worst accuracy (without attributes)
Algorithm     Avg           Worst         AvgPrec       WorstPrec     AvgF1         WorstF1       Adjusted      Balanced      AUROC         ECE
ERM           94.0 +/- 0.2  67.6 +/- 2.4  85.0 +/- 0.5  71.3 +/- 1.1  88.4 +/- 0.2  80.4 +/- 0.4  87.7 +/- 0.6  93.2 +/- 0.3  98.1 +/- 0.1  4.5 +/- 0.2

-------- Worst-case accuracy, model selection method: validation set worst accuracy (without attributes)
Algorithm     CelebA        Avg
ERM           67.6 +/- 2.4  67.6
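
(For anyone comparing numbers: "Avg" is overall test accuracy and "Worst" is worst-group accuracy, i.e., the minimum accuracy over the (label, attribute) subgroups. Below is a quick sanity-check helper I use; it is my own code, independent of subpopbench.)

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, group):
    """Minimum per-group accuracy, where `group` indexes the
    (label, attribute) subgroup of each example."""
    accs = []
    for g in np.unique(group):
        mask = group == g
        accs.append((y_pred[mask] == y_true[mask]).mean())
    return min(accs)

# Toy example: group 0 is predicted perfectly, group 1 only half right.
y_true = np.array([0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 0, 1])
group  = np.array([0, 0, 1, 1, 1, 1])
print(worst_group_accuracy(y_true, y_pred, group))  # 0.5
```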

I am also unable to reproduce Waterbirds_ERM_attrNo or the results on the other datasets. I did not modify the subpopbench code.
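
For completeness, here is roughly how I launched the runs (a sketch, not a verbatim log; the flag names and paths follow my reading of the README and should be treated as assumptions):

```python
# Launch-loop sketch: flag names and paths are assumptions, not a verbatim log.
import subprocess

for seed in (0, 1, 2):
    subprocess.run(
        ["python", "-m", "subpopbench.train",
         "--algorithm", "ERM",
         "--dataset", "CelebA",
         "--train_attr", "no",
         "--hparams_seed", "0",
         "--seed", str(seed),
         "--data_dir", "./data",        # placeholder paths
         "--output_dir", "./output"],
        check=True,
    )
```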

Appreciate any help in reproducing the results!

YyzHarry commented 11 months ago

Hi - thanks for your interest. It seems you are using only one hparams_seed, and that's why all the model selection methods give the same results.

In our implementation, we sweep over 16 different hparams_seed values, and each model selection method then picks the best run according to its own validation criterion. You might want to follow that setting (e.g., see here).
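
To make this concrete, here is a schematic sketch in plain Python (not the repo's actual code, and the metric values are made up): each selection method picks the best of the candidate runs by its own criterion, so with a single hparams_seed every method trivially returns the same run.

```python
# Schematic sketch (not the repo's code) of hyperparameter selection.
# Each model selection method picks the argmax over candidate runs by its
# own criterion; with only one candidate, all methods agree by construction.
runs = [
    {"hparams_seed": s,                  # one candidate per hparams_seed
     "val_avg_acc": 0.90 + 0.001 * s,    # hypothetical validation metrics
     "val_worst_acc": 0.60 + 0.002 * s}
    for s in range(16)
]

selection_methods = {
    "validation avg accuracy": lambda r: r["val_avg_acc"],
    "validation worst-group accuracy": lambda r: r["val_worst_acc"],
}

for name, criterion in selection_methods.items():
    best = max(runs, key=criterion)
    print(f"{name}: best hparams_seed = {best['hparams_seed']}")

# With runs = runs[:1] (a single hparams_seed), every method picks seed 0.
```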

kiranchari commented 11 months ago

Thanks for your reply @YyzHarry. I think it was just a coincidence that all model selection methods produced the same results above. Based on your answer in https://github.com/YyzHarry/SubpopBench/issues/4, the best hyperparameter setting is chosen first, and the different model selection strategies are applied only after training with the best hparams_seed. Is my understanding correct?

Could you please share the best hparams_seed for ERM and the other methods? It would be time-consuming to run all 16 hparams_seeds for every dataset.

YyzHarry commented 11 months ago

Unfortunately we didn't record the best hparams_seed for every algorithm.