DearCaat / MHIM-MIL

[ICCV 2023 Oral] Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

Unable to reproduce the results on the Camelyon16 dataset. #20

Open · Henry0394 opened this issue 3 months ago

Henry0394 commented 3 months ago

I'm attempting to reproduce the results on the Camelyon16 dataset, but I'm obtaining an accuracy rate of around 78%-80%, which is approximately 10 percent lower than the reported accuracy in the paper. Could there be some critical details I've overlooked? Or could there potentially be a bug in the code?

DearCaat commented 3 months ago

It would be helpful to first reproduce the results of classic ABMIL to check your environment and training scripts.
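
For reference, here is a minimal sketch of classic gated-attention ABMIL pooling (Ilse et al., 2018), assuming PyTorch and pre-extracted patch features; the class name and dimensions are illustrative and are not the implementation used in this repository:

```python
# Minimal gated-attention ABMIL sketch for a sanity-check run.
# Assumes one bag of pre-extracted patch features of shape [num_instances, feat_dim].
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256, num_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                                  # x: [num_instances, feat_dim]
        a = self.attn_w(self.attn_v(x) * self.attn_u(x))   # unnormalized attention scores
        a = torch.softmax(a, dim=0)                        # normalize over instances
        bag = (a * x).sum(dim=0)                           # attention-weighted bag embedding
        return self.classifier(bag), a
```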

Henry0394 commented 3 months ago

> It would be helpful to first reproduce the results of classic ABMIL to check your environment and training scripts.

Thanks. I did try that, but the result for ABMIL is still under 80%, which is really strange.

DearCaat commented 3 months ago

That would require double-checking the dataset, the experimental environment, and the training code.

Henry0394 commented 3 months ago

Thanks, but I'm not the only one who is unable to reproduce the results; several people I know have run into similar issues. Are there any tricks not mentioned in the code, or could there be bugs in the codebase?

DearCaat commented 3 months ago

I wish I had some secret tricks to share, but the Docker image and code I provided are exactly what I used. The code in this repository has been refactored, so it may not reproduce the paper's results exactly; a deviation of ±1-2% is possible. However, a deviation of more than 10%, or AB-MIL failing to produce valid results, would be very strange. If you are using the data, Docker image, and code I provided and still see a deviation of over 10%, could you please share the training logs? I will do my best to help resolve the issue.
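
As a generic sanity check, one way to rule out run-to-run variance is to pin every source of randomness before training. This is a minimal sketch assuming PyTorch, not this repository's own setup code:

```python
# Generic reproducibility check: fix all common RNG sources before training.
import random
import numpy as np
import torch

def set_seed(seed: int = 2021):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```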

BUPT-BownZ commented 4 days ago

Hello, I've run into the same issue. Three months ago I reproduced this paper, and at that time the evaluation metrics for ABMIL roughly matched those reported in the paper. However, when I run the program again now, the metrics for ABMIL-MHIM on the C16 dataset cannot reach 80% accuracy either. I haven't changed any code in the meantime. Could you please help me understand what might have caused this?

BUPT-BownZ commented 4 days ago

The same issue occurs when running ABMIL-MHIM on the TCGA dataset.

DearCaat commented 4 days ago

> Hello, I've run into the same issue. Three months ago I reproduced this paper, and at that time the evaluation metrics for ABMIL roughly matched those reported in the paper. However, when I run the program again now, the metrics for ABMIL-MHIM on the C16 dataset cannot reach 80% accuracy either. I haven't changed any code in the meantime. Could you please help me understand what might have caused this?

If you are using the data, Docker, and code I provided and still see a deviation of over 10%, could you please share the training command, logs and initialization weights for the teacher model? I will do my best to help resolve the issue.
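
As a quick sanity check on the teacher initialization, it may also help to inspect the checkpoint before training. A minimal sketch assuming PyTorch, with a placeholder path and illustrative key handling, since the exact format depends on how the checkpoint was saved:

```python
# Inspect a saved teacher checkpoint: list parameter names and shapes.
import torch

ckpt = torch.load("path/to/fold_2_model_best_auc.pt", map_location="cpu")
# Some checkpoints wrap the state dict under a key such as "model"; adjust as needed.
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for name, value in state.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```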

DearCaat commented 4 days ago

> The same issue occurs when running ABMIL-MHIM on the TCGA dataset.

Is the AUC of ABMIL on TCGA also under 80%?

BUPT-BownZ commented 4 days ago

Yes, the performance on the TCGA dataset is also poor. The results in this screenshot are from running the code today. [image] The screenshot below shows the results I got three months ago when running ABMIL-MHIM on the TCGA data. [image] For the teacher model, I am still using the baseline model I trained three months ago, fold_2_model_best_auc.pt. The exact command I used for ABMIL-MHIM is:

python3 main.py \
  --project=tcga_lung_mhim2 \
  --dataset_root=./datasets/tcga \
  --model_path=./output \
  --cv_fold=4 \
  --val_ratio=0.13 \
  --teacher_init=./output/tcga_lung_baseline/abmil/fold_2_model_best_auc.pt \
  --title="abmil_105_mr70l20h2-0_mmcos_is" \
  --baseline=attn \
  --num_workers=0 \
  --cl_alpha=0.5 \
  --mask_ratio_h=0.02 \
  --mrh_sche \
  --mm_sche \
  --init_stu_type=fc \
  --mask_ratio=0.7 \
  --mask_ratio_l=0.2 \
  --seed=2021 \
  --num_workers=0 \
  --datasets=tcga

DearCaat commented 3 days ago

> Yes, the performance on the TCGA dataset is also poor. The results in this screenshot are from running the code today. [image] The screenshot below shows the results I got three months ago when running ABMIL-MHIM on the TCGA data. [image] For the teacher model, I am still using the baseline model I trained three months ago, fold_2_model_best_auc.pt. The exact command I used for ABMIL-MHIM is: python3 main.py --project=tcga_lung_mhim2 --dataset_root=./datasets/tcga --model_path=./output --cv_fold=4 --val_ratio=0.13 --teacher_init=./output/tcga_lung_baseline/abmil/fold_2_model_best_auc.pt --title="abmil_105_mr70l20h2-0_mmcos_is" --baseline=attn --num_workers=0 --cl_alpha=0.5 --mask_ratio_h=0.02 --mrh_sche --mm_sche --init_stu_type=fc --mask_ratio=0.7 --mask_ratio_l=0.2 --seed=2021 --num_workers=0 --datasets=tcga